CN116149917A

CN116149917A - Method and apparatus for evaluating processor performance, computing device, and readable storage medium

Info

Publication number: CN116149917A
Application number: CN202310194425.XA
Authority: CN
Inventors: 吴楠; 李�根; 唐遇星; 杨耀; 余志拥
Original assignee: Phytium Technology Co Ltd
Current assignee: Phytium Technology Co Ltd
Priority date: 2023-02-28
Filing date: 2023-02-28
Publication date: 2023-05-23

Abstract

The application provides a method and a device for evaluating performance of a processor, a computing device and a readable storage medium, wherein the method comprises the following steps: acquiring execution time of a plurality of program segments in a benchmark test program; and calling a simulation point tool, and evaluating the performance of the processor according to the execution time of the program segments. In the embodiment of the application, the execution time of a plurality of program segments is considered in the process of evaluating the performance of the processor by adopting the simulation point tool, so that the error of evaluating the performance of the processor by adopting the simulation point tool is reduced.

Description

Method and apparatus for evaluating processor performance, computing device, and readable storage medium

Technical Field

The present application relates to the field of computer technology, and more particularly, to a method and apparatus for evaluating performance of a processor, a computing device, and a readable storage medium.

Background

The simulation point (SimPoint) tool exploits the fact that a program exhibits repetitive, time-varying behaviors during execution. By identifying these repetitive behaviors and taking one sample as the representative of each, it reduces the execution time of the benchmark test program, thereby addressing the time-consuming repeated execution of benchmark test programs during processor performance evaluation.

However, in practice, a large error is found between the execution time of the benchmark test program obtained with the simulation point tool and the time taken to actually execute the complete benchmark test program.

Disclosure of Invention

The application provides a method and a device for evaluating performance of a processor, computing equipment and a readable storage medium. Various aspects related to embodiments of the present application are described below.

In a first aspect, there is provided a method of evaluating performance of a processor, the method comprising: acquiring execution time of a plurality of program segments in a benchmark test program; and calling a simulation point tool, and evaluating the performance of the processor according to the execution time of the program segments.

As one possible implementation manner, the calling the simulation point tool and evaluating the performance of the processor according to the execution time of the plurality of program segments includes: calling the simulation point tool, and clustering the plurality of program segments based on the execution time of the plurality of program segments; and evaluating the performance of the processor according to the clustering result of the plurality of program segments.

As a possible implementation manner, the calling the simulation point tool, based on the execution time of the plurality of program segments, clusters the plurality of program segments, includes: dividing the plurality of program segments into one or more program segment groups according to the execution time of the plurality of program segments; and calling the simulation point tool to cluster the program segments included in each program segment group of the one or more program segment groups respectively.

As a possible implementation manner, the dividing the plurality of program segments into one or more program segment groups according to the execution time of the plurality of program segments includes: sorting the plurality of program segments according to the execution time of the plurality of program segments; and dividing the plurality of program segments into one or more program segment groups based on the sorting result of the plurality of program segments.

As a possible implementation manner, the calling the simulation point tool separately clusters the program segments included in each of the one or more program segment groups, including: and calling the simulation point tool, and clustering the program segments included in each program segment group of the one or more program segment groups based on the execution time of the program segments.

As a possible implementation manner, the calling the simulation point tool, based on the execution time of the plurality of program segments, clusters the plurality of program segments, includes: constructing a basic block vector of each program segment in the plurality of program segments, wherein the basic block vector comprises a time characteristic value of the program segment, and the time characteristic value is used for indicating the execution time of the program segment; and calling the simulation point tool, and clustering the program segments based on the basic block vector.

As one possible implementation manner, the calling the simulation point tool and evaluating the performance of the processor according to the execution time of the plurality of program segments includes: calling the simulation point tool to obtain a clustering result of the plurality of program segments, wherein the clustering result comprises one or more program segment sample representations; determining weights of the one or more program segment sample representations based on the execution time of the plurality of program segments; and evaluating the performance of the processor based on the weights of the one or more program segment sample representations.

As a possible implementation manner, the obtaining the execution time of the plurality of program segments in the benchmark test program includes: obtaining cache miss information of each program segment in a plurality of program segments in the benchmark test program; and determining the execution time of a plurality of program segments in the benchmark test program based on the cache miss information.

As one possible implementation manner, each program segment of the plurality of program segments includes an instruction corresponding to the cache miss information and other instructions, and the determining, based on the cache miss information, execution time of the plurality of program segments in the benchmark test program includes: and determining the execution time of the program segments based on the product of the number of the instructions corresponding to the cache miss information and the first execution time and the product of the number of the other instructions and the second execution time, wherein the first execution time is larger than the second execution time.

As a possible implementation manner, the obtaining cache miss information of each of the plurality of program segments in the benchmark test program includes: and based on the three-level cache model, counting the cache miss information of each program segment in the program segments in the benchmark test program.

In a second aspect, there is provided an apparatus for evaluating performance of a processor, the apparatus comprising: the acquisition module is used for acquiring the execution time of a plurality of program segments in the benchmark test program; and the evaluation module is used for calling a simulation point tool and evaluating the performance of the processor according to the execution time of the program segments.

As a possible implementation manner, the evaluation module is configured to: call the simulation point tool, and cluster the plurality of program segments based on the execution time of the plurality of program segments; and evaluate the performance of the processor according to the clustering result of the plurality of program segments.

As a possible implementation manner, the evaluation module is configured to: call the simulation point tool to obtain a clustering result of the plurality of program segments, wherein the clustering result comprises one or more program segment sample representations; determine weights of the one or more program segment sample representations based on the execution time of the plurality of program segments; and evaluate the performance of the processor based on the weights of the one or more program segment sample representations.

In a third aspect, a computing device is provided, comprising: a memory for storing codes; a processor for executing code stored in the memory for performing the method as described in the first aspect or any one of the possible implementations of the first aspect.

In a fourth aspect, a computer readable storage medium is provided, on which code for performing the method according to the first aspect or any one of the possible implementations of the first aspect is stored.

In a fifth aspect, there is provided computer program code comprising instructions for performing the method as described in the first aspect or any one of the possible implementations of the first aspect.

The embodiment of the application is beneficial to reducing errors of performance evaluation of the processor by adopting the simulation point tool by considering execution time of a plurality of program segments in the process of evaluating the performance of the processor by adopting the simulation point tool.

Drawings

Fig. 1 is a flowchart of a method for evaluating performance of a processor according to an embodiment of the present application.

Fig. 2 is a flowchart of another method for estimating performance of a processor according to an embodiment of the present application.

Fig. 3 is a schematic structural diagram of an apparatus for evaluating performance of a processor according to an embodiment of the present application.

Fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the present application.

Detailed Description

The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments.

In the design of modern computer architectures, it is often necessary to evaluate the performance of the different architectures to improve the performance of the processor.

As one implementation, the performance and correctness of a processor design may be predicted by simulating the execution of a benchmark test program (benchmark). For example, the benchmark test program may be one provided in the CPU-intensive benchmark suite of the Standard Performance Evaluation Corporation (SPEC), referred to as a SPEC CPU program for short.

To understand the cycle-level behavior of a processor while it executes an application program, a simulator is often needed to simulate every cycle in detail. Unfortunately, executing a SPEC CPU program in its entirety takes weeks or even months, even on the fastest simulator. Worse, architectural research often requires simulating each benchmark test program over different architectural configurations and designs to find a balance among performance, complexity, area, and power consumption. For example, to determine the impact of cache size on architecture performance, the same application may need to be executed hundreds of thousands of times, which takes an amount of time that is unacceptable for product design.

The simulation point (SimPoint) tool is an important tool for architecture research. It uses program analysis and machine learning to address excessively long architecture simulation times. SimPoint exploits the fact that a program exhibits repetitive, time-varying behaviors during execution: it identifies these repetitive behaviors and takes one sample as the representation of each, thereby reducing the execution time of the program. That is, when the benchmark test program is executed, the program portions exhibiting repetitive behaviors are replaced by the sample representations selected by SimPoint, which reduces the execution time of the program.

For ease of understanding, the process of architecture simulation using the SimPoint tool and the related concepts are briefly described below.

Before the SimPoint tool is applied, the instruction stream of the benchmark test program is typically segmented, for example equally divided, into a plurality of program segments (intervals).

Each program segment may include one or more basic blocks, where a basic block is a section of a program that has only one entry and one exit. A typical feature of a basic block is that, once its first instruction is executed, all of its instructions are executed in order exactly once.

The basic block vector may indicate information of basic blocks in the program segment, wherein the number of elements included in the basic block vector may be the number of basic blocks in the program segment, and the value of each element may be a product of the number of times each basic block is executed and the number of instructions contained in the basic block during the execution of the program segment.
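The construction of a basic block vector described above can be illustrated with a minimal Python sketch. The function name `build_bbv`, the trace, and the block sizes are hypothetical and chosen only for illustration; the vector is held as a sparse dictionary rather than in any particular tool's file format.

```python
from collections import Counter


def build_bbv(executed_blocks, block_sizes):
    """Return {block_id: executions * instruction_count} for one program segment."""
    counts = Counter(executed_blocks)
    return {block_id: n * block_sizes[block_id] for block_id, n in counts.items()}


# Hypothetical segment: basic block 1 runs three times, basic block 2 once.
block_sizes = {1: 4, 2: 10}   # instructions per basic block (assumed values)
trace = [1, 2, 1, 1]          # basic block IDs in execution order (assumed values)
bbv = build_bbv(trace, block_sizes)
print(bbv)
```

Each element therefore counts the instructions that one basic block contributes to the segment, which is why every instruction carries the same weight regardless of its latency.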

If the values of all basic blocks in the basic block vectors of two program segments are identical, then running both program segments is statistically equivalent to running one of them twice. Based on this principle, the simulation point tool clusters the plurality of program segments contained in the instruction stream of the benchmark test program using the basic block vector as the feature, for example with the classical K-means clustering algorithm, and thereby selects a small number of simulation points, that is, sample representations of the program segments. In addition, in some cases, program segments that are not clustered may be assigned to a similar cluster center.
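The clustering principle can be sketched as follows. This is a simplified illustration, not the implementation of the simulation point tool: it uses plain K-means with a deterministic initialization (the first k vectors), a fixed number of iterations, and dense vectors, all of which are assumptions made here for brevity. The names `kmeans` and `pick_representatives` are hypothetical.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain K-means; centroids start at the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def pick_representatives(vectors, labels, centroids):
    """For each cluster, pick the index of the segment closest to the centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            reps[c] = min(members, key=lambda i: math.dist(vectors[i], centroid))
    return reps


# Six hypothetical two-dimensional basic block vectors.
vectors = [[10, 0], [0, 10], [9, 1], [1, 9], [11, 0], [2, 8]]
labels, centroids = kmeans(vectors, k=2)
reps = pick_representatives(vectors, labels, centroids)
print(labels, reps)
```

The segment chosen for each cluster plays the role of the sample representation, and the fraction of segments in the cluster is a natural choice for its weight.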

After clustering, the SimPoint tool typically outputs two files: a simulation points file and a weights file. The simulation points file stores the numbers of the selected intervals (that is, the program segment sample representations), and the weights file contains the weights of the program segment sample representations. Based on the weights file output by the SimPoint tool and the running time of the program segment sample representations, the running time of the benchmark test program may be calculated to evaluate the performance of the processor.
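One way to combine the weights with the measured running times is a weighted average scaled by the number of program segments. The text does not give the exact formula, so the sketch below is an assumption; `estimate_runtime` and the sample values are hypothetical.

```python
def estimate_runtime(sample_times, weights, total_segments):
    """Estimate whole-program running time from weighted sample representations.

    sample_times:   {interval_number: measured running time of that sample}
    weights:        {interval_number: weight from the weights file}, summing to 1
    total_segments: number of program segments in the benchmark test program
    """
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights are expected to sum to 1")
    mean_time = sum(weights[s] * sample_times[s] for s in weights)
    return mean_time * total_segments


# Hypothetical values: two selected intervals out of eight program segments.
sample_times = {3: 2.0, 17: 4.0}
weights = {3: 0.75, 17: 0.25}
estimated = estimate_runtime(sample_times, weights, total_segments=8)
print(estimated)
```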

However, we have found in practice that there is a significant error between the execution time, simulated on a field programmable gate array (FPGA), of the program segment sample representations selected by SimPoint and the time taken to actually execute the complete program, especially when the program contains a large number of load/store instructions. For example, for the benchmark test program gcc.403_2 of Spec2006, the error rate is as high as 21.17%. Such errors are unacceptable in our practical design. Analysis suggests that the causes of such errors may include the following two points.

On the one hand, as described above, the SimPoint tool builds the basic block vector using, as the feature, the number of times each basic block is executed multiplied by the number of instructions in the basic block, that is, the number of instructions each basic block contributes to the instruction stream. This approach treats the latency (i.e., execution time) of every instruction as identical (i.e., the latency of every instruction is one cycle), which is consistent with neither practice nor architectural knowledge.

On the other hand, SimPoint clustering considers only the logical repetitiveness of the program. Although the number of instructions in each program segment is the same (in practical use, the instruction stream of the benchmark test program is generally divided equally into a plurality of program segments), the differing execution times of different instructions are ignored, so the actual execution times of different program segments can differ considerably. In particular, in the case of a cache miss, the latency is 100 to 400 times that of other instructions. Ignoring this is unacceptable when simulation point results are used to estimate performance.

In order to solve the above-mentioned problems, embodiments of the present application provide a method for evaluating performance of a processor, which is capable of reducing errors in evaluating performance of the processor by using a simulation point tool by considering execution time of a plurality of program segments in evaluating performance of the processor by using the simulation point tool.

Referring to fig. 1, the method 100 includes step S110 and step S120.

In step S110, execution times of a plurality of program segments in the benchmark program are acquired.

To identify repetitive behavior in a benchmark program, the benchmark program may first be partitioned into multiple units, e.g., the program stream of the benchmark program may be partitioned into multiple program segments. As an implementation, the instruction stream of the benchmark program may be segmented according to a fixed length, such as a length of 100M instructions, to obtain a plurality of program segments.

In actual use, a simulator is used to run the binary of the benchmark test program, and a Pin tool is then used to trace the instruction stream of the whole benchmark test program, so that the instruction stream can be cut into fixed-size fragments according to a fixed instruction stream size, yielding the plurality of program segments.
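Cutting a traced instruction stream into fixed-size program segments can be sketched as a generator, which avoids holding the whole trace in memory. The name `iter_segments` is hypothetical, and a real trace would use a segment length such as 100M instructions rather than the small value shown here.

```python
from itertools import islice


def iter_segments(instruction_stream, segment_len):
    """Yield consecutive lists of at most segment_len instructions."""
    it = iter(instruction_stream)
    while True:
        segment = list(islice(it, segment_len))
        if not segment:
            return
        yield segment


# Hypothetical trace of ten instructions cut into segments of four.
segments = list(iter_segments(range(10), 4))
print(segments)
```

The last segment may be shorter than the others when the trace length is not a multiple of the segment length; how such a remainder is handled is not specified in the text.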

There are a number of ways to acquire the execution time of the plurality of program segments. For example, the plurality of program segments may be run and the execution time of each program segment measured.

To reduce the complexity of acquiring the execution time of the plurality of program segments, the execution time may instead be estimated. Theoretical analysis and experimental verification show that most instructions execute in one clock cycle, whereas memory access instructions mostly take more than one clock cycle. In particular, on a third-level cache (L3 cache) miss, executing a memory access instruction may require about 100 to 400 clock cycles, which is also a main cause of the error of the SimPoint tool. Thus, as one implementation, the execution time of each program segment may be estimated based on the cache miss information of the program segment. Further, since the errors caused by first-level and second-level cache misses are small, the execution time of each program segment may be estimated based on its third-level cache miss information in order to reduce the complexity of the estimation.

In step S120, a simulation point tool is invoked to evaluate the performance of the processor based on the execution time of the plurality of program segments.

In combination with the way the simulation point tool is used to estimate processor performance, the execution time of the program segments can be taken into account in several ways. For example, the plurality of program segments may be clustered based on their execution time, and the performance of the processor evaluated according to the clustering result. Alternatively, the execution time of the plurality of program segments may be taken into account when the execution time of the benchmark test program is estimated from the execution time of the program segment sample representations, that is, the performance of the processor may be evaluated based on the execution time of the plurality of program segments and the clustering result.

According to the embodiment of the application, in the process of evaluating the performance of the processor by adopting the simulation point tool, the execution time of a plurality of program segments is considered, so that the influence of different execution time of the program segments on the performance estimation result of the processor can be reduced, and the estimation error of the simulation point tool is reduced.

As an implementation, according to the execution time of the program segments, the program segments may be divided into one or more program segment groups, where the execution time of the program segments included in each of the one or more program segment groups is similar; clustering, by the simulation point tool, program segments included in each of the one or more program segment groups, respectively; based on the clustering results of the plurality of program segments, the performance of the processor may be evaluated.

In the process of dividing the plurality of program segments into one or more program segment groups, the program segments may be sorted according to the execution time of the plurality of program segments, for example, according to the size of the cache miss value, and then the plurality of program segments may be divided into one or more program segment groups according to the sorting.

That the execution times of the program segments in each program segment group are similar may mean that the cache miss rates of the program segments in each group are similar. If too many program segment sample representations are selected from the plurality of program segments, the time needed to estimate the performance of the processor with the benchmark test program cannot be effectively reduced. Thus, to control the number of program segment sample representations, as one example, the difference in execution time among the program segments included in each of the one or more program segment groups is less than or equal to 5%.

In this way, program segments with similar execution times are placed in the same program segment group, so that when the SimPoint tool is used within a program segment group to estimate the execution time of the benchmark test program, the error caused by differing execution times of the program segments is eliminated to a certain extent, which reduces the error of estimating processor performance with the SimPoint tool.
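The sort-then-group step can be sketched as below. The text does not define how the 5% difference is measured; this sketch assumes it is relative to the shortest execution time in the group, and the name `group_by_time` and the sample values are hypothetical.

```python
def group_by_time(exec_times, tolerance=0.05):
    """Group segment IDs so that each group spans at most `tolerance` in time.

    exec_times: {segment_id: execution time}. Segments are sorted by time and
    a new group is opened whenever a segment exceeds the first (shortest)
    member of the current group by more than the tolerance.
    """
    groups = []
    for seg in sorted(exec_times, key=exec_times.get):
        t = exec_times[seg]
        if groups and t <= exec_times[groups[-1][0]] * (1 + tolerance):
            groups[-1].append(seg)
        else:
            groups.append([seg])
    return groups


# Hypothetical execution times in clock cycles.
exec_times = {0: 100, 1: 104, 2: 150, 3: 101, 4: 156}
groups = group_by_time(exec_times)
print(groups)
```

Each resulting group can then be passed to the clustering step on its own, so that a sample representation is only ever asked to stand in for segments whose execution times are close to its own.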

To further reduce the estimation error, the program segments may be clustered within a set of program segments based on their execution time.

In some embodiments, when the simulation point tool is called to cluster the plurality of program segments based on their execution time, a time feature value may be added for each program segment during clustering. The time feature value indicates the execution time of the program segment and may be, for example, the estimated execution time of the program segment mentioned above. As described above, clustering can be performed based on the basic block vectors of the program segments; therefore, a basic block vector that includes the time feature value of the program segment can be constructed for each of the plurality of program segments.

For example, an element indicating the execution time of the program segment may be added to the basic block vector of the program segment. The basic block vectors of the program segments are typically stored in a bb file, so the maximum dimension dim_max of the basic block vectors can be obtained by traversing the bb file, and the time feature can then be added to the basic block vector as dimension dim_max+1.

Clustering the plurality of program segments according to basic block vectors that include the time feature value associates the clustering result with the execution time of the program segments, which can reduce the error of the processor performance estimate.
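Appending a time feature to the stored basic block vectors can be sketched as follows. The sketch assumes that each vector line of the bb file starts with `T` followed by `:dimension:value` pairs, and that the time feature is placed in a new dimension after the largest existing one; both are assumptions to be checked against the actual file. The name `add_time_feature` is hypothetical, and the function works on lines already read from the file.

```python
def add_time_feature(bb_lines, time_values):
    """Append one extra dimension holding the time feature to every vector line."""
    vector_lines = [line for line in bb_lines if line.startswith("T")]
    if len(vector_lines) != len(time_values):
        raise ValueError("need exactly one time value per basic block vector")
    # Traverse all vectors to find the largest dimension index in use.
    dim_max = max(int(pair.split(":")[1])
                  for line in vector_lines for pair in line.split())
    values = iter(time_values)
    result = []
    for line in bb_lines:
        if line.startswith("T"):
            result.append(f"{line.rstrip()} :{dim_max + 1}:{next(values)}")
        else:
            result.append(line)  # leave any non-vector line untouched
    return result


# Two hypothetical vectors and their estimated execution times.
bb_lines = ["T:1:12 :2:10", "T:2:30 :5:7"]
augmented = add_time_feature(bb_lines, [500, 900])
print(augmented)
```

Because the time value can be much larger than the instruction counts in the other dimensions, some scaling may be needed before clustering; the text does not specify one.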

As described above, calling the simulation point tool yields a clustering result of the plurality of program segments, which may include one or more program segment sample representations. As one implementation, the weights of the one or more program segment sample representations may be determined based on the execution time of the plurality of program segments, and the performance of the processor evaluated based on these weights.

In practical use, the clustering result obtained by calling the simulation point tool may further include initial weights of the one or more program segment sample representations. Because the initial weights given by the simulation point tool do not reflect the real execution time of the program segments, updated weights of the one or more program segment sample representations can be obtained according to the execution time of the plurality of program segments, and the performance of the processor can be evaluated based on the initial weights and the updated weights. For example, the initial weight and the updated weight of a program segment sample representation may be combined by weighting, and the result used as the final weight for estimating the performance of the processor.

As an example, the updated weight of a program segment sample representation may be determined as the ratio of the sum of the execution times of the program segments in the cluster to which the sample representation belongs to the sum of the execution times of all program segments of the benchmark test program.

Determining the weights of the program segment sample representations based on execution time can further improve the accuracy of the processor performance estimate.
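The weight update can be sketched as below. The updated weight follows the ratio described above; the blending factor `alpha` used to combine initial and updated weights is an assumption, since the text only says that the two are weighted. The names `updated_weights` and `blend_weights` and all values are hypothetical.

```python
def updated_weights(cluster_of, exec_times):
    """Weight of each cluster = its share of the total execution time."""
    total = sum(exec_times.values())
    sums = {}
    for segment, cluster in cluster_of.items():
        sums[cluster] = sums.get(cluster, 0) + exec_times[segment]
    return {cluster: s / total for cluster, s in sums.items()}


def blend_weights(initial, updated, alpha=0.5):
    """Combine initial and updated weights; alpha is an assumed mixing factor."""
    return {c: alpha * initial[c] + (1 - alpha) * updated[c] for c in initial}


# Four hypothetical segments in two clusters; segment 2 is far slower.
cluster_of = {0: "A", 1: "A", 2: "B", 3: "A"}
exec_times = {0: 100, 1: 100, 2: 600, 3: 200}
initial = {"A": 0.75, "B": 0.25}   # share of segments per cluster
updated = updated_weights(cluster_of, exec_times)
final = blend_weights(initial, updated)
print(updated, final)
```

In this example cluster B holds a quarter of the segments but 60% of the execution time, which is the kind of mismatch the updated weights are meant to correct.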

As mentioned above, the execution time of a program segment can be calculated based on the cache miss information of the program segment. In actual use, the cache miss information may include one or more of information such as a number of cache misses (miss_num), a total number of accesses (miss_num), or a miss ratio (miss_rate) in a program segment.

Based on the cache miss information, the number of instructions in the program segment that miss the cache may be determined, and the execution time of the program segment calculated from it. For example, the execution time of a program segment is determined based on the product of the number of instructions that miss the cache and a first execution time, plus the product of the number of other instructions in the program segment and a second execution time. The first execution time may be the time required to execute an instruction that misses the cache, and the second execution time may be the time required to execute any other instruction, for example one clock cycle. Typically, executing an instruction that misses the cache takes longer than executing other instructions, i.e., the first execution time is greater than the second execution time.

Taking cache miss information as three-level cache miss information as an example, under different simulation architectures, such as different FPGAs, the execution time of the instructions of the three-level cache miss, that is, the first execution time may be different, and the first execution time may be determined based on the actual configuration situation.

As one example, the first execution time may be 400 clock cycles and the second execution time may be 1 clock cycle. Based on this, taking the number of instructions in a program segment as 100M (100 × 1024 × 1024) as an example, the execution time (interval_time) of the program segment can be calculated as: interval_time = miss_num × 400 + (100 × 1024 × 1024 - miss_num) × 1.
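The formula can be written directly as a small function. The defaults mirror the example values of 400 cycles, 1 cycle, and 100M instructions; the parameter names other than `miss_num` and `interval_time` are hypothetical.

```python
SEGMENT_INSTRUCTIONS = 100 * 1024 * 1024  # 100M instructions per program segment


def interval_time(miss_num, miss_cycles=400, other_cycles=1,
                  instruction_num=SEGMENT_INSTRUCTIONS):
    """Estimated execution time of one program segment, in clock cycles."""
    if not 0 <= miss_num <= instruction_num:
        raise ValueError("miss_num must lie between 0 and instruction_num")
    return miss_num * miss_cycles + (instruction_num - miss_num) * other_cycles


# A segment with no misses versus one with a million missing instructions.
print(interval_time(0))          # 104857600
print(interval_time(1_000_000))  # 503857600
```

Under these example values, a miss on fewer than 1% of the instructions is enough to make the segment take nearly five times as long, which shows why segments with equal instruction counts can differ so much in execution time.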

To obtain the cache miss information of a program segment, a script may be used to simulate the cache behavior in actual execution. To preserve the architecture-independent nature of SimPoint, a generic cache model may be selected for the simulation. A three-level cache model logically includes the first-level and second-level caches, and instruction latency is largest when the third-level cache misses, so a three-level cache model may be used for the simulation. A generic three-level cache model is given below:

(cache_size,line_size,[line_num,way_num],index_method,replacement_method)

(2048,64,[2048,16],'TRA','PLRU')

Here, cache_size is the cache size, line_size is the cache line size, line_num is the number of cache lines, way_num is the number of ways of set associativity, index_method is the indexing method, and replacement_method is the replacement algorithm.

In the above model, the cache size is 2048, the cache line size is 64, the number of cache lines is 2048, the set associativity is 16 ways, the indexing method is TRA, and the replacement algorithm is PLRU.

To simulate the cache behavior of a program segment during execution, the memory access information of the program segment needs to be acquired. For example, while the Pin tool traces the whole benchmark test program, the PC information of jump instructions may be recorded, thereby obtaining the memory access file of the program segment. The memory access file may record information such as the PC value, the virtual address, the value recorded in memory, the access length, and the instruction type (such as load instruction or store instruction).

Using the information in the memory access file of the program segment, the replacement process of instructions in the cache model is simulated, so that the number of misses, the total number of memory accesses, the miss rate, and so on can be counted for each program segment.
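Counting misses with a cache model can be sketched with a small set-associative simulator. This is a simplification: it uses true LRU replacement in place of PLRU and plain modulo set indexing in place of the TRA indexing method, which the text does not define. The class name `SetAssociativeCache` and the counter `access_num` are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Set-associative cache with LRU replacement (a stand-in for PLRU)."""

    def __init__(self, line_size=64, line_num=2048, way_num=16):
        self.line_size = line_size
        self.way_num = way_num
        self.set_num = line_num // way_num
        self.sets = [OrderedDict() for _ in range(self.set_num)]
        self.access_num = 0
        self.miss_num = 0

    def access(self, address):
        """Record one memory access; return True on a hit, False on a miss."""
        line = address // self.line_size
        ways = self.sets[line % self.set_num]   # modulo indexing (assumption)
        self.access_num += 1
        if line in ways:
            ways.move_to_end(line)              # mark as most recently used
            return True
        self.miss_num += 1
        if len(ways) >= self.way_num:
            ways.popitem(last=False)            # evict the least recently used
        ways[line] = True
        return False


# Tiny hypothetical cache: 4 lines, 2 ways, hence 2 sets.
cache = SetAssociativeCache(line_size=64, line_num=4, way_num=2)
for address in [0, 64, 0, 128, 256, 0]:
    cache.access(address)
miss_rate = cache.miss_num / cache.access_num
print(cache.miss_num, cache.access_num, miss_rate)
```

Feeding the virtual addresses from the memory access file of one program segment through such a model, and resetting the counters at each segment boundary, yields the per-segment miss count, access count, and miss rate.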

It should be noted that the aforementioned methods for reducing the SimPoint estimation error may be used alone or in combination.

Fig. 2 is a flowchart of another method for estimating performance of a processor according to an embodiment of the present application. The method of estimating the processor operating performance is described in detail below in conjunction with FIG. 2.

In scenarios where the SimPoint tool is used to shorten the execution time of the binaries of the SPEC CPU benchmark suite, method 200 may help improve the accuracy of the SimPoint tool by introducing the execution time of the program segments.

Referring to fig. 2, the method 200 may include steps S210 to S280.

In step S210, a benchmark test program is executed.

Before the SimPoint-based processor performance estimation method is optimized, a benchmark test program, such as a binary of a SPEC CPU benchmark test program, is typically executed once to obtain relevant information about the benchmark test program.

To facilitate the use of the SimPoint tool, the Pin tool may be used to trace the instruction stream of the entire benchmark test program during its execution and to cut the instruction stream into fixed-size fragments, i.e., the plurality of program segments mentioned above, according to a fixed instruction stream size.

In step S220, a memory access file is acquired.

To obtain the cache miss information of the program segments, the memory access information of each program segment, such as a memory access file mem_interval, needs to be obtained first. As an implementation, while the Pin tool traces the instruction stream of the benchmark test program, the PC information of each jump instruction actually taken may be recorded, and the memory access file obtained. The memory access file may record information such as the PC value, the virtual address, the value recorded in memory, the access length, and the instruction type (such as load instruction or store instruction).

In step S230, a program segment file is acquired.

In order to construct the basic block vector of a program segment, a program segment file of the program segment needs to be acquired first. As an implementation manner, information such as a jump instruction type, a source address of the jump instruction, a destination address of the jump instruction and the like can be recorded in the process of tracking the benchmark test program instruction stream so as to form a program segment file.

In step S240, time characteristic data of the program segment is acquired.

As mentioned above, the time characteristic data of a program segment can be acquired based on the cache miss information of the program segment. As one implementation manner, a cache model simulates, based on the memory access information of the program segment, the cache behavior of the program segment during actual execution, so that the time characteristic data of the program segment can be obtained. The memory access information may be obtained from the memory access file of the program segment acquired in step S220.

To preserve the architecture-independent nature of simpoint, a generic cache model may be selected for the simulation. A three-level cache model logically contains the first-level cache and the second-level cache, and the latency of executing an instruction is largest when the third-level cache misses, so the simulation can be performed with a three-level cache model.

A general three-level cache model has been described above and, for brevity, is not described again here.

The replacement process in the cache is simulated in the cache model using the information in the memory access file of the program segment, so that cache miss information such as the number of misses, the total number of memory accesses, and the miss ratio of each program segment can be counted.
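The counting of cache miss information can be illustrated with a small model. The sketch below assumes set-associative caches with LRU replacement, a 64-byte line size, and a simple fill-on-miss policy across three levels; these parameters, and the names LRUCache and count_l3_misses, are illustrative assumptions and are not taken from the application.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List


class LRUCache:
    """Set-associative cache with LRU replacement (tags only, no data)."""

    def __init__(self, num_sets: int, ways: int, line_size: int = 64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, fill the line and evict the LRU line."""
        line = addr // self.line_size
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # mark as most recently used
            return True
        cache_set[line] = None
        if len(cache_set) > self.ways:
            cache_set.popitem(last=False)  # evict least recently used
        return False


def count_l3_misses(addresses: Iterable[int], levels: List[LRUCache]) -> Dict[str, float]:
    """Count accesses that miss in every level (i.e., miss in the last level)."""
    misses = total = 0
    for addr in addresses:
        total += 1
        # any() stops at the first level that hits, so lower levels are
        # only looked up (and filled) when the upper levels miss.
        if not any(level.access(addr) for level in levels):
            misses += 1
    ratio = misses / total if total else 0.0
    return {"miss_num": misses, "total": total, "miss_ratio": ratio}


# Toy hierarchy: 1-way L1, 2-way L2, 4-way L3, one set each.
hierarchy = [LRUCache(1, 1), LRUCache(1, 2), LRUCache(1, 4)]
print(count_l3_misses([0, 64, 0, 0], hierarchy))
```

A real model would use the cache sizes and associativity of the target configuration; only the counting logic is meant to carry over.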

The execution time of an instruction that incurs a cache miss can be determined flexibly according to the actual situation, such as the configuration of the FPGA used. Considering the worst-case configuration, the execution time of the instruction (i.e., the first execution time mentioned earlier) may be 400 clock cycles on a third-level cache miss.

Taking as an example a first execution time of 400 clock cycles and an execution time of 1 clock cycle for every other instruction in the program segment (i.e., every instruction that does not miss in the third-level cache), the time characteristic data of the program segment, such as the execution time of the program segment (interval_time), can be obtained by the following formula: interval_time = Miss_num × 400 + (100 × 1024 × 1024 - Miss_num) × 1, where Miss_num is the number of third-level cache misses in the program segment.
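The formula can be checked with a short calculation. The sketch below assumes that 100×1024×1024 is the number of instructions in one program segment, which is how the formula reads; the function name and parameter names are hypothetical.

```python
SEGMENT_INSTRUCTIONS = 100 * 1024 * 1024  # assumed instruction count per program segment


def interval_time(miss_num: int,
                  segment_instructions: int = SEGMENT_INSTRUCTIONS,
                  miss_cycles: int = 400,
                  other_cycles: int = 1) -> int:
    """Estimated cycles for one segment: misses cost miss_cycles, all others other_cycles."""
    if not 0 <= miss_num <= segment_instructions:
        raise ValueError("miss_num must lie between 0 and segment_instructions")
    return miss_num * miss_cycles + (segment_instructions - miss_num) * other_cycles


print(interval_time(0))     # 104857600
print(interval_time(1000))  # 105256600
```

Each miss therefore adds 399 cycles over the baseline of one cycle per instruction, so segments with more third-level misses receive a proportionally larger time characteristic.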

In some embodiments, the execution time of a program segment may be used to determine a weight file for the program segment.

In step S250, a .bb file is generated.

First, an initial basic block vector is acquired. The bbv_build tool may be used to traverse the entire jump instruction stream and reconstruct the entire dynamic instruction stream of the benchmark test program; at the same time, the control flow graph (CFG) may be used to obtain all basic block information (e.g., the index and size of each basic block). From this information, an initial basic block vector of each program segment can be obtained and written out as a .bb file.

Secondly, the time characteristic data is added to the initial basic block vector to obtain a basic block vector containing the time characteristic. For example, the maximum dimension index dim_max of the basic block vectors may be obtained by traversing the original basic block vectors of the program segments. The time characteristic data of each program segment is then appended after dimension dim_max, as the last dimension of that segment's basic block vector, where it serves as an additional feature value.
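Appending the time characteristic as an extra dimension can be sketched as follows. The sketch assumes that each basic block vector is held as a sparse mapping from dimension index to count, that the new dimension is dim_max + 1, and that a .bb line has the frequency-vector form "T:dim:count :dim:count ..." used by SimPoint; the function names are hypothetical.

```python
from typing import Dict, List


def add_time_feature(bbvs: List[Dict[int, int]], times: List[int]) -> List[Dict[int, int]]:
    """Return copies of the vectors with the segment time stored at dimension dim_max + 1."""
    if len(bbvs) != len(times):
        raise ValueError("one time value is required per basic block vector")
    # dim_max is the largest dimension index used by any segment.
    dim_max = max((max(v) for v in bbvs if v), default=0)
    time_dim = dim_max + 1  # assumption: the time feature takes the next free index
    return [{**v, time_dim: t} for v, t in zip(bbvs, times)]


def to_bb_line(vector: Dict[int, int]) -> str:
    """Format one sparse vector as a single frequency-vector line."""
    return "T" + " ".join(f":{dim}:{count}" for dim, count in sorted(vector.items()))


vectors = add_time_feature([{1: 10, 3: 5}, {2: 7}], [100, 200])
for v in vectors:
    print(to_bb_line(v))
```

Because every segment receives the same extra dimension, segments with similar code signatures but very different miss behavior are pulled apart during clustering.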

Finally, a new .bb file is constructed according to the grouping information of the program segments.

To further reduce the estimation error of simpoint, the program segments may be divided into one or more program segment groups before the plurality of program segments are clustered. For example, the program segments may be sorted by their cache_miss values, and the sorted program segments may be divided into one or more large parts, i.e., the one or more program segment groups mentioned above.

If too many program segment sample representations (i.e., checkpoint images) are selected from the plurality of program segments, the time needed to estimate processor performance based on the benchmark test program cannot be effectively reduced. Therefore, according to experimental results, when the plurality of program segments are grouped, the difference in cache_miss within each program segment group may be kept within 5%.

Based on the division into one or more program segment groups, new .bb files may be constructed. For example, the basic block vectors of the plurality of program segments are written, group by group, into one .bb file per program segment group, so that the number of final .bb files equals the number of program segment groups.
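The grouping step can be sketched as below. The text does not define how the 5% difference is measured, so the sketch assumes a greedy pass over the sorted segments in which a segment joins the current group while its cache_miss value exceeds the smallest value of that group by at most 5%; this rule and the name group_by_cache_miss are assumptions for illustration.

```python
from typing import List


def group_by_cache_miss(cache_miss: List[int], tolerance: float = 0.05) -> List[List[int]]:
    """Sort segments by cache_miss and split them into groups of similar values.

    Returns groups of global segment numbers (indices into cache_miss).
    """
    order = sorted(range(len(cache_miss)), key=lambda i: cache_miss[i])
    groups: List[List[int]] = []
    for idx in order:
        if groups:
            base = cache_miss[groups[-1][0]]  # smallest value in the current group
            if cache_miss[idx] - base <= tolerance * max(base, 1):
                groups[-1].append(idx)
                continue
        groups.append([idx])
    return groups


groups = group_by_cache_miss([100, 104, 200, 103, 209, 500])
print(groups)
```

Each resulting group would then be written to its own .bb file, with the segments in the order listed, so that the position of a segment in the file can later be mapped back to its global number.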

In step S260, a simulation point file and a weight file are acquired.

Clustering of the program segments within each program segment group is accomplished by invoking the simpoint tool to sample the *.bb file of each program segment group, where the sampling parameters may include "-saveSimpoints", "-saveSimpointWeights", "-saveLabels", and "-maxK". Here, "-saveSimpoints" saves the numbers of the program segment sample representations, i.e., the segments selected for the cluster centers; "-saveSimpointWeights" saves the weights of the program segment sample representations; "-saveLabels" saves the number of the cluster to which each program segment belongs, i.e., which program segments are clustered into the same cluster; and "-maxK" sets the maximum number of cluster centers.
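A per-group invocation might be assembled as in the following sketch. The save options and "-maxK" are the parameters named above; the "-loadFVFile" option, the executable name "simpoint", and the output file naming are assumptions based on common SimPoint 3.x usage and should be checked against the installed tool. The sketch only builds the argument list and does not run anything.

```python
from pathlib import Path
from typing import List


def simpoint_command(bb_file: str, max_k: int = 10, binary: str = "simpoint") -> List[str]:
    """Build the argument list for clustering one program segment group."""
    stem = Path(bb_file).with_suffix("")  # e.g. group0.bb -> group0
    return [
        binary,
        "-loadFVFile", bb_file,                      # assumed input option
        "-maxK", str(max_k),                         # upper bound on cluster centers
        "-saveSimpoints", f"{stem}.simpoints",       # selected sample representations
        "-saveSimpointWeights", f"{stem}.weights",   # weights of the representations
        "-saveLabels", f"{stem}.labels",             # cluster label of every segment
    ]


print(" ".join(simpoint_command("group0.bb")))
```

The list can be passed to subprocess.run once the option names have been confirmed for the tool version in use.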

If the K-means algorithm is used to sample the plurality of program segments, the K value (the upper bound on the number of cluster centers, i.e., the number of cluster centers is less than or equal to K) defaults to 10 and can be adjusted according to requirements and the actual situation.

Based on the results of the above sampling, a simulation point file (a .simpoints file), a weight file (a .weights file), and a .labels file can be obtained for each program segment group. That is, the program segment sample representations in each program segment group, and the weight of each representation within the group, can be obtained.

However, since the sampling is performed on the *.bb file of each program segment group, the numbers in the sampling result are numbers within that program segment group. The numbers in the sampling result therefore need to be mapped to global numbers to obtain a new *.simpoints file.
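The mapping from group-local numbers to global numbers can be sketched as follows, assuming 0-based numbering and assuming that the order in which segments were written to the group's .bb file has been kept as a list of global segment numbers. The name to_global_ids is hypothetical.

```python
from typing import List


def to_global_ids(local_simpoints: List[int], group_members: List[int]) -> List[int]:
    """Translate group-local segment numbers into global segment numbers.

    group_members[i] is the global number of the i-th segment written to
    the group's .bb file.
    """
    return [group_members[local] for local in local_simpoints]


# The group's .bb file holds global segments 7, 2, 9, 4 in this order;
# the tool selected local segments 0 and 2 as sample representations.
print(to_global_ids([0, 2], [7, 2, 9, 4]))  # [7, 9]
```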

In view of the time characteristics of the program segments, the weights directly generated by the simpoint tool cannot represent weights based on the real execution times of the program segments; the weights may therefore be recalculated based on the execution times of the program segments acquired in step S240. Taking a program segment group containing m program segments as an example, the simpoint tool may select n cluster centers (corresponding to n clusters) from the program segment group, where each cluster center represents one cluster within the group. Let the time characteristic data of each program segment be t_i (i = 1, 2, ..., m). If the i-th cluster contains x_i program segments, and the time characteristic of each program segment in the i-th cluster is t_j (j = 1, 2, ..., x_i), the weight weights_i of the i-th cluster center can be calculated by the following formula.
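The formula itself does not appear in this text. One plausible form, given here purely as an assumption and not as the formula of the application, takes the weight of a cluster to be the sum of the time characteristics of its program segments divided by the sum over all m program segments in the group. The sketch below implements that assumed form; the name time_weights is hypothetical.

```python
from typing import Dict, List


def time_weights(times: List[float], labels: List[int]) -> Dict[int, float]:
    """Assumed time-based weight: cluster time divided by total time of the group.

    times[k] is the time characteristic of segment k, labels[k] its cluster number.
    """
    if len(times) != len(labels):
        raise ValueError("times and labels must have the same length")
    total = sum(times)
    if total <= 0:
        raise ValueError("total time must be positive")
    cluster_time: Dict[int, float] = {}
    for t, label in zip(times, labels):
        cluster_time[label] = cluster_time.get(label, 0.0) + t
    return {label: s / total for label, s in cluster_time.items()}


# Four segments in two clusters; cluster 1 holds the slower segments.
print(time_weights([100, 300, 200, 400], [0, 1, 0, 1]))
```

Under this assumption a cluster made of slow segments receives a larger weight than its segment count alone would give it, which matches the stated aim of reflecting real execution time.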

To further increase the accuracy of the processor performance estimation results, the initial weights (i.e., the weights given directly by the simpoint tool) and the updated weights (i.e., the recalculated weights described above) of the program segment sample representations may be combined in a weighted sum, and the result used as the final weights for estimating the performance of the processor.
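The weighted sum can be sketched as follows. The mixing coefficient alpha is a hypothetical parameter; the text does not state which coefficients are used.

```python
from typing import Dict


def blend_weights(initial: Dict[int, float],
                  updated: Dict[int, float],
                  alpha: float = 0.5) -> Dict[int, float]:
    """Final weight = alpha * initial weight + (1 - alpha) * updated weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    keys = set(initial) | set(updated)
    return {k: alpha * initial.get(k, 0.0) + (1.0 - alpha) * updated.get(k, 0.0)
            for k in keys}


final = blend_weights({0: 0.5, 1: 0.5}, {0: 0.3, 1: 0.7}, alpha=0.5)
print(final)
```

Since this is a convex combination, the final weights still sum to 1 whenever both input weight sets do.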

In step S270, data is prepared.

To enhance the performance of the simpoint tool, one or more of the following data may be prepared before a program segment sample representation (i.e., the program segment corresponding to a cluster center) is run: a checkpoint image file, the last 1000 instructions of the program segment, and 5M instructions of warm-up data. The numbers of instructions can be set flexibly according to the actual situation.

The checkpoint image file records the state of hardware such as the registers and the CPU at the starting point of execution of the program segment on the FPGA. Based on the checkpoint image file, the program segment sample representation can be run quickly during verification.

In step S280, verification is performed on the FPGA.

The program segment sample representations are executed on the FPGA according to the checkpoint image files, and the execution time of each is recorded. Further, based on the recorded results and the final weights obtained in step S260, the execution time of the complete benchmark test program may be estimated. The running performance of the processor may be evaluated based on the execution time of the complete benchmark test program.
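The extrapolation from sample representations to the complete benchmark test program can be sketched as below. The sketch assumes the conventional sampling-style estimate, in which the weighted average time of the sample representations is multiplied by the total number of program segments; the text does not spell out the exact calculation, so this form and the name estimate_total_time are assumptions.

```python
from typing import Dict


def estimate_total_time(sample_times: Dict[int, float],
                        weights: Dict[int, float],
                        num_segments: int) -> float:
    """Weighted average sample time scaled by the number of program segments."""
    if set(sample_times) != set(weights):
        raise ValueError("every sample representation needs a time and a weight")
    average = sum(weights[k] * sample_times[k] for k in sample_times)
    return num_segments * average


# Two sample representations (global segments 3 and 8), 100 segments in total.
print(estimate_total_time({3: 2.0, 8: 4.0}, {3: 0.25, 8: 0.75}, 100))  # 350.0
```

With grouped clustering, the same calculation could be applied per program segment group and the per-group estimates summed.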

In order to verify the correctness of the checkpoint image file, the last 1000 instructions of each program segment may be obtained, the corresponding 1000 instructions may also be obtained from the FPGA, and the two may be compared to ensure that the program executes correctly.

In order to improve the accuracy of the program segment execution time estimate, the first 5M instructions of the program segment may be acquired and used for warm-up, so that the hardware as a whole is in a normal execution state. Experiments show that this improves the evaluation result.

In order to verify the accuracy of the method for estimating the running performance of the processor provided by the embodiments of the present application, the complete benchmark test program may be executed on the FPGA both on bare metal and with an operating system, and the execution time of the complete benchmark test program obtained in each case. Experimental verification shows that, when the weights given directly by simpoint are used, the estimated running time of the benchmark test program is closer to the bare-metal execution time, whereas the estimate obtained with the weights based on the execution time of the program segments provided by the embodiments of the present application is closer to the execution time with an operating system. That is, with the method provided by the embodiments of the present application, the estimated processor performance is closer to actual use conditions.

According to experimental results, with the processor performance evaluation method of the embodiments of the present application, the error of the time estimated by the simpoint tool on SPEC CPU 2006 can be reduced from 21% to within 3%.

Method embodiments of the present application are described above in detail in connection with fig. 1-2, and apparatus embodiments of the present application are described below in detail in connection with fig. 3-4. It is to be understood that the description of the device embodiments corresponds to the description of the method embodiments, and that parts not described in detail can therefore be seen in the preceding method embodiments.

Fig. 3 is a schematic structural diagram of an apparatus for evaluating performance of a processor according to an embodiment of the present application, where the apparatus 300 includes an acquisition module 310 and an evaluation module 320.

The acquisition module 310 is configured to obtain the execution time of a plurality of program segments in the benchmark test program.

The evaluation module 320 is configured to invoke a simulation point tool and evaluate the performance of the processor according to the execution time of the plurality of program segments.

Optionally, the evaluation module is configured to: invoking the simulation point tool, and clustering the program segments based on the execution time of the program segments; and evaluating the performance of the processor according to the clustering result of the program segments.

Optionally, the invoking the simulation point tool and clustering the plurality of program segments based on the execution time of the plurality of program segments includes: dividing the plurality of program segments into one or more program segment groups according to the execution time of the plurality of program segments; and invoking the simulation point tool to separately cluster the program segments included in each program segment group of the one or more program segment groups.

Optionally, the dividing the plurality of program segments into one or more program segment groups according to the execution time of the plurality of program segments includes: sorting the plurality of program segments according to the execution time of the plurality of program segments; and dividing the plurality of program segments into one or more program segment groups based on the result of the sorting.

Optionally, the invoking the simulation point tool to separately cluster the program segments included in each of the one or more program segment groups includes: invoking the simulation point tool, and clustering the program segments included in each program segment group of the one or more program segment groups based on the execution time of the program segments.

Optionally, the invoking the simulation point tool and clustering the plurality of program segments based on the execution time of the plurality of program segments includes: constructing a basic block vector of each program segment in the plurality of program segments, wherein the basic block vector comprises a time characteristic value of the program segment, and the time characteristic value is used for indicating the execution time of the program segment; and invoking the simulation point tool, and clustering the plurality of program segments based on the basic block vectors.

Optionally, the evaluation module is configured to: invoking the simulation point tool to obtain a clustering result of the program segments, wherein the clustering result comprises one or more program segment sample representations; determining weights represented by the one or more program segment samples based on execution times of the plurality of program segments; the performance of the processor is evaluated based on the weights represented by the one or more program segment samples.

Optionally, the acquiring the execution time of the plurality of program segments in the benchmark test program includes: obtaining cache miss information of each program segment in a plurality of program segments in the benchmark test program; and determining the execution time of a plurality of program segments in the benchmark test program based on the cache miss information.

Optionally, each program segment of the plurality of program segments includes instructions corresponding to the cache miss information and other instructions, and the determining, based on the cache miss information, the execution time of the plurality of program segments in the benchmark test program includes: determining the execution time of the program segment based on the product of the number of instructions corresponding to the cache miss information and a first execution time, and the product of the number of the other instructions and a second execution time, wherein the first execution time is greater than the second execution time.

Optionally, the obtaining the cache miss information of each of the plurality of program segments in the benchmark test program includes: counting, based on a three-level cache model, the cache miss information of each program segment of the plurality of program segments in the benchmark test program.

Fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the present application. The computing device 400 shown in fig. 4 may include a memory 410 and a processor 420. In some embodiments, the computing device 400 shown in fig. 4 may also include an input/output interface 430. The memory 410 is used to store instructions and the processor 420 is used to execute the instructions stored by the memory 410 to perform the methods described in any of the previous embodiments.

It should be appreciated that in the embodiments of the present application, the processor 420 may be a general-purpose central processing unit (central processing unit, CPU), a microprocessor, an application-specific integrated circuit (application specific integrated circuit, ASIC), or one or more integrated circuits for executing related programs to implement the solutions provided in the embodiments of the present application.

The memory 410 may include read only memory and random access memory and provides instructions and data to the processor 420. A portion of the processor 420 may also include non-volatile random access memory. For example, the processor 420 may also store information of the device type.

In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or by instructions in software in the processor 420. The method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware processor, or as being executed by a combination of hardware and software modules in a processor. The software modules may be located in a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, a register, or another storage medium well known in the art. The storage medium is located in the memory 410, and the processor 420 reads the information in the memory 410 and, in combination with its hardware, performs the steps of the method described above. To avoid repetition, a detailed description is not provided herein.

It should be appreciated that in embodiments of the present application, the processor may be a central processing unit (central processing unit, CPU), the processor may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), FPGA or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The present application also provides a computer-readable storage medium storing a program to cause a computer to execute the method described in any one of the foregoing embodiments.

It should be understood that in the embodiments of the present application, "B corresponding to A" means that B is associated with A and that B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.

It should be understood that the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be read by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of evaluating performance of a processor, comprising:

acquiring execution time of a plurality of program segments in a benchmark test program;

and calling a simulation point tool, and evaluating the performance of the processor according to the execution time of the program segments.

2. The method of claim 1, wherein the invoking the simulation point tool and evaluating the performance of the processor according to the execution time of the plurality of program segments comprises:

invoking the simulation point tool, and clustering the program segments based on the execution time of the program segments;

and evaluating the performance of the processor according to the clustering result of the program segments.

3. The method of claim 2, wherein the invoking the simulation point tool and clustering the plurality of program segments based on the execution time of the plurality of program segments comprises:

dividing the plurality of program segments into one or more program segment groups according to the execution time of the plurality of program segments;

and calling the simulation point tool to cluster the program segments included in each program segment group of the one or more program segment groups respectively.

4. The method of claim 3, wherein the dividing the plurality of program segments into one or more program segment groups according to the execution time of the plurality of program segments comprises:

sequencing the program segments according to the execution time of the program segments;

the plurality of program segments are divided into one or more program segment groups based on a result of the ordering of the plurality of program segments.

5. The method of claim 3, wherein the invoking the simulation point tool to separately cluster the program segments included in each of the one or more program segment groups comprises:

and calling the simulation point tool, and clustering the program segments included in each program segment group of the one or more program segment groups based on the execution time of the program segments.

6. The method of claim 2, wherein the invoking the simulation point tool and clustering the plurality of program segments based on the execution time of the plurality of program segments comprises:

constructing a basic block vector of each program segment in the plurality of program segments, wherein the basic block vector comprises a time characteristic value of the program segment, and the time characteristic value is used for indicating the execution time of the program segment;

and calling the simulation point tool, and clustering the program segments based on the basic block vector.

7. The method of claim 1, wherein the invoking the simulation point tool and evaluating the performance of the processor according to the execution time of the plurality of program segments comprises:

invoking the simulation point tool to obtain a clustering result of the program segments, wherein the clustering result comprises one or more program segment sample representations;

determining weights represented by the one or more program segment samples based on execution times of the plurality of program segments;

the performance of the processor is evaluated based on the weights represented by the one or more program segment samples.

8. The method of claim 1, wherein the obtaining the execution time of the plurality of program segments in the benchmark program comprises:

obtaining cache miss information of each program segment in a plurality of program segments in the benchmark test program;

and determining the execution time of a plurality of program segments in the benchmark test program based on the cache miss information.

9. The method of claim 8, wherein each of the plurality of program segments includes instructions corresponding to the cache miss information and other instructions therein, wherein determining execution time of the plurality of program segments in the benchmark test program based on the cache miss information comprises:

and determining the execution time of the program segments based on the product of the number of the instructions corresponding to the cache miss information and the first execution time and the product of the number of the other instructions and the second execution time, wherein the first execution time is larger than the second execution time.

10. The method of claim 8, wherein the obtaining cache miss information of each of the plurality of program segments in the benchmark test program comprises:

and based on the three-level cache model, counting the cache miss information of each program segment in the program segments in the benchmark test program.

11. An apparatus for evaluating performance of a processor, comprising:

an acquisition module, configured to acquire the execution time of a plurality of program segments in a benchmark test program;

and an evaluation module, configured to invoke a simulation point tool and evaluate the performance of the processor according to the execution time of the plurality of program segments.

12. The apparatus of claim 11, wherein the evaluation module is to:

13. The apparatus of claim 12, wherein the invoking the simulation point tool and clustering the plurality of program segments based on the execution time of the plurality of program segments comprises:

14. The apparatus of claim 13, wherein the dividing the plurality of program segments into one or more program segment groups according to execution times of the plurality of program segments comprises:

15. The apparatus of claim 13, wherein the invoking the simulation point tool to separately cluster the program segments included in each of the one or more program segment groups comprises:

16. The apparatus of claim 12, wherein the invoking the simulation point tool and clustering the plurality of program segments based on the execution time of the plurality of program segments comprises:

17. The apparatus of claim 11, wherein the evaluation module is to:

18. The apparatus of claim 11, wherein the obtaining the execution time of the plurality of program segments in the benchmark program comprises:

19. The apparatus of claim 18, wherein each of the plurality of program segments includes instructions corresponding to the cache miss information and other instructions therein, wherein the determining the execution time of the plurality of program segments in the benchmark test program based on the cache miss information comprises:

20. The apparatus of claim 18, wherein the obtaining cache miss information for each of a plurality of program segments in the benchmark program comprises:

21. A computing device, comprising:

a memory for storing codes;

a processor for executing code stored in the memory to perform the method of any one of claims 1-10.

22. A computer readable storage medium having stored thereon code for performing the method of any of claims 1-10.