CN108846248A - A kind of application modeling and performance prediction method - Google Patents
- Publication number
- Publication number: CN108846248A (application number CN201810980603.0A)
- Authority
- CN
- China
- Prior art keywords
- cache
- memory access
- time overhead
- application
- instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
Abstract
The present invention provides an application modeling and performance prediction method. The application modeling method includes: obtaining computation instructions and memory-access instructions from the instructions produced by compiling the application, modeling the execution of the computation instructions and memory-access instructions according to the architectural features of the machine that runs the application, and obtaining the time overhead of the computation instructions and memory-access instructions; modeling the regular and/or irregular memory accesses of the application in the memory-access stage according to the architectural features, and obtaining the time overhead of the regular and/or irregular memory accesses; and calculating the time overhead of the memory-access stage of the application. The present invention can predict application performance accurately and efficiently, helping application developers find application bottlenecks and adopt corresponding optimization schemes.
Description
Technical field
The present invention relates to the field of application optimization, and more particularly to an application modeling method and a method of predicting application performance.
Background technique
With the progress of semiconductor technology, processors with a multi-level cache (Multi-Level Cache) hierarchy have become the mainstream; the addition of cache components in the processor satisfies an application's demand for memory-access locality. However, because the design of current processors is increasingly complex and the sizes and levels of their caches differ, the same application runs at different speeds on machines with different architectures. How to accurately predict the performance of an application on machines of different architectures, and to optimize the application according to the predicted performance, is currently a research hotspot.
At present, some models can be used to predict the performance of an application. For example, the Roofline model describes the relationship between an application's compute-to-memory-access ratio and bandwidth, and predicts the peak performance of the application. However, the Roofline model does not take the cache hierarchy into account, so its prediction of application performance is not accurate enough. The ECM (Execution-Cache-Memory) model proposed by Holger et al. divides the operation of an application into two stages, in-core and out-of-core, reflecting the computation inside the core and the transfers between memory levels. However, the ECM model does not distinguish between the cache levels (for example, L1 Cache through L3 Cache): it treats the number of misses (Cache Misses) at every cache level as identical, so for applications with data reuse, or applications with a small data scale, the ECM model cannot accurately predict performance.
Summary of the invention
To solve the above problems of the prior art, according to one embodiment of the present invention, an application modeling method is provided, including: obtaining computation instructions and memory-access instructions from the instructions produced by compiling the application, and modeling the execution of the computation instructions and memory-access instructions according to the architectural features of the machine that runs the application, to obtain the time overhead of the computation instructions and memory-access instructions; modeling the regular and/or irregular memory accesses of the application in the memory-access stage according to the architectural features, to obtain the time overhead of the regular and/or irregular memory accesses; and calculating the time overhead of the memory-access stage of the application.
In the above method, modeling the execution of the computation instructions and memory-access instructions according to the architectural features of the machine that runs the application, to obtain their time overhead, includes:
According to the architectural features of the machine that runs the application, simulating the execution of the computation instructions on the corresponding one or more execution ports and simulating the execution of the memory-access instructions on the corresponding one or more execution ports, and calculating the instruction-execution time of each execution port; selecting the longest instruction-execution time among the execution ports that execute computation instructions as the time overhead of the computation instructions; and selecting the longest instruction-execution time among the execution ports that execute memory-access instructions as the time overhead of the memory-access instructions.
In the above method, modeling the regular memory accesses of the application in the memory-access stage according to the architectural features, to obtain the time overhead of the regular memory accesses, includes:
Step 1) obtaining the prefetch strategy of each cache level according to the architectural features, and analyzing the application to obtain the amount of data involved in its regular memory accesses;
Step 2) calculating the miss count of each cache level based on the prefetch strategies of the cache levels and the data amount;
Step 3) calculating the data-transfer time overhead between cache levels and between main memory and the highest-level cache, according to the miss counts of the cache levels, the bandwidth between cache levels, the bandwidth between main memory and the highest-level cache, and the size of a Cache Line;
Step 4) adding the data-transfer time overheads between cache levels, plus the data-transfer time overhead between main memory and the highest-level cache, to obtain the time overhead of the regular memory accesses.
In step 3), the data-transfer time overhead between cache levels and between main memory and the highest-level cache is calculated according to the following formula:
Ti = Ni * Size(CL) / Bi
where Ti denotes the data-transfer time overhead between the level-i cache and the level-(i+1) cache or main memory, Ni denotes the miss count of the level-i cache, Bi denotes the bandwidth between the level-i cache and the level-(i+1) cache or main memory, Size(CL) denotes the size of a Cache Line, and i ≥ 1.
In the above method, modeling the irregular memory accesses of the application in the memory-access stage according to the architectural features, to obtain the time overhead of the irregular memory accesses, includes:
Step a) constructing a cache simulator according to the architectural features, and constructing the memory-access sequence of the application's irregular memory accesses, wherein the input of the cache simulator is a memory-access sequence and its output is the miss count of each cache level;
Step b) inputting the memory-access sequence of the application's irregular memory accesses into the cache simulator to obtain the miss count of each cache level;
Step c) calculating the data-transfer time overhead between cache levels and between main memory and the highest-level cache, according to the miss counts of the cache levels, the bandwidth between cache levels, the bandwidth between main memory and the highest-level cache, and the size of a Cache Line;
Step d) adding the data-transfer time overheads between cache levels, plus the data-transfer time overhead between main memory and the highest-level cache, to obtain the time overhead of the irregular memory accesses.
In the above method, calculating the time overhead of the memory-access stage of the application includes:
If the application has no irregular memory accesses in the memory-access stage, taking the time overhead of the regular memory accesses as the time overhead of the application's memory-access stage;
If the application has irregular memory accesses in the memory-access stage, taking the sum of the time overhead of the regular memory accesses and the time overhead of the irregular memory accesses as the time overhead of the application's memory-access stage.
According to one embodiment of the present invention, an application performance prediction method is also provided, including:
Applying the application to be predicted to the model established by the above method, to obtain the time overhead of the application's computation instructions, the time overhead of its memory-access instructions, and the time overhead of its memory-access stage;
Summing the time overhead of the memory-access instructions and the time overhead of the memory-access stage;
Comparing the sum with the time overhead of the computation instructions, and taking the larger of the two as the expected execution time of the application.
The present invention models the processing stage and the memory-access stage of an application separately, in combination with the architectural features of the machine (computer). The processing stage of the application is further divided into the time overhead of computation instructions and the time overhead of memory-access instructions; the memory-access stage is further divided into the time overhead of regular memory accesses and the time overhead of irregular memory accesses, while the cache levels are distinguished from one another. The present invention can predict application performance more accurately and efficiently, which helps application developers find application bottlenecks and adopt corresponding optimization schemes.
Detailed description of the invention
Embodiments of the present invention are further illustrated below with reference to the accompanying drawings.
Fig. 1 is a flowchart of an application modeling and performance prediction method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an application modeling method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a method for modeling the memory-access stage of an application according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the internal connections of the cache simulator, and of the connections among the caches, main memory and registers, according to an embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
The present invention divides the runtime of an application on a machine into a processing stage and a memory-access stage. The processing stage concerns the execution, in the processor core, of the instructions produced by compiling the application; the memory-access stage concerns the data transfers between cache levels, and between main memory and the highest-level cache, while the application runs. The present invention further divides the processing stage into the time overhead of computation instructions and the time overhead of memory-access instructions. The memory-access stage is divided, on the one hand, into the time overhead of regular memory accesses and the time overhead of irregular memory accesses and, on the other hand, into the data-transfer time overhead between cache levels and the data-transfer time overhead between main memory and the highest-level cache.
According to one embodiment of the present invention, an application modeling and performance prediction method is provided. Referring to Fig. 1, the method is divided into an application modeling phase and a performance prediction phase. The application modeling phase models the application in combination with the architectural features of the machine (that runs the application) and obtains each time overhead of the application; the performance prediction phase calculates the execution time of the application on the machine. The two phases are described in detail below with reference to the drawings.
1. Application modeling phase
The application modeling phase includes processing-stage modeling and memory-access-stage modeling, that is, modeling the execution of the (compiled) instructions in the processor core and modeling the data transfers at application runtime. Processing-stage modeling and memory-access-stage modeling are described in turn below with reference to Fig. 2.
Processing stage modeling
As described above, the processing stage concerns the execution, in the processor core, of the instructions produced by compiling the application. Processing-stage modeling models the execution of the application's compiled computation instructions and memory-access instructions and produces the time overhead of the two kinds of instructions. It includes the following steps:
1. Obtain the computation instructions and memory-access instructions from the instructions produced by compiling the application, detect the architectural features of the machine that will run the application, and simulate the execution of the computation instructions and memory-access instructions based on those architectural features. Specifically, according to the out-of-order scheduling mechanism and the pipeline mechanism of the instructions, simulate the execution of the computation instructions and memory-access instructions on the corresponding execution ports in the processor (it should be noted that the instruction set used by the processor includes but is not limited to X86, MIPS, ARM, etc., and the processor may have one core or multiple cores). There are multiple execution ports in the processor; each execution port is an execution unit in processing-stage modeling and is used to schedule and execute the corresponding instructions, so as to operate on the data in the L1 Cache (the level-1 cache, e.g., the L1 Dcache). Computation instructions execute on the one or more execution ports for executing computation instructions, and memory-access instructions execute on the one or more execution ports for executing memory-access instructions.
2. Calculate the instruction-execution time on each execution port. Among the instruction-execution times of the one or more execution ports for executing computation instructions, select the longest as the time overhead of the computation instructions; among the instruction-execution times of the one or more execution ports for executing memory-access instructions, select the longest as the time overhead of the memory-access instructions.
Memory-access-stage modeling
The memory-access stage concerns the data transfers between cache levels, and between main memory and the highest-level cache, while the application runs on the machine. According to one embodiment of the present invention, the memory-access stage is divided into regular accesses to data and irregular accesses to data, and memory-access-stage modeling involves modeling the regular memory accesses and modeling the irregular memory accesses, thereby obtaining the time overhead of the memory-access stage. Referring to Fig. 3, memory-access-stage modeling includes the following steps:
Step 310. Analyze the source code of the application to obtain the data that the memory-access instructions need to access, and judge from that data whether the application has an irregular memory-access pattern (irregular memory access for short) in the memory-access stage; if there are irregular memory accesses, go to step 320, otherwise go to step 350. If the data addresses to be accessed are contiguous or have a small stride, the accesses are regarded as regular memory accesses (also called contiguous memory accesses); if the data addresses to be accessed are random or have a large stride, for example a stride exceeding the size of a Cache Line (the minimal caching unit in each cache level), the accesses are regarded as irregular memory accesses (also called discontiguous memory accesses or random memory accesses).
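The stride criterion of step 310 can be sketched as follows. The per-pair stride comparison against one Cache Line follows the text; the 64-byte line size is only an example value.

```python
CACHE_LINE = 64  # bytes; an example Cache Line size, as mentioned in the text

def is_regular(addresses, line_size=CACHE_LINE):
    """Step 310's criterion: an access stream whose consecutive address
    strides all stay within one Cache Line is regular; a stream with larger
    or random strides is irregular."""
    strides = (abs(b - a) for a, b in zip(addresses, addresses[1:]))
    return all(s <= line_size for s in strides)

regular = is_regular([0, 8, 16, 24])         # unit-stride array walk
irregular = not is_regular([0, 4096, 128, 7])  # scattered, pointer-like walk
```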
Step 320. Construct a cache simulator according to the architectural features of the machine that will run the application; the input of the cache simulator is a memory-access sequence, and its output is the miss (Cache Miss) count of each cache level.
Specifically, detect the machine to obtain its architectural features, including the number of cache levels, the size of each cache level, the set-associativity strategy of the caches, the cache swap-in/swap-out (replacement) strategy, etc., and construct the cache simulator based on these features. First, build a cache hierarchy identical to that of the physical machine, set the size and the Cache Line blocks of each cache level, and number each Cache Line block. Second, construct the connections between the cache levels and the connection between main memory and the last-level (highest-level) cache, referring to Fig. 4. After the memory-access sequence is input into the cache simulator, the simulator looks up the corresponding data based on the sequence (if a lookup misses, the data is transferred level by level from main memory or a lower cache, such as the L2 Cache or L3 Cache, to the L1 Cache for the memory-access instruction to use via the registers), simulates the swap-in and swap-out of data at each cache level according to that level's replacement strategy, and obtains the miss count of each cache level from its swap-in/swap-out counts as the output.
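The miss-counting role of the simulator can be sketched as below. For brevity the sketch uses direct-mapped levels; the simulator described in the text additionally reproduces the machine's set associativity and replacement strategy, which are omitted here.

```python
class SimpleCacheLevel:
    """One direct-mapped cache level that records which Cache Line number is
    resident in each slot. Associativity and replacement policy, which the
    patent's simulator models, are deliberately left out of this sketch."""
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = {}      # slot index -> Cache Line number currently resident
        self.misses = 0

    def access(self, line_no):
        idx = line_no % self.num_lines
        if self.tags.get(idx) != line_no:
            self.misses += 1
            self.tags[idx] = line_no   # swap the missing line in
            return False               # miss: the next level is consulted
        return True                    # hit

def simulate(levels, line_sequence):
    """Feed a memory-access sequence (Cache Line numbers) through the
    hierarchy: a hit at one level stops the lookup, a miss falls through.
    Returns the per-level miss counts, the simulator's output."""
    for line_no in line_sequence:
        for level in levels:
            if level.access(line_no):
                break
    return [lvl.misses for lvl in levels]
```

For example, a tiny two-level hierarchy fed the sequence [0, 0, 1, 0] misses twice at each level (cold misses for lines 0 and 1) and hits in the first level otherwise.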
Step 330. Construct the memory-access sequence of the application's irregular memory accesses.
Analyze the application to obtain the addresses of the data accessed by the irregular memory accesses, and construct the memory-access sequence of the irregular memory accesses based on the size of a Cache Line. The memory-access sequence is the address sequence of the data accessed by the irregular memory accesses; each of its elements corresponds to the number of one Cache Line block.
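The mapping from byte addresses to Cache Line block numbers in step 330 can be sketched as follows; the 64-byte line size is again only an example.

```python
CACHE_LINE = 64  # bytes; an example Cache Line size

def to_line_sequence(byte_addresses, line_size=CACHE_LINE):
    """Map each accessed byte address to its Cache Line block number,
    producing the memory-access sequence that the cache simulator consumes."""
    return [addr // line_size for addr in byte_addresses]

seq = to_line_sequence([0, 63, 64, 200])  # addresses 0 and 63 share line 0
```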
Step 340. Model the irregular memory accesses: input the memory-access sequence of the irregular memory accesses into the cache simulator, output the miss count of each cache level corresponding to the irregular memory accesses, and then go to step 350.
Step 350. Model the regular memory accesses to obtain the miss count of each cache level corresponding to the regular memory accesses, then go to step 360.
Specifically, obtain the prefetch strategy of each cache level according to the architectural features of the machine that will run the application, analyze the application to obtain the amount of data accessed by its regular memory accesses, and calculate the miss count of each cache level based on the prefetch strategies of the cache levels and the data amount involved in the regular memory accesses.
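How a prefetch strategy maps a regular access volume to a miss count is machine-specific and the text does not fix a formula; a minimal sketch, under the assumption that a streaming access either misses once per Cache Line touched or is fully covered by the prefetcher, is:

```python
import math

CACHE_LINE = 64  # bytes; an example Cache Line size

def regular_miss_count(data_bytes, line_size=CACHE_LINE, prefetch_covers=False):
    """For a regular (streaming) access over data_bytes bytes, a level whose
    prefetcher does not cover the stream misses once per Cache Line touched;
    a level whose prefetcher fully covers it ideally misses not at all.
    Whether prefetch covers the stream is an assumption supplied per level."""
    if prefetch_covers:
        return 0
    return math.ceil(data_bytes / line_size)

n_no_prefetch = regular_miss_count(1000)                        # 16 lines touched
n_prefetched  = regular_miss_count(1000, prefetch_covers=True)  # 0
```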
Step 360. Calculate the time overhead of the memory-access stage, which includes the time overhead of the regular memory accesses and, if there are irregular memory accesses, also the time overhead of the irregular memory accesses. Calculating the time overhead of the regular memory accesses includes:
1. Calculate the data-transfer time overhead between cache levels and between main memory and the highest-level cache, according to the miss counts of the cache levels corresponding to the regular memory accesses, the bandwidth between cache levels, the bandwidth between main memory and the highest-level cache, and the Cache Line size of the caches. The calculation formula is as follows:
Ti = Ni * Size(CL) / Bi    (1)
where Ti denotes the data-transfer time overhead between the level-i cache and the level-(i+1) cache (or main memory, if there is no level-(i+1) cache), Ni denotes the miss count of the level-i cache, Bi denotes the bandwidth between the level-i cache and the level-(i+1) cache (or main memory, if there is no level-(i+1) cache), Size(CL) denotes the size of the Cache Line of the caches, which may for example be 64 bytes, and i ≥ 1.
2. Add the data-transfer time overheads between cache levels, plus the data-transfer time overhead between main memory and the highest-level cache, to obtain the time overhead of the regular memory accesses.
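Formula (1) and the summation in step 2 can be sketched as follows. The miss counts and bandwidths are hypothetical; the bandwidth unit (bytes per cycle here) only needs to be consistent with the desired time unit.

```python
def transfer_overhead(miss_counts, bandwidths, line_size=64):
    """Apply Ti = Ni * Size(CL) / Bi for each level i and sum the results,
    as in formula (1). miss_counts[i] is Ni for the level-(i+1) cache;
    bandwidths[i] is Bi, the bandwidth to the next level (or to main memory
    for the last entry)."""
    assert len(miss_counts) == len(bandwidths)
    return sum(n * line_size / b for n, b in zip(miss_counts, bandwidths))

# Hypothetical three-level hierarchy: L1/L2/L3 miss counts and the bandwidths
# of the L1-L2, L2-L3 and L3-memory links, in bytes per cycle.
t_regular = transfer_overhead([1000, 400, 50], [32.0, 16.0, 8.0])
```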
Calculating the time overhead of the irregular memory accesses is similar to calculating the time overhead of the regular memory accesses.
If the memory-access stage of the application includes only regular memory accesses, the time overhead of the regular memory accesses is taken as the time overhead of the memory-access stage; if the memory-access stage includes both regular and irregular memory accesses, the time overhead of the memory-access stage is the sum of the time overhead of the regular memory accesses and the time overhead of the irregular memory accesses.
2. Performance prediction phase
In this phase, the execution time of the application on the machine is calculated from the time overheads obtained in the application modeling phase:
1. Sum the time overhead of the application's memory-access instructions and the time overhead of its memory-access stage;
2. Compare the sum from step 1 with the time overhead of the computation instructions, and take the larger of the two as the expected execution time of the application.
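The two prediction steps can be sketched as follows; taking the maximum reflects the assumption, implicit in the comparison above, that the computation stream and the memory stream overlap so that the slower of the two bounds the execution time.

```python
def predicted_time(t_compute, t_access_insn, t_mem_stage):
    """Step 1: sum the memory-access-instruction overhead and the
    memory-access-stage overhead. Step 2: the larger of that sum and the
    computation-instruction overhead is the expected execution time."""
    return max(t_compute, t_access_insn + t_mem_stage)

compute_bound = predicted_time(500, 100, 300)  # computation dominates
memory_bound  = predicted_time(200, 100, 300)  # memory traffic dominates
```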
The present invention models the processing stage and the memory-access stage of an application separately. Compared with the Roofline model, the present invention does not merely predict an upper performance bound for the application from peak values, but gives an accurate performance expectation for the machine architecture. Compared with the ECM model, the present invention separates the cache levels that ECM leaves undifferentiated, realizing a per-level prediction of the miss counts of the multi-level cache. In addition, the present invention divides the memory-access stage of the application into regular and irregular memory accesses, and constructs a cache simulator to simulate the data transfers of the irregular memory accesses, so it can predict the data-transfer time overhead more accurately and efficiently.
It should be noted that, where efficiency is not a concern, the regular memory accesses may also be modeled using the cache simulator.
It should be noted that some of the exemplary methods are depicted as flowcharts. Although a flowchart expresses the operations as a sequence, many of the operations can be executed in parallel or concurrently, and the order of the operations can be rearranged. Processing may terminate when the operations are completed, but there may also be additional steps not included in the figures or the embodiments.
It should be understood that the exemplary embodiments realized in software are usually encoded on some form of program storage medium or realized over some type of transmission medium. The program storage medium can be an arbitrary non-transitory storage medium, such as a magnetic disk (for example, a floppy disk or hard disk) or an optical disk (for example, a compact disk read-only memory, or "CD-ROM"), and can be read-only or random-access. Similarly, the transmission medium can be twisted pair, coaxial cable, optical fiber, or some other applicable transmission medium known in the art.
Although the present invention has been described by means of preferred embodiments, the present invention is not limited to the embodiments described here; it also includes various changes and variations made without departing from the present invention.
Claims (8)
1. An application modeling method, including:
obtaining computation instructions and memory-access instructions from the instructions produced by compiling the application, and modeling the execution of the computation instructions and memory-access instructions according to the architectural features of the machine that runs the application, to obtain the time overhead of the computation instructions and memory-access instructions; and
modeling the regular and/or irregular memory accesses of the application in the memory-access stage according to the architectural features, to obtain the time overhead of the regular and/or irregular memory accesses; and calculating the time overhead of the memory-access stage of the application.
2. The method according to claim 1, wherein modeling the execution of the computation instructions and memory-access instructions according to the architectural features of the machine that runs the application, to obtain their time overhead, includes:
according to the architectural features of the machine that runs the application, simulating the execution of the computation instructions on the corresponding one or more execution ports and simulating the execution of the memory-access instructions on the corresponding one or more execution ports, and calculating the instruction-execution time of each execution port; and
selecting the longest instruction-execution time among the one or more execution ports that execute computation instructions as the time overhead of the computation instructions, and selecting the longest instruction-execution time among the one or more execution ports that execute memory-access instructions as the time overhead of the memory-access instructions.
3. The method according to claim 1, wherein modeling the regular memory accesses of the application in the memory-access stage according to the architectural features, to obtain the time overhead of the regular memory accesses, includes:
step 1) obtaining the prefetch strategy of each cache level according to the architectural features, and analyzing the application to obtain the amount of data involved in its regular memory accesses;
step 2) calculating the miss count of each cache level based on the prefetch strategies of the cache levels and the data amount;
step 3) calculating the data-transfer time overhead between cache levels and between main memory and the highest-level cache, according to the miss counts of the cache levels, the bandwidth between cache levels, the bandwidth between main memory and the highest-level cache, and the size of a Cache Line;
step 4) adding the data-transfer time overheads between cache levels, plus the data-transfer time overhead between main memory and the highest-level cache, to obtain the time overhead of the regular memory accesses.
4. The method according to claim 3, wherein in step 3) the data-transfer time overhead between cache levels and between main memory and the highest-level cache is calculated according to the following formula:
Ti = Ni * Size(CL) / Bi
where Ti denotes the data-transfer time overhead between the level-i cache and the level-(i+1) cache or main memory, Ni denotes the miss count of the level-i cache, Bi denotes the bandwidth between the level-i cache and the level-(i+1) cache or main memory, Size(CL) denotes the size of a Cache Line, and i ≥ 1.
5. The method according to claim 1, wherein modeling the irregular memory accesses of the application in the memory-access stage according to the architectural features, to obtain the time overhead of the irregular memory accesses, includes:
step a) constructing a cache simulator according to the architectural features, and constructing the memory-access sequence of the application's irregular memory accesses, wherein the input of the cache simulator is a memory-access sequence and its output is the miss count of each cache level;
step b) inputting the memory-access sequence of the application's irregular memory accesses into the cache simulator to obtain the miss count of each cache level;
step c) calculating the data-transfer time overhead between cache levels and between main memory and the highest-level cache, according to the miss counts of the cache levels, the bandwidth between cache levels, the bandwidth between main memory and the highest-level cache, and the size of a Cache Line;
step d) adding the data-transfer time overheads between cache levels, plus the data-transfer time overhead between main memory and the highest-level cache, to obtain the time overhead of the irregular memory accesses.
6. The method according to claim 1, wherein calculating the time overhead of the memory-access stage of the application includes:
if the application has no irregular memory accesses in the memory-access stage, taking the time overhead of the regular memory accesses as the time overhead of the application's memory-access stage; and
if the application has irregular memory accesses in the memory-access stage, taking the sum of the time overhead of the regular memory accesses and the time overhead of the irregular memory accesses as the time overhead of the application's memory-access stage.
7. An application performance prediction method, including:
applying the application to be predicted to the model established by the method of any one of claims 1-6, to obtain the time overhead of the application's computation instructions, the time overhead of its memory-access instructions, and the time overhead of its memory-access stage;
summing the time overhead of the memory-access instructions and the time overhead of the memory-access stage; and
comparing the sum with the time overhead of the computation instructions, and taking the larger of the two as the expected execution time of the application.
8. A computing device, including a processor and a memory, the memory storing instructions that, when executed by the processor, cause the computing device to execute the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810980603.0A CN108846248B (en) | 2018-08-27 | 2018-08-27 | Application modeling and performance prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108846248A true CN108846248A (en) | 2018-11-20 |
CN108846248B CN108846248B (en) | 2020-07-31 |
Family
ID=64188608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810980603.0A Active CN108846248B (en) | 2018-08-27 | 2018-08-27 | Application modeling and performance prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108846248B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060179429A1 (en) * | 2004-01-22 | 2006-08-10 | University Of Washington | Building a wavecache |
CN103605833A (en) * | 2013-10-30 | 2014-02-26 | 华为数字技术(苏州)有限公司 | Method and device for simulating performance of storage array system |
- 2018-08-27: CN application CN201810980603.0A granted as CN108846248B (Active)
Non-Patent Citations (3)
Title |
---|
RAIMUND KIRNER; PETER PUSCHNER: "Time-Predictable Task Preemption for Real-Time Systems with Direct-Mapped Instruction Cache", 10th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing (ISORC'07) * |
FANG ZHENMAN: "Research on Optimization and Evaluation of Multi-core Cache Systems", China Doctoral Dissertations Full-text Database (Electronic Journal), Information Science & Technology * |
MA KE: "Construction and Study of a Microprocessor Performance Analysis Model", China Doctoral Dissertations Full-text Database, Information Science & Technology * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114265593A (en) * | 2021-12-09 | 2022-04-01 | 北京奕斯伟计算技术有限公司 | Instruction scheduling method, device, equipment and computer readable storage medium |
CN114265593B (en) * | 2021-12-09 | 2022-11-22 | 北京奕斯伟计算技术股份有限公司 | Instruction scheduling method, device, equipment and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108846248B (en) | 2020-07-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11663125B2 (en) | Cache configuration performance estimation | |
Ascia et al. | Efficient design space exploration for application specific systems-on-a-chip | |
JP2022511491A (en) | Generation of integrated circuit floor plans using neural networks | |
CN112352219B (en) | System and method for automated compilation | |
CN108122032A (en) | A kind of neural network model training method, device, chip and system | |
CN116702850A (en) | Method, system, article of manufacture, and apparatus for mapping workloads | |
JP4791959B2 (en) | Block modeling I / O buffer | |
CN114172820B (en) | Cross-domain SFC dynamic deployment method, device, computer equipment and storage medium | |
CN112771554A (en) | Predictive variables in programming | |
CN112764893B (en) | Data processing method and data processing system | |
JP3608915B2 (en) | Multiprocessing system performance evaluation method and apparatus, and storage medium storing multiprocessing system performance evaluation program | |
CN109067583A (en) | A kind of resource prediction method and system based on edge calculations | |
Sohrabizadeh et al. | Automated accelerator optimization aided by graph neural networks | |
CN108804391A (en) | A kind of building method and system of interpolation curve or curved surface based on B-spline | |
EP3805995A1 (en) | Method of and apparatus for processing data of a deep neural network | |
CN108846248A (en) | A kind of application modeling and performance prediction method | |
Franssen et al. | Control flow optimization for fast system simulation and storage minimization [real-time multidimensional signal processing] | |
Zhou et al. | Makespan–cost–reliability-optimized workflow scheduling using evolutionary techniques in clouds | |
CN110109702B (en) | Android computing migration online decision-making method based on code analysis | |
CN105787265A (en) | Atomic spinning top random error modeling method based on comprehensive integration weighting method | |
Cicirelli et al. | Analyzing stochastic reward nets by model checking and parallel simulation | |
Jünger et al. | Amaix: a generic analytical model for deep learning accelerators | |
Hu et al. | Optimizing resource allocation for data-parallel jobs via gcn-based prediction | |
CN114995818A (en) | Method for automatically configuring optimized parameters from Simulink model to C language | |
Pan et al. | Asynchronous value iteration network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||