CN103049310A

CN103049310A - Multi-core simulation parallel accelerating method based on sampling

Info

Publication number: CN103049310A
Application number: CN2012105895076A
Authority: CN
Inventors: 喻之斌; 须成忠; 姜春涛
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2012-12-29
Filing date: 2012-12-29
Publication date: 2013-04-17
Anticipated expiration: 2032-12-29
Also published as: CN103049310B

Abstract

The invention provides a sampling-based parallel acceleration method for multi-core simulation. The method comprises: S1, selecting a multithreaded application as the multi-core benchmark; S2, applying a sampling strategy to the selected multithreaded application to obtain instruction stream sample fragments for each thread; S3, running each thread's instruction stream sample fragments from S2 through the dynamic code analysis module of a simulator, which splits them into multiple discrete segments at split points; S4, grouping the discrete segments according to the type of split point at which they were split; S5, running the grouped discrete segments from S4 in the corresponding segment simulation modules to obtain the simulation time each discrete segment requires; and S6, summing the simulation times output by all segment simulation modules in S5 to obtain the total simulated execution time of the multithreaded application from S1. The method markedly increases simulation speed and shortens the evaluation cycle.

Description

A sampling-based parallel acceleration method for multi-core simulation

Technical field

The present invention relates to the field of computer architecture simulation, and in particular to a sampling-based parallel acceleration method for multi-core simulation.

Background art

With the rapid development of systems-on-chip (SoC), the spread of multi-core processors, and the emergence of many-core processors, more and more components are integrated on a single chip. Quickly finding the optimal design among an exponentially growing number of candidate designs has gradually become the key to designing such systems. For example, how to design the topology interconnecting hundreds or thousands of processing units and storage units, how to organize the memory hierarchy, and how to determine the storage capacity required at each level: these numerous design parameters form a huge design space, and quickly locating the optimal design within it has become a major challenge for such systems.

Microarchitecture simulation is a key technique for early-stage performance evaluation in the design of new-generation processor architectures. It models hardware designs in software: typically a simulation framework such as a simulator is developed, the function of each hardware component is modeled in the simulator, and candidate designs are evaluated by running benchmark programs on it. The simulation frameworks used for microarchitecture evaluation are generally single-threaded simulators. A single-threaded simulation framework works well for evaluating single-core processors running single-threaded benchmarks. However, as applications grow more complex, multithreaded programs become widespread, and multi-core and many-core systems emerge, single-threaded simulation frameworks cannot keep up with the evaluation of such systems. Their main limitations are as follows:

To reflect application behavior accurately, the benchmarks used to evaluate microarchitecture performance must change as applications change. For example, as multithreaded applications have become widespread and applications more complex and diverse, benchmarks have had to follow; Princeton University released PARSEC, a new-generation multithreaded benchmark suite, in 2008. Faced with these changes, traditional single-threaded simulation frameworks struggle to run large multithreaded benchmarks. Simulating a single multithreaded program often takes several weeks or even months, and such a long evaluation cycle is unworkable for evaluating many-core systems designed for multithreaded applications.

With the spread of multi-core processors and the emergence of many-core systems, the parallel processing capability of computer hardware keeps improving. Traditional single-threaded simulation frameworks lack parallelism and scalability: they cannot run as multiple threads on multi-core hardware, so they make poor use of these resources and waste hardware capacity. Most traditional simulation frameworks are cycle-accurate, that is, they accurately simulate the operations the system completes in every clock cycle. The advantage of cycle-accurate simulation is that it models every detail of the system, gives higher evaluation accuracy, and feeds back more useful information. However, pursuing this accuracy greatly reduces simulation speed, a drawback that is especially pronounced when evaluating many-core designs.

Applying a sampling strategy is an important way to accelerate microarchitecture simulation. The benchmark's instruction stream is sampled to obtain instruction stream sample fragments that reflect the behavior of the whole program, and the microarchitecture is evaluated by simulating only these representative fragments. In essence, sampling accelerates simulation by reducing the number of instructions that must be simulated. Several traditional sampling strategies have been applied to microarchitecture simulation. For example, systematic sampling takes instruction stream fragments at equal intervals from the application's instruction stream as samples, and the SMARTSim simulation framework uses it to accelerate simulation. Representative sampling selects fragments that are representative of the whole program's behavior; it requires static analysis of the application's behavior, and the SimPoint simulation framework uses it to accelerate simulation. Other strategies include random sampling, two-stage sampling, and stratified sampling. All of these strategies share a common drawback: their parameters, such as the size of each sample fragment, must be determined by static analysis or trial simulation, both of which are time-consuming. Moreover, whenever the simulated microarchitecture changes, the static analysis or trial simulation must be repeated, and this duplicated work hinders simulation speedup.

Summary of the invention

The technical problem to be solved by the present invention is to provide a sampling-based parallel acceleration method for multi-core simulation that significantly shortens the simulation evaluation cycle, preserves evaluation accuracy, and makes full use of multi-core hardware resources.

To achieve the above object, the invention provides the following technical solution:

A sampling-based parallel acceleration method for multi-core simulation, comprising:

S1: selecting a multithreaded application as the multi-core benchmark;

S2: applying a sampling strategy to the multithreaded application selected in S1 to obtain instruction stream sample fragments for each thread;

S3: running each thread's instruction stream sample fragments obtained in S2 through the dynamic code analysis module of a simulator, and splitting each thread's instruction stream sample fragments into a plurality of discrete segments according to the different split points;

S4: grouping the discrete segments from S3 according to the split point at which each was split;

S5: running the discrete segments grouped in S4 in the corresponding segment simulation modules to obtain the simulation time each discrete segment requires;

S6: summing the simulation times output by all segment simulation modules in S5 to obtain the total simulated execution time of the multithreaded application from S1.

Further, the sampling strategy in S2 comprises:

dividing each thread's instruction stream into equal parts, and selecting some of the resulting instruction stream fragments as preliminary instruction stream sample fragments.

Further, the sampling strategy in S2 also comprises:

dividing each preliminary instruction stream sample fragment into three parts, removing the middle part, and keeping the two outer parts; dividing each kept part into three parts again, removing its middle part, and keeping its two outer parts; and so on. After K iterations, 2^K instruction stream fragments are obtained; these 2^K fragments are the instruction stream sample fragments, where K is a natural number greater than 1.

Further, the equal division produces M parts; the resulting instruction stream fragments are numbered in order as 0, 1, 2, ..., M-1, M; the even-numbered fragments are discarded and the odd-numbered fragments are taken as the preliminary instruction stream sample fragments, where M is a natural number greater than 1.

Further, the split points in S3 and S4 are miss events.

Further, the dynamic code analysis module in S3 comprises:

an instruction branch prediction module, which analyzes the instruction stream sample fragment to determine whether a branch misprediction event occurs, and splits the instruction stream sample fragment at that event;

a memory hierarchy prediction module, which analyzes the instruction stream sample fragment to determine whether a memory hierarchy miss event occurs, and splits the instruction stream sample fragment at that event;

a cache coherence prediction module, which analyzes the instruction stream sample fragment to determine whether a cache coherence miss event occurs, and splits the instruction stream sample fragment at that event;

an interconnection network prediction module, which analyzes the instruction stream sample fragment to determine whether an interconnection network miss event occurs, and splits the instruction stream sample fragment at that event.

Further, the segment simulation modules in S5 and S6 comprise: a processor simulation module, an interconnection network simulation module, and a memory hierarchy simulation module.

The sampling-based parallel acceleration method for multi-core simulation provided by the invention combines a multithreaded parallel simulation strategy, a fractal sampling strategy based on an improved Cantor set, and a phase-based segmented simulation strategy, achieving multiple layers of acceleration in the simulation process. Compared with traditional single-threaded and cycle-accurate simulation strategies, the invention significantly increases simulation speed and is highly efficient, and it is well suited to optimizing current many-core system designs. Furthermore, the configuration of each segment simulation module provided by the invention, namely the processor simulation module, the memory hierarchy simulation module, and the interconnection network simulation module, can be changed flexibly as needed, which suits the need to explore a large number of candidate designs during many-core system design.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a sampling-based parallel acceleration method for multi-core simulation provided by an embodiment of the invention.

Fig. 2 is a flowchart of the sampling strategy provided by an embodiment of the invention.

Detailed description of the embodiments

To make the purpose, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions are described clearly and completely below with reference to the embodiments and the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

Embodiment

As shown in Fig. 1, the sampling-based parallel acceleration method for multi-core simulation provided by this embodiment comprises:

S1: selecting a multithreaded application as the multi-core benchmark;

To evaluate a many-core design well, it must be stress-tested with multithreaded applications that scale well and are representative. Authoritative and widely used multi-core benchmarks include SPLASH-2 and PARSEC. In general, the larger the benchmark's input set the better, since a larger input set stresses the many-core system and tests its scalability more thoroughly.

This embodiment uses a multithreaded simulation framework and applies sampling, an acceleration strategy common in single-threaded simulation frameworks, to it. Each thread of the multi-core benchmark is sampled to reduce the number of instructions simulated per thread, which reduces the total number of instructions to be simulated. After screening and comparing candidates, this embodiment samples each thread's instruction stream with a fractal sampling strategy based on the Cantor set.

Note that the Cantor set is a deterministic fractal. The sampling process is described in detail below. Each thread's instruction stream is initially treated as a single instruction stream fragment. The Cantor set sampling strategy divides each fragment to be sampled into three parts, removes the middle part, and keeps the two outer parts as samples, and then recurses. For example, the first sampling step yields two instruction stream sample fragments at the head and tail of the whole fragment; applying the same step to these two yields four sample fragments, and a third step yields eight. After k sampling steps, 2^k instruction stream sample fragments are produced.
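The recursive keep-the-outer-thirds step can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `cantor_sample`, the half-open interval representation, and the use of exact fractions (to avoid rounding when a fragment length is not divisible by three) are all assumptions made for the example.

```python
from fractions import Fraction


def cantor_sample(start, end, k):
    """Return the 2**k intervals left after k Cantor-set steps on [start, end).

    Each interval is a (lo, hi) pair of instruction offsets. Exact fractions
    are an assumption of this sketch; a real simulator would round to whole
    instructions.
    """
    intervals = [(Fraction(start), Fraction(end))]
    for _ in range(k):
        next_intervals = []
        for lo, hi in intervals:
            third = (hi - lo) / 3
            # Keep the first and last thirds, drop the middle third.
            next_intervals.append((lo, lo + third))
            next_intervals.append((hi - third, hi))
        intervals = next_intervals
    return intervals


# Three steps over a 27-instruction fragment leave 8 one-instruction samples.
samples = cantor_sample(0, 27, 3)
print(len(samples), samples[0], samples[-1])
```

Each step doubles the number of fragments and keeps two thirds of the remaining instructions, so the sample count grows as 2^k while the simulated share shrinks as (2/3)^k.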

This embodiment adopts the Cantor set fractal sampling strategy because it yields sample fragments that better reflect the behavior of the whole program. Analysis shows that most applications exhibit phase behavior, but the number of instructions in each phase varies widely. The Cantor set fractal sampling strategy is a non-uniform sampling strategy and adapts well to this phase behavior. Compared with systematic sampling, random sampling, two-stage sampling, and the like, it not only determines the sampling parameters and obtains the sample fragments quickly, but also yields more representative sample fragments.

To capture application behavior better, the invention optimizes the Cantor set fractal sampling strategy for the characteristics of multithreaded applications. Studies show that the middle of a multithreaded application usually carries more useful information than its head and tail, and researchers pay more attention to simulation results for the middle code. If Cantor set sampling were applied from the very first step, the middle part of the program would be removed, causing a larger simulation error. To retain more useful program information, this embodiment samples each thread's instruction stream in two steps, as shown in Fig. 2:

Step 1: Divide each thread's instruction stream into equal parts and select some of the resulting fragments as preliminary instruction stream sample fragments. Specifically, the stream is divided into M parts (M is a natural number greater than 1), the resulting fragments are numbered in order as 0, 1, 2, ..., M-1, M, the even-numbered fragments are discarded, and the odd-numbered fragments are kept as the preliminary sample fragments. In this embodiment M is preferably 12: each thread's instruction stream is divided into 12 equal parts, numbered in order as 0, 1, 2, ..., 11, 12; the even-numbered fragments are discarded and the odd-numbered fragments are kept as the preliminary instruction stream sample fragments.
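The first step can be illustrated as follows. This sketch makes two assumptions that the text does not settle: the M fragments are labeled 0 through M-1 (the text lists labels through M, which would give M+1 labels for M parts), and any remainder left when the instruction count is not divisible by M is ignored. The function name `preliminary_samples` is hypothetical.

```python
def preliminary_samples(num_instructions, m=12):
    """Split [0, num_instructions) into m equal fragments; keep odd-numbered ones.

    Assumptions of this sketch: fragments are numbered 0..m-1, and any
    remainder after integer division is dropped.
    """
    if m <= 1:
        raise ValueError("m must be a natural number greater than 1")
    size = num_instructions // m
    fragments = [(i * size, (i + 1) * size) for i in range(m)]
    # Discard even-numbered fragments, keep odd-numbered ones.
    return [frag for i, frag in enumerate(fragments) if i % 2 == 1]


# With the preferred M = 12, a 1200-instruction stream keeps 6 fragments.
kept = preliminary_samples(1200, m=12)
print(kept)
```

Under these assumptions the kept fragments are spread evenly across the stream, so the middle of the program stays represented before any Cantor step is applied.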

Step 2: Apply the Cantor set sampling strategy to the preliminary instruction stream sample fragments. Each preliminary fragment is divided into three parts; the middle part is removed and the two outer parts are kept. Each kept part is again divided into three, its middle part removed and its two outer parts kept, and so on. After K iterations (K is a natural number greater than 1), 2^K instruction stream fragments are obtained; these 2^K fragments are the instruction stream sample fragments.
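To get a feel for how much the two steps reduce the workload, the retained share of instructions can be computed directly. The sketch below assumes that fragments are numbered 0 through M-1 in the first step and that the K Cantor iterations are applied independently to every preliminary fragment; neither detail is stated explicitly in the text, and the function name `sample_plan` is hypothetical.

```python
from fractions import Fraction


def sample_plan(m, k):
    """Return (number of final sample fragments, share of instructions kept).

    Assumptions of this sketch: step 1 keeps the odd-numbered fragments among
    labels 0..m-1, and step 2 runs k Cantor iterations on each kept fragment.
    """
    kept = len([i for i in range(m) if i % 2 == 1])
    fragments = kept * 2 ** k                      # each fragment yields 2**k pieces
    share = Fraction(kept, m) * Fraction(2, 3) ** k  # each iteration keeps 2/3
    return fragments, share


fragments, share = sample_plan(m=12, k=3)
print(fragments, share, float(share))
```

For M = 12 and K = 3 this gives 48 sample fragments covering 4/27 of each thread's instructions under the stated assumptions; a larger K trades accuracy for fewer simulated instructions.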

The improved Cantor set sampling strategy samples efficiently and keeps sample fragments that reflect the behavior of the whole program, so it is accurate. It is also microarchitecture-independent: it does not rely on any particular microarchitecture, so no resampling is needed when the microarchitecture changes, which saves overhead and improves efficiency. Until now, Cantor set fractal sampling has generally been applied in single-threaded simulation frameworks, where it gives a clear speedup; this embodiment applies it to a multithreaded parallel simulation framework for the first time and greatly reduces simulation time. The strategy quickly determines the parameters needed for sampling and therefore quickly selects the instruction stream sample fragments, and the selected fragments preserve the program's overall behavior well. With little loss of program information, it greatly reduces the number of instructions to be simulated and significantly speeds up simulation-based evaluation.

To further increase simulation speed, this embodiment adopts a phase-based segmented simulation strategy, which simulates the instruction stream fragments at a higher level of abstraction and abandons the slower cycle-accurate simulation strategy. The phase-based segmented simulation strategy is a fast, accurate, and easily implemented multi-core simulation strategy: with little loss of simulation accuracy, it both increases simulation speed and lowers the difficulty of developing the simulation framework. Unlike cycle-accurate simulation, it does not model details by tracking the actual execution of instructions through each simulated core's pipeline.

A key step of the phase-based segmented simulation strategy is splitting the whole instruction stream into several instruction stream fragments. The splitting is based on "events": each occurrence of a miss event serves as a split point, dividing the whole instruction stream into phase fragments. The events include but are not limited to: cache misses at each level, branch mispredictions, and Load instruction reads. The dynamic code analysis module is one of the core modules of this strategy. Preferably, the dynamic code analysis module mainly comprises: an instruction branch prediction module, which analyzes the instruction stream sample fragment to determine whether a branch misprediction event occurs and splits the fragment at that event; a memory hierarchy prediction module, which analyzes the instruction stream sample fragment to determine whether a memory hierarchy miss event occurs and splits the fragment at that event; a cache coherence prediction module, which analyzes the instruction stream sample fragment to determine whether a cache coherence miss event occurs and splits the fragment at that event; and an interconnection network prediction module, which analyzes the instruction stream sample fragment to determine whether an interconnection network miss event occurs and splits the fragment at that event.

Working together, the instruction branch prediction module, memory hierarchy prediction module, cache coherence prediction module, and interconnection network prediction module determine whether each miss event occurs in the instruction stream, and the whole instruction stream sample fragment is split at these miss events, yielding the discrete segments the system user needs.
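Splitting a sample fragment at miss (prediction failure) events can be sketched as a single pass over the instruction stream. Everything named here is hypothetical: the event labels, the `Segment` record, and the choice to end a segment at the instruction that triggers the event are assumptions for illustration, and the input is assumed to be already annotated by the four prediction modules.

```python
from dataclasses import dataclass

# Hypothetical event labels, one per prediction module.
MISS_EVENTS = {"branch_mispredict", "cache_miss", "coherence_miss", "network"}


@dataclass
class Segment:
    event: str          # miss event that closed the segment, or "none"
    instructions: list  # opcodes contained in the segment


def split_at_events(stream):
    """Split a stream of (opcode, event) pairs at every miss event.

    event is None when the instruction triggers no miss event. In this
    sketch the triggering instruction is the last one of its segment.
    """
    segments, current = [], []
    for opcode, event in stream:
        current.append(opcode)
        if event in MISS_EVENTS:
            segments.append(Segment(event, current))
            current = []
    if current:
        # Trailing instructions with no miss event form an event-free segment.
        segments.append(Segment("none", current))
    return segments


stream = [("add", None), ("beq", "branch_mispredict"),
          ("ld", "cache_miss"), ("mul", None), ("sub", None)]
for seg in split_at_events(stream):
    print(seg.event, seg.instructions)
```

The result is a list of discrete segments, each tagged with the event type that produced its split point, which is the information the later grouping stage needs.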

The rationale is that experimental analysis shows applications exhibit phase behavior: if an application is viewed as one long instruction stream, that stream can be seen as composed of several smaller discrete instruction stream fragments. The instructions within each discrete fragment have similar behavior, and multiple fragments may share the same runtime characteristics, such as simulation time. A split instruction stream sample fragment consists of many such discrete segments.

The discrete segments are divided into groups according to the split-point event at which each instruction stream sample fragment was split. Segments with similar program behavior are placed in the same group and are fed to the same segment simulation module in the next simulation stage.
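The grouping stage amounts to bucketing segments by the event that split them and routing each bucket to one simulation module. The sketch below is illustrative only: the event labels, the module names, and in particular the routing of coherence events to the memory hierarchy module and of event-free segments to the processor module are assumptions, not details given in the text.

```python
from collections import defaultdict

# Hypothetical routing from split-point event to segment simulation module.
MODULE_FOR_EVENT = {
    "branch_mispredict": "processor",
    "none": "processor",               # assumption: event-free runs go here
    "cache_miss": "memory_hierarchy",
    "coherence_miss": "memory_hierarchy",  # assumption
    "network": "interconnect",
}


def group_segments(segments):
    """Group (event, length) segments by the module that will simulate them."""
    groups = defaultdict(list)
    for event, length in segments:
        groups[MODULE_FOR_EVENT[event]].append((event, length))
    return dict(groups)


segments = [("cache_miss", 1), ("branch_mispredict", 2),
            ("cache_miss", 5), ("none", 2), ("network", 3)]
print(group_segments(segments))
```

Because grouping depends only on the event label, segments from different threads can share a group, which is what lets each module process its whole batch independently of the others.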

Preferably, the segment simulation modules comprise: a processor simulation module, an interconnection network simulation module, and a memory hierarchy simulation module. Each of these modules accepts the discrete segments belonging to its own group (that is, the instruction stream sample fragments after splitting and grouping) and derives, according to predefined rules, the simulation time each segment requires.

These segment simulation modules rest on the principle that segments in the same group have similar program behavior and the same runtime parameters, so runtime parameters such as simulated execution time can be derived from predefined rules. The segment simulation modules do not use cycle-accurate simulation; they simulate at a higher, microarchitecture-independent level and derive each discrete segment's simulated execution time quickly from predefined rules, which greatly increases simulation speed. Preferably: when the processor simulation module receives a branch misprediction segment, the segment's simulation time under the predefined rule is the branch misprediction resolution time plus the instruction pipeline depth after the branch instruction; when the interconnection network simulation module receives a segment requiring inter-core communication, it derives the segment's simulation time directly from the current routing policy and the degree of network congestion; when the memory hierarchy simulation module receives a cache miss segment for some cache level, the simulation time is the predefined miss latency for that cache level; and for an instruction sequence within a fragment in which no event occurs, the simulation time is the predefined per-instruction execution time multiplied by the number of instructions.
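The four timing rules can be written as a small lookup function. This is a sketch under stated assumptions: every constant below is a placeholder rather than a value from the patent, and the network formula (hops times per-hop latency times a congestion factor) is an invented stand-in for "derived from the current routing policy and network congestion", which the text does not specify further.

```python
# Placeholder parameters (assumptions, in cycles); a real configuration
# would take these from the microarchitecture being evaluated.
BRANCH_RESOLUTION_CYCLES = 10
PIPELINE_DEPTH_CYCLES = 5
MISS_LATENCY_CYCLES = {"L1": 4, "L2": 12, "L3": 40}
CYCLES_PER_INSTRUCTION = 1
HOP_CYCLES = 2


def segment_time(kind, length=0, cache_level=None, hops=0, congestion=1.0):
    """Derive a segment's simulated time from fixed rules, not cycle-level detail."""
    if kind == "branch_mispredict":
        # Resolution time plus pipeline depth after the branch instruction.
        return BRANCH_RESOLUTION_CYCLES + PIPELINE_DEPTH_CYCLES
    if kind == "cache_miss":
        # Predefined miss latency for the cache level that missed.
        return MISS_LATENCY_CYCLES[cache_level]
    if kind == "network":
        # Assumed stand-in for routing policy and congestion.
        return hops * HOP_CYCLES * congestion
    if kind == "none":
        # Event-free run: per-instruction time times instruction count.
        return CYCLES_PER_INSTRUCTION * length
    raise ValueError(f"unknown segment kind: {kind}")


print(segment_time("branch_mispredict"))
print(segment_time("cache_miss", cache_level="L2"))
print(segment_time("none", length=100))
```

Changing a design point then means changing a handful of parameters rather than rerunning a detailed pipeline model, which is where the speedup over cycle-accurate simulation comes from.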

S6: summing the simulation times output by all segment simulation modules in S5 to obtain the total simulated execution time of the multithreaded application from S1;

Working together, the processor simulation module, interconnection network simulation module, and memory hierarchy simulation module in S5 complete the simulation of the instruction stream sample fragments produced by the sampling strategy. Summing the simulated execution times that the three modules output for each instruction stream sample fragment gives the total simulated execution time of all the instruction stream sample fragments.
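The final summation, with the three modules running concurrently, might look like the sketch below. The module names, the per-segment times, and the use of a thread pool are assumptions for illustration; `run_module` is a hypothetical stand-in that only adds up precomputed times, and in CPython a thread pool shows the structure of the parallel design rather than delivering a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def run_module(segment_times):
    """Stand-in for one segment simulation module: returns its summed time."""
    return sum(segment_times)


def total_simulated_time(groups):
    """Run every module's group concurrently, then add the module outputs."""
    workers = max(1, len(groups))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_module = list(pool.map(run_module, groups.values()))
    return sum(per_module)


# Hypothetical per-segment times (in cycles) already grouped by module.
groups = {
    "processor": [15, 15, 100],
    "memory_hierarchy": [12, 40],
    "interconnect": [12],
}
print(total_simulated_time(groups))
```

Since each group is independent once the segments are split and grouped, the modules need no synchronization until the final addition.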

Multi-core and many-core systems are developing rapidly, and efficient simulators are urgently needed to help explore an ever-growing design optimization space. In response, this embodiment provides a sampling-based parallel acceleration method for multi-core simulation. The method combines a multithreaded parallel simulation strategy, a fractal sampling strategy based on an improved Cantor set, and a phase-based segmented simulation strategy. At the cost of a small loss in simulation accuracy, it significantly reduces simulation time and improves simulation efficiency, providing a good tool for performance evaluation and design optimization of many-core systems. The invention also makes full use of today's widespread multi-core hardware, further increasing simulation speed.

Specifically, the parallel acceleration method for multi-core simulation proposed in this embodiment combines a multithreaded parallel simulation strategy, a fractal sampling strategy based on an improved Cantor set, and a phase-based segmented simulation strategy, achieving multiple layers of acceleration in the simulation process. Compared with traditional single-threaded and cycle-accurate simulation strategies, the invention significantly increases simulation speed and is highly efficient, and it is well suited to optimizing current many-core system designs. The configuration of each segment simulation module provided by this embodiment, namely the processor simulation module, the memory hierarchy simulation module, and the interconnection network simulation module, can be changed flexibly as needed, which suits the need to explore a large number of candidate designs during many-core system design. The improved Cantor set fractal sampling strategy adopted in this embodiment derives the parameters needed for sampling quickly from a simple formula, without repeated trial simulation to validate each sampling parameter, and is therefore simple to apply. Likewise, compared with a cycle-accurate simulation strategy, the phase-based segmented simulation strategy adopted in this embodiment yields a simulation framework that is easy to implement, small in code size, and simple to use.

The embodiment above describes only one implementation of the invention. Although the description is specific and detailed, it should not be construed as limiting the claims of the invention. It should be noted that those of ordinary skill in the art can make variations and improvements without departing from the inventive concept, and all of these fall within the scope of protection of the invention. The scope of protection of this patent is therefore defined by the appended claims.

Claims

1. A sampling-based parallel acceleration method for multi-core simulation, characterized by comprising:

S1: selecting a multithreaded application as the multi-core benchmark;

2. The sampling-based parallel acceleration method for multi-core simulation according to claim 1, characterized in that the sampling strategy in S2 comprises:

3. The sampling-based parallel acceleration method for multi-core simulation according to claim 2, characterized in that the sampling strategy in S2 also comprises:

4. The sampling-based parallel acceleration method for multi-core simulation according to claim 2, characterized in that the equal division produces M parts; the resulting instruction stream fragments are numbered in order as 0, 1, 2, ..., M-1, M; the even-numbered fragments are discarded and the odd-numbered fragments are taken as the preliminary instruction stream sample fragments, where M is a natural number greater than 1.

5. The sampling-based parallel acceleration method for multi-core simulation according to claim 1, characterized in that the split points in S3 and S4 are miss events.

6. The sampling-based parallel acceleration method for multi-core simulation according to claim 1, characterized in that the dynamic code analysis module in S3 comprises:

7. The sampling-based parallel acceleration method for multi-core simulation according to claim 1, characterized in that the segment simulation modules in S5 and S6 comprise: a processor simulation module, an interconnection network simulation module, and a memory hierarchy simulation module.