CN108519906A - Superscalar out-of-order processor steady-state instruction throughput modeling method - Google Patents

Superscalar out-of-order processor steady-state instruction throughput modeling method

Info

Publication number
CN108519906A
CN108519906A (application CN201810229640.8A)
Authority
CN
China
Prior art keywords
instruction
neural network
throughput
micro-architecture
steady state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810229640.8A
Other languages
Chinese (zh)
Other versions
CN108519906B (en
Inventor
凌明
季柯丞
张凌峰
李宽
时龙兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201810229640.8A priority Critical patent/CN108519906B/en
Publication of CN108519906A publication Critical patent/CN108519906A/en
Application granted granted Critical
Publication of CN108519906B publication Critical patent/CN108519906B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45554Instruction set architectures of guest OS and hypervisor or native processor differ, e.g. Bochs or VirtualPC on PowerPC MacOS

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for modeling the steady-state instruction throughput of a superscalar out-of-order processor. For each profiling interval, the micro-architecture-independent parameters correlated with the steady-state average throughput are collected; these parameters include at least the dependency-chain delay distribution. A clustering algorithm classifies the intervals and selects the training set for a neural network. The micro-architecture-independent parameters of the selected training set serve as the network's input, and the per-thread steady-state instruction throughput of the same intervals, obtained by cycle-accurate simulation, serves as the network's output. The inputs and outputs are fitted and, by adjusting the network's iteration count, topology, transfer functions, and target training accuracy, a steady-state instruction-throughput neural-network model of the given hardware is trained. Because the dependency-chain delay distribution is a micro-architecture-independent feature of the instruction stream, the model can quickly and accurately predict the steady-state instruction throughput of a superscalar out-of-order processor with a given micro-architecture.

Description

Superscalar out-of-order processor steady-state instruction throughput modeling method
Technical field
The invention belongs to the field of computer architecture and modeling techniques, and more particularly relates to a method, based on an artificial neural network, for modeling the steady-state instruction throughput of a superscalar out-of-order processor pipeline.
Background art
Architecture evaluation and design-space exploration based on hardware behavior modeling can provide guidance for chip design and shorten the design-iteration cycle. For a particular processor running a designated program, the instruction throughput of the out-of-order pipeline in its steady state, i.e., when no miss events (cache misses, branch mispredictions, etc.) occur, characterizes the processor's performance limit; to some extent it also reflects whether the application's design is well matched to the hardware. Accurate prediction of the instruction throughput in the steady state of an out-of-order processor is therefore the basis of overall performance analysis and modeling for such processors.
The average instruction throughput of an out-of-order processor in steady state is the average number of instructions issued per clock cycle in the absence of miss events. Early estimates of the steady-state instruction throughput were quite simple: the width of the front-end issue stage was taken directly as the steady-state average throughput. This method assumes that, when no miss events occur, the processor handles a number of instructions equal to the front-end issue width in every clock cycle. It ignores instruction dependencies, the number and kinds of functional units, instruction latencies, and the distribution of serializing instructions; it is an idealized assumption with large error.
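The gap between the naive issue-width model and dependency-limited execution can be illustrated with a toy bound. This is an illustration only, not the patent's model; the function name and numbers are invented: a serial dependency chain forces back-to-back execution, so the achieved IPC can fall well below the issue width.

```python
def steady_state_ipc_bound(n_instr, issue_width, chain_len, latency=1):
    """Toy lower bound on cycles: the interval can finish no earlier than
    either the serial dependency chain or the issue-width limit allows."""
    chain_cycles = chain_len * latency          # chain executes serially
    width_cycles = -(-n_instr // issue_width)   # ceil(n / width)
    cycles = max(chain_cycles, width_cycles)
    return n_instr / cycles

# 1000 instructions on a 4-wide machine: a 500-instruction serial chain
# halves the ideal IPC of 4.
print(steady_state_ipc_bound(1000, 4, chain_len=500))  # -> 2.0
print(steady_state_ipc_bound(1000, 4, chain_len=10))   # -> 4.0
```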
More recent research has observed an exponential relationship between the steady-state instruction throughput and the instruction-window size of an out-of-order processor, with the coefficients fitted from experimental measurements. This method has two shortcomings. First, the steady-state instruction throughput it yields is a constant: it reflects only the average over a long time scale and lacks dynamics. Second, the value is independent of the specific software workload, so it cannot reflect the characteristics of different programs and incurs large errors.
The steady-state average throughput does not depend on each influencing factor in a simple, separable way: coupling effects between the factors also influence it, which makes a mechanistic analysis difficult. Meanwhile, full-function cycle-accurate simulation is prohibitively slow. Hence the present invention.
Summary of the invention
In view of the above technical problems, the object of the present invention is to provide a method for modeling the steady-state instruction throughput of a superscalar out-of-order processor. Exploiting the fact that the dependency-chain delay distribution is a micro-architecture-independent feature, the method quickly and accurately predicts the instruction throughput in the steady state of a superscalar out-of-order processor with a given micro-architecture, with high accuracy and high speed.
The technical solution of the invention is:
A superscalar out-of-order processor steady-state instruction throughput modeling method, comprising the following steps:
S01: for each profiling interval, collect the micro-architecture-independent parameters correlated with the steady-state average throughput, the micro-architecture-independent parameters including at least the dependency-chain delay distribution;
S02: classify the intervals with a clustering algorithm and select the training set of the neural network;
S03: use the micro-architecture-independent parameters of the selected training set as the input of the neural network and the per-thread steady-state instruction throughput of the corresponding training set, obtained by cycle-accurate simulation, as the output of the neural network; fit the inputs and outputs of the neural network and, by adjusting its iteration count, network topology, transfer functions, and target training accuracy, train the steady-state instruction-throughput neural-network model of the given hardware.
Preferably, obtaining the dependency-chain delay distribution in step S01 comprises:
S11: by defining a dependency-chain structure, determine, when each instruction enters the instruction window, the dependencies between it and the other instructions in the window, and record the resulting chain lengths;
S12: by defining an instruction-type structure, record the type of each instruction; the execution time of an instruction can then be obtained from its type;
S13: from these statistics, obtain the dependency-chain delay distribution.
Preferably, the micro-architecture-independent parameters further include the dynamic instruction-mix ratio, the total running time of the target thread, and the total number of instructions executed.
Preferably, before step S01 the method further comprises choosing a suitable fixed time-slice length and cutting the program's execution stream (trace) at every time slice, so that the whole target-program execution stream is divided into segments; a corresponding data set is collected for each program segment, and each segment serves as one profiling interval.
Preferably, before step S02 the method further comprises preprocessing the micro-architecture-independent parameters of each segment to form the segment's micro-architecture-independent parameter vector, then applying a dimensionality-reduction algorithm to the vectors for dimensionality reduction and denoising, forming the segment's micro-architecture-independent data set.
Preferably, the method further comprises:
dividing the data set containing the feature vectors of all profiling intervals into a number of coarse classes;
for each coarse class, splitting it into a proportional number of groups with the k-means clustering algorithm;
choosing, in each group, the point nearest to the group's centroid as a feature vector.
Compared with existing methods for predicting the steady-state average throughput, the advantages of the invention are:
The proposed dependency-chain delay distribution covers well the micro-architecture-independent parameters that influence the steady-state average throughput, including the dynamic instruction-mix ratio and the dependency-chain delay distribution, so a more accurate steady-state throughput model can be established.
In addition, the invention predicts the steady-state average instruction throughput with a neural network, which fully accounts for the coupling between the micro-architecture-independent parameters; once trained, the model predicts the value of the steady-state average instruction throughput quickly and accurately.
Description of the drawings
The invention is further described below with reference to the accompanying drawings and embodiments:
Fig. 1 is the flow chart of the superscalar out-of-order processor steady-state instruction throughput modeling method of the invention;
Fig. 2 is the detailed flow chart of training the artificial neural-network model;
Fig. 3 is the flow chart of the dependency-chain delay distribution statistics;
Fig. 4 is the neural-network topology diagram;
Fig. 5 is the block diagram of the inputs and target outputs for neural-network model training and testing.
Detailed description of the embodiments
The above scheme is further described below with reference to a specific embodiment. It should be understood that the embodiment illustrates the invention and does not limit its scope. The implementation conditions used in the embodiment may be further adjusted according to specific circumstances; conditions not specified are the usual conditions of routine experiments.
Embodiment:
As shown in Figs. 1 and 2, the superscalar out-of-order processor steady-state instruction throughput modeling method of the invention comprises the following steps:
(1) Choose a reasonable fixed program-execution time-slice length. During instruction-set simulation, cut the program's execution stream at the chosen slice length, dividing the whole target program into segments, and collect a corresponding data set for each segment. In this embodiment we use thread-switch boundaries as the segmentation criterion: the interval from the moment the target thread is scheduled until the operating system switches it out serves as one profiling interval.
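The simpler fixed-length slicing variant described in the claims can be sketched in a few lines; the function name and the toy trace are illustrative, not part of the patent:

```python
def split_trace(trace, slice_len):
    """Cut a program execution stream into fixed-length profiling
    intervals; the last segment may be shorter than slice_len."""
    return [trace[i:i + slice_len] for i in range(0, len(trace), slice_len)]

# A 10-instruction toy trace cut into slices of 4.
segments = split_trace(list(range(10)), slice_len=4)
print(segments)  # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The embodiment's thread-switch segmentation would instead cut wherever the simulated OS deschedules the target thread, yielding variable-length intervals.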
(2) Using the instruction-set simulator, collect for each profiling interval the micro-architecture-independent parameters correlated with the steady-state average throughput, chiefly including the dependency-chain delay distribution and the total instruction count of the target program.
By defining a dependency-chain structure, determine, when each instruction enters the instruction window, its dependencies on the other instructions in the window, and record the chain lengths; by defining an instruction-type structure, record each instruction's type, from which its execution time can be obtained. Combining these two structures yields the dependency-chain delay distribution. At the same time, monitor the processor's issue stage and count the number of instructions issued and the clock cycles spent within a period of time or an instruction-stream segment.
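The throughput target measured at the issue stage is simply instructions issued divided by cycles spent. A minimal sketch; the per-cycle log format is an assumption made for illustration:

```python
def steady_state_ipc(issue_log):
    """issue_log: one entry per clock cycle of a miss-free interval,
    giving the number of instructions issued that cycle. The steady-state
    average throughput is total instructions over total cycles."""
    return sum(issue_log) / len(issue_log)

# Six cycles of a 4-wide machine: 21 instructions in 6 cycles.
print(steady_state_ipc([4, 3, 4, 4, 2, 4]))  # -> 3.5
```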
For each profiling interval so delimited, the relevant micro-architecture-independent parameters can be obtained; they may include the dynamic instruction-mix ratio (the numbers of floating-point, fixed-point, SIMD, and load/store instructions, etc.), the dependency-chain delay distribution within each instruction window during the interval, the total running time of the target thread, and the total number of instructions executed.
Fig. 3 is the detailed flow chart of the dependency-chain delay distribution statistics. When an instruction enters the window, its own type is detected (multi-cycle or single-cycle) along with its dependencies on the instructions already in the window; the dependency-chain length is computed and recorded. After a profiling interval finishes, the counts of dependency-chain delays of each length are tallied to obtain the interval's dependency-chain delay histogram.
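The Fig. 3 statistics might be sketched as follows. The latency table, trace format, and window size here are all assumptions for illustration, not the patent's data structures: each instruction's chain delay is the longest producer chain still in the window plus its own latency.

```python
from collections import Counter

# Per-type latencies in cycles (illustrative values; the real table comes
# from the simulated micro-architecture).
LATENCY = {"alu": 1, "mul": 3, "load": 4}

def chain_delay_histogram(trace, window_size=128):
    """trace: list of (instr_type, [source_ids]) tuples, one per dynamic
    instruction, where source_ids index earlier instructions in the trace.
    Returns a Counter mapping chain delay -> occurrence count."""
    delay = {}        # instruction id -> delay of its dependency chain
    hist = Counter()
    for i, (itype, sources) in enumerate(trace):
        window_start = max(0, i - window_size)
        # Only producers still inside the instruction window extend a chain.
        pred = [delay[s] for s in sources if s >= window_start]
        delay[i] = (max(pred) if pred else 0) + LATENCY[itype]
        hist[delay[i]] += 1
    return hist

# load -> alu -> mul forms one chain; the last alu is independent.
trace = [("load", []), ("alu", [0]), ("mul", [1]), ("alu", [])]
print(chain_delay_histogram(trace))
```

Running this on the four-instruction toy trace records chain delays 4, 5, 8 (the load/alu/mul chain growing) and 1 (the independent alu).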
(3) In view of the artificial neural network's requirements on its input data, first preprocess the relevant micro-architecture-independent parameters of each profiling interval's code segment to form the segment's parameter vector. Then apply principal component analysis (keeping the principal components that explain at least 95% of the variance of the original data, thereby reducing its volume) to each parameter vector for dimensionality reduction and denoising, forming the segment's MicaData set (micro-architecture-independent data set). Other dimensionality-reduction algorithms, such as LDA, may of course be used.
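A sketch of the 95%-variance PCA step, assuming NumPy is available; the function name and toy data are illustrative:

```python
import numpy as np

def pca_reduce(X, var_keep=0.95):
    """Project the rows of X onto the fewest principal components whose
    cumulative explained variance reaches var_keep (95% in the patent)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; squared singular values give the
    # per-component variances.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S**2 / np.sum(S**2)
    k = int(np.searchsorted(np.cumsum(var), var_keep)) + 1
    return Xc @ Vt[:k].T   # reduced feature vectors, shape (n_samples, k)

rng = np.random.default_rng(0)
# Toy data: 3 informative dimensions mixed into 10, plus faint noise.
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10)) \
    + 0.01 * rng.normal(size=(200, 10))
Z = pca_reduce(X)
print(Z.shape)   # at most 3 components survive the 95% cutoff
```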
This embodiment uses a BP neural network. As shown in Fig. 4, the number of hidden-layer nodes follows the empirical formula
h = sqrt(m + n) + a
where h is the number of hidden-layer nodes, m the number of output-layer nodes, n the number of input-layer nodes, and a a constant. A three-layer network structure is used: the input is the dependency-chain delay distribution, with 150 input neurons in total; the middle hidden layer has 16 neurons; the output is the steady-state throughput value, with 1 output neuron. Training uses the LM (Levenberg-Marquardt) algorithm. The logsig transfer function is used between the input and hidden layers and the purelin transfer function between the hidden and output layers; the weights between the nodes of adjacent layers are adjusted with trainscg (scaled conjugate gradient).
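A quick check of the hidden-layer sizing rule h = sqrt(m + n) + a (the formula is reconstructed from the variable definitions in the text): with n = 150 inputs and m = 1 output, taking the constant a = 4 reproduces the 16 hidden neurons of the embodiment. The function name and the choice a = 4 are assumptions.

```python
import math

def hidden_nodes(n_in, n_out, a):
    """Empirical hidden-layer size: h = sqrt(m + n) + a, rounded to the
    nearest integer; a is a small tunable constant."""
    return round(math.sqrt(n_in + n_out) + a)

# sqrt(150 + 1) ~= 12.29, so a = 4 gives the embodiment's 16 neurons.
print(hidden_nodes(150, 1, 4))  # -> 16
```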
(4) Extract feature vectors from the profiling-interval segments of the target thread. First, this case uses a SOM (self-organizing feature map) to divide the data set containing the feature vectors of all profiling intervals into 200 coarse classes (the number of classes can be adjusted according to the quality of the classification). Then, for each coarse class (suppose it contains N feature vectors), the k-means clustering algorithm splits it into N*30% groups (the ratio can be adjusted according to experimental results). Finally, the point nearest to each group's centroid is chosen as one distinctive feature vector.
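The group-and-pick-nearest selection can be sketched with plain k-means. This is a simplification: the SOM coarse-classification stage is omitted, and the helper names are invented for illustration.

```python
import random

def _d2(p, q):
    # squared Euclidean distance between two equal-length tuples
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_representatives(points, k, iters=50, seed=0):
    """Split `points` into k groups with Lloyd's k-means and return, for
    each non-empty group, the member nearest its centroid."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: _d2(p, cents[j]))].append(p)
        for j, g in enumerate(groups):
            if g:   # keep the old centroid if a group empties out
                cents[j] = tuple(sum(c) / len(g) for c in zip(*g))
    return [min(g, key=lambda p: _d2(p, cents[j]))
            for j, g in enumerate(groups) if g]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.2), (9.0, 0.1)]
reps = kmeans_representatives(pts, k=3)
print(sorted(reps))
```

Each representative is an actual data point, not a synthetic centroid, which matches the patent's intent of keeping real profiling intervals in the training set.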
(5) All the feature points chosen by the clustering algorithm serve as the input of the BP neural network, and its output is the steady-state average throughput of the target thread obtained in step (2). The inputs and outputs of the BP network are fitted; by adjusting the network's iteration count, topology, transfer functions, and target training accuracy, a BP-neural-network model of the pipeline's steady-state instruction throughput under the current hardware architecture is trained.
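A minimal BP regressor in the spirit of the embodiment's logsig/purelin network, trained here with plain stochastic gradient descent rather than LM or trainscg (a deliberate simplification); the class name and the toy fitting target are illustrative:

```python
import math
import random

def logsig(x):
    # MATLAB-style logsig: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

class TinyBP:
    """One hidden layer of logsig units, purelin (identity) output."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        self.h = [logsig(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, self.h)) + self.b2

    def train_step(self, x, t, lr=0.05):
        err = self.forward(x) - t
        for j, h in enumerate(self.h):
            g = err * self.w2[j] * h * (1.0 - h)  # backprop through logsig
            self.w2[j] -= lr * err * h            # purelin output layer
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * g * xi
            self.b1[j] -= lr * g
        self.b2 -= lr * err
        return err * err

# Toy regression target: learn y = mean(x) on random 4-dim inputs.
rng = random.Random(1)
data = [[rng.random() for _ in range(4)] for _ in range(200)]
data = [(x, sum(x) / 4.0) for x in data]

net = TinyBP(n_in=4, n_hidden=8)
losses = []
for epoch in range(200):
    losses.append(sum(net.train_step(x, t) for x, t in data) / len(data))
print(round(losses[0], 4), round(losses[-1], 4))  # MSE should shrink
```

The real model would use 150 inputs (the delay-distribution histogram) and 16 hidden neurons, with the measured steady-state throughput as the target.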
Classifying the dimensionality-reduced data with a clustering algorithm selects representative data as the neural network's training set. The purpose of choosing the training set by clustering is to shrink it as much as possible while retaining the main information of the original data.
(6) The model obtained in step (5) can be used to predict the pipeline steady-state instruction throughput of other software on the given hardware architecture. Run the target program on the instruction-set simulator with software instrumentation added, collect the data relevant to the dependency-chain delay distribution, preprocess it, and feed it into the model obtained in step (5); the instruction throughput of the analysed thread in the steady state of the out-of-order processor under the current hardware architecture can then be predicted quickly and accurately.
Fig. 5 is the block diagram of the inputs and target outputs during neural-network model training and application. From full-function cycle-accurate simulation we obtain both the parameter inputs for training (the micro-architecture-independent parameters) and the target outputs (the steady-state instruction throughput), and thus train a high-accuracy model. At prediction time (when applying the model), the application's relevant parameters need only be produced by an instruction-level simulator, which is much faster than a full-function cycle-accurate simulator, or by another trace generator; importing these parameters into the model then predicts the steady-state average throughput rapidly. In the figure, solid lines show the training flow and dashed lines the prediction flow.
The foregoing embodiment merely illustrates the technical concept and features of the invention; its purpose is to enable those skilled in the art to understand and implement the invention, not to limit its scope. Any equivalent transformation or modification made according to the spirit of the invention shall fall within the protection scope of the invention.

Claims (6)

1. A superscalar out-of-order processor steady-state instruction throughput modeling method, characterized by comprising the following steps:
S01: for each profiling interval, collect the micro-architecture-independent parameters correlated with the steady-state average throughput, the micro-architecture-independent parameters including at least the dependency-chain delay distribution;
S02: classify the intervals with a clustering algorithm and select the training set of the neural network;
S03: use the micro-architecture-independent parameters of the selected training set as the input of the neural network and the per-thread steady-state instruction throughput of the corresponding training set, obtained by cycle-accurate simulation, as the output of the neural network; fit the inputs and outputs of the neural network and, by adjusting its iteration count, network topology, transfer functions, and target training accuracy, train the steady-state instruction-throughput neural-network model of the given hardware.
2. The superscalar out-of-order processor steady-state instruction throughput modeling method according to claim 1, characterized in that obtaining the dependency-chain delay distribution in step S01 comprises:
S11: by defining a dependency-chain structure, determine, when each instruction enters the instruction window, the dependencies between it and the other instructions in the window, and record the resulting chain lengths;
S12: by defining an instruction-type structure, record the type of each instruction; the execution time of an instruction can then be obtained from its type;
S13: from these statistics, obtain the dependency-chain delay distribution.
3. The superscalar out-of-order processor steady-state instruction throughput modeling method according to claim 1, characterized in that the micro-architecture-independent parameters further include the dynamic instruction-mix ratio, the total running time of the target thread, and the total number of instructions executed.
4. The superscalar out-of-order processor steady-state instruction throughput modeling method according to claim 1, characterized by further comprising, before step S01, choosing a suitable fixed time-slice length and cutting the program's execution stream (trace) at every time slice, so that the whole target-program execution stream is divided into segments; a corresponding data set is collected for each program segment, and each segment serves as one profiling interval.
5. The superscalar out-of-order processor steady-state instruction throughput modeling method according to claim 1, characterized by further comprising, before step S02, preprocessing the micro-architecture-independent parameters of each segment to form the segment's micro-architecture-independent parameter vector, then applying a dimensionality-reduction algorithm to the vectors for dimensionality reduction and denoising, forming the segment's micro-architecture-independent data set.
6. The superscalar out-of-order processor steady-state instruction throughput modeling method according to claim 5, characterized by further comprising:
dividing the data set containing the feature vectors of all profiling intervals into a number of coarse classes;
for each coarse class, splitting it into a proportional number of groups with the k-means clustering algorithm;
choosing, in each group, the point nearest to the group's centroid as a feature vector.
CN201810229640.8A 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method Active CN108519906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810229640.8A CN108519906B (en) 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810229640.8A CN108519906B (en) 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method

Publications (2)

Publication Number Publication Date
CN108519906A true CN108519906A (en) 2018-09-11
CN108519906B CN108519906B (en) 2022-03-22

Family

ID=63434021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810229640.8A Active CN108519906B (en) 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method

Country Status (1)

Country Link
CN (1) CN108519906B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102652304A (en) * 2009-12-22 2012-08-29 国际商业机器公司 Predicting and avoiding operand-store-compare hazards in out-of-order microprocessors
CN103577159A (en) * 2012-08-07 2014-02-12 想象力科技有限公司 Multi-stage register renaming using dependency removal
CN105630458A (en) * 2015-12-29 2016-06-01 东南大学—无锡集成电路技术研究所 Prediction method of out-of-order processor steady-state average throughput rate based on artificial neural network

Also Published As

Publication number Publication date
CN108519906B (en) 2022-03-22

Similar Documents

Publication Publication Date Title
CN101399672B (en) Intrusion detection method for fusion of multiple neutral networks
CN107103332A (en) A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN106780121A (en) A kind of multiplexing electric abnormality recognition methods based on power load pattern analysis
CN110149237A (en) A kind of Hadoop platform calculate node load predicting method
CN102955902B (en) Method and system for evaluating reliability of radar simulation equipment
CN110047291A (en) A kind of Short-time Traffic Flow Forecasting Methods considering diffusion process
CN105630458B (en) The Forecasting Methodology of average throughput under a kind of out-of order processor stable state based on artificial neural network
Li et al. A hybrid model for river water level forecasting: cases of Xiangjiang River and Yuanjiang River, China
CN110569876A (en) Non-invasive load identification method and device and computing equipment
CN109359665A (en) A kind of family's electric load recognition methods and device based on support vector machines
CN115510042A (en) Power system load data filling method and device based on generation countermeasure network
CN110110915A (en) A kind of integrated prediction technique of the load based on CNN-SVR model
CN107909141A (en) A kind of data analysing method and device based on grey wolf optimization algorithm
CN102324007A (en) Method for detecting abnormality based on data mining
CN114720764A (en) Harmonic analysis method and system based on real-time monitoring data of electric meter
CN110941902A (en) Lightning stroke fault early warning method and system for power transmission line
CN112308341A (en) Power data processing method and device
CN115277354A (en) Fault detection method for command control network management system
CN112100910A (en) Power consumption model training method, power consumption testing method and device for processor
Tarsa et al. Workload prediction for adaptive power scaling using deep learning
CN109117352B (en) Server performance prediction method and device
CN113420506A (en) Method for establishing prediction model of tunneling speed, prediction method and device
CN109242142A (en) A kind of spatio-temporal segmentation parameter optimization method towards infrastructure networks
CN108519906A (en) Superscale out-of order processor stable state instructs throughput modeling method
CN110377525B (en) Parallel program performance prediction system based on runtime characteristics and machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant