CN108519906B - Superscalar out-of-order processor steady state instruction throughput rate modeling method - Google Patents


Info

Publication number
CN108519906B
CN108519906B (application CN201810229640.8A)
Authority
CN
China
Prior art keywords
instruction
neural network
micro
throughput rate
steady
Prior art date
Legal status
Active
Application number
CN201810229640.8A
Other languages
Chinese (zh)
Other versions
CN108519906A (en)
Inventor
凌明
季柯丞
张凌峰
李宽
时龙兴
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN201810229640.8A
Publication of CN108519906A
Application granted
Publication of CN108519906B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45554Instruction set architectures of guest OS and hypervisor or native processor differ, e.g. Bochs or VirtualPC on PowerPC MacOS

Abstract

The invention discloses a method for modeling the steady-state instruction throughput rate of a superscalar out-of-order processor. The method obtains, in each statistical stage, micro-architecture independent parameters related to the steady-state average throughput rate, at least including the dependent link delay distribution; classifies them with a clustering algorithm and selects a training set for a neural network; uses the micro-architecture independent parameters of the selected training set as the network input and the thread steady-state instruction throughput rates of the corresponding training set, obtained by cycle-accurate simulation, as the network output; and fits input to output, training a steady-state instruction throughput neural network model for the given hardware by adjusting the network's iteration count, topology, transfer functions, and preset training precision. Because the dependent link delay distribution is independent of the micro-architecture, the method can quickly and accurately predict the steady-state instruction throughput rate of a superscalar out-of-order processor with a given micro-architecture.

Description

Superscalar out-of-order processor steady state instruction throughput rate modeling method
Technical Field
The invention belongs to the technical field of computer architecture and modeling, and in particular relates to an instruction throughput rate modeling method, based on an artificial neural network, for the steady state of a superscalar out-of-order processor pipeline.
Background
Architecture evaluation and design space exploration based on hardware behavior modeling can provide guidance for chip design and shorten design iteration cycles. For a given processor running a given program, the steady-state instruction throughput rate of the out-of-order pipeline represents the performance limit of the processor when no miss event (such as a cache miss or a branch misprediction) occurs, and to some extent reflects how well the application is matched to the hardware. Accurately predicting instruction throughput in the steady state is the basis for analyzing and modeling the overall performance of an out-of-order processor.
The steady-state average instruction throughput of an out-of-order processor is the average number of instructions issued per clock cycle when no miss event occurs. Early estimates of this quantity were simple: the width of the front-end issue stage was taken directly as the steady-state average throughput, on the assumption that, absent miss events, the processor can process as many instructions per cycle as the front-end issue width allows. This ignores instruction dependences, the number and types of functional units, instruction latencies, the distribution of serializing instructions, and other factors, and is therefore a highly idealized assumption with large error.
Recent research has observed an exponential relationship between steady-state instruction throughput and instruction window size, with coefficients fitted from experimental measurements. This approach has two drawbacks. First, the steady-state throughput it produces is a constant that reflects only a long-run average and lacks dynamics. Second, the result is independent of the characteristics of the specific software load, cannot reflect differences between programs, and therefore carries large error.
The steady-state average throughput is not determined by each influencing factor acting alone: coupling effects between the factors also shape it, which makes analysis from mechanism first principles difficult. At the same time, full-function cycle-accurate simulation is prohibitively expensive in time. The present invention was made against this background.
Disclosure of Invention
In view of the above technical problems, the invention aims to provide a modeling method for the steady-state instruction throughput rate of a superscalar out-of-order processor that rapidly and accurately predicts the steady-state throughput of a given micro-architecture from the micro-architecture independent characteristic of the dependent link delay distribution, with both high precision and high speed.
The technical scheme of the invention is as follows:
a superscalar out-of-order processor steady state instruction throughput modeling method, comprising the steps of:
s01: acquiring micro-architecture independent parameters related to steady-state average throughput rate in each statistical stage, wherein the micro-architecture independent parameters at least comprise dependent link delay distribution;
s02: classifying by using a clustering algorithm, and selecting a training set of the neural network;
s03: using the micro-architecture independent parameters of the selected training set as the input of the neural network, and the thread steady-state instruction throughput rates of the corresponding training set, obtained by cycle-accurate simulation, as its output; fitting input to output, and training a steady-state instruction throughput neural network model for the given hardware by adjusting the iteration count, network topology, transfer functions, and preset training precision of the network.
Preferably, the obtaining of the dependent link delay distribution in step S01 includes:
s11: determining the dependency relationship between each instruction and other instructions in the instruction window when the instruction enters the instruction window by defining a structural body of the instruction dependent link, and counting the length of the dependent link;
s12: recording the type of each instruction by defining a structural body of the instruction type, and acquiring the time required by instruction execution according to the type of the instruction;
s13: and counting to obtain the dependent link delay distribution.
Preferably, the micro-architecture independent parameters further include a dynamic instruction stream blend ratio, a total time of execution of the target thread, and a total number of instructions executed.
Preferably, before step S01 the method further includes selecting a suitable fixed time-slice length, cutting the program execution flow (Trace) at every time slice so that the entire target program execution flow is divided into a plurality of segments, collecting a corresponding data set for each program segment, and using each program segment as a statistical stage.
Preferably, before step S02 the method further includes preprocessing the relevant micro-architecture independent parameters of each segment to form a micro-architecture independent parameter vector for that segment, and applying a data dimensionality-reduction algorithm to each such vector for dimensionality reduction and denoising, forming the micro-architecture independent data set of the corresponding segment.
Preferably, the method further comprises the following steps,
dividing a data set containing all the characteristic vectors in the statistical stage into a plurality of large classes;
for each major class, dividing the major class into minor classes in a certain proportion by using a k-means clustering algorithm;
and selecting the point closest to the central point in each subclass as a feature vector.
Compared with the existing prediction method of the steady-state average throughput rate, the method has the advantages that:
the invention utilizes the proposed dependent link delay distribution to better cover a plurality of microarchitectural independent parameters influencing the steady-state average throughput rate, which comprises the following steps: the dynamic instruction mixing ratio depends on the link delay distribution, and a more accurate steady-state throughput rate model can be established.
In addition, by using a neural network to predict the steady-state average instruction throughput rate, the coupling between the micro-architecture independent parameters is fully taken into account, and the steady-state average throughput value can be predicted accurately and quickly with the trained model.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a flow chart of a superscalar out-of-order processor steady state instruction throughput modeling method of the present invention;
FIG. 2 is a detailed flow chart of the method for training an artificial neural network model according to the present invention;
FIG. 3 is a flow chart of the dependent link delay distribution statistics method;
FIG. 4 is a neural network hierarchy diagram;
FIG. 5 is a block diagram of inputs and target outputs for neural network model training and testing.
Detailed Description
The above-described scheme is further illustrated below with reference to specific examples. It should be understood that these examples are for illustrative purposes and are not intended to limit the scope of the present invention. The conditions used in the examples may be further adjusted according to the conditions of the particular manufacturer, and the conditions not specified are generally the conditions in routine experiments.
Example (b):
as shown in FIGS. 1 and 2, the modeling method for the steady state instruction throughput rate of the superscalar out-of-order processor of the invention comprises the following steps:
(1) Select a suitable fixed time-slice length for program execution, and cut the program execution stream at that length while the instruction set simulator runs, so that the whole target program is divided into a number of segments; collect a corresponding data set for each segment. In this embodiment, thread switch intervals serve as segment boundaries: the interval from when the target thread is scheduled for execution until the operating system switches it out is taken as one statistical stage (Profiling Interval).
(2) Use the instruction set simulator to acquire, in each statistical stage, the micro-architecture independent parameters related to the steady-state average throughput rate, chiefly the dependent link delay distribution and the total instruction count of the target program.
Define a structure for the instruction dependent link to determine, as each instruction enters the instruction window, its dependences on the other instructions already in the window, and count the dependent link length; define a structure for the instruction type to record each instruction's type and obtain its execution latency from that type. Combining the two structures yields the dependent link delay distribution. At the same time, monitor the processor's issue stage and count the number of instructions issued and the number of clock cycles spent over a period of time or within a segment of the instruction stream.
For each statistical stage, the relevant micro-architecture independent parameters of the segment can then be obtained, including the dynamic instruction stream mix ratio (counts of floating-point, fixed-point, SIMD, load/store instructions, etc.), the dependent link delay distribution within each instruction window in the stage, the total run time of the target thread, and the total number of instructions executed.
Fig. 3 shows the detailed flow of computing the dependent link delay distribution. When an instruction enters the window, its type (multi-cycle or single-cycle instruction) and its dependences on instructions already in the window are detected, the dependent link length is calculated and recorded, and after the statistical stage completes, the delay values of dependent links of different lengths are tallied to produce the stage's dependent link delay distribution graph.
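The bookkeeping described above can be sketched as follows. This is an illustrative re-implementation, not the patent's code: the `Instr` class stands in for the two structures (dependent link and instruction type), and the per-type latencies are invented example values. Each instruction entering the window extends its longest producer chain; a histogram keyed by link length collects the accumulated delays.

```python
from collections import defaultdict

# Illustrative per-type latencies (single-cycle vs multi-cycle instructions).
LATENCY = {"alu": 1, "mul": 3, "load": 4}

class Instr:
    """Stand-in for the patent's dependent-link and instruction-type structures."""
    def __init__(self, itype, deps=()):
        self.latency = LATENCY[itype]
        self.deps = deps  # producer instructions already in the window
        # Extend the producer chain with the largest accumulated delay.
        longest = max(deps, key=lambda d: d.link_delay, default=None)
        self.link_len = 1 + (longest.link_len if longest else 0)
        self.link_delay = self.latency + (longest.link_delay if longest else 0)

def delay_distribution(instrs):
    """Histogram of accumulated link delays, keyed by dependent link length."""
    dist = defaultdict(list)
    for i in instrs:
        dist[i.link_len].append(i.link_delay)
    return dict(dist)

a = Instr("load")          # chain a            -> length 1, delay 4
b = Instr("mul", (a,))     # chain a -> b       -> length 2, delay 7
c = Instr("alu", (b,))     # chain a -> b -> c  -> length 3, delay 8
d = Instr("alu")           # independent        -> length 1, delay 1
dist = delay_distribution([a, b, c, d])
```

After a statistical stage, `dist` is the raw material for the delay distribution graph of Fig. 3: one delay population per link length.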
(3) To meet the input requirements of the artificial neural network, first preprocess the relevant micro-architecture independent parameters of each statistical stage's code segment into a micro-architecture independent parameter vector for that segment; then apply principal component analysis to each vector for dimensionality reduction and denoising (keeping the principal components that explain more than 95% of the variance of the original data), forming the segment's MicaData set (micro-architecture independent data set). Other dimensionality-reduction algorithms, such as LDA, may of course be used instead.
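The 95%-variance PCA step can be sketched generically as below. This is not the patent's implementation; in practice a library routine (for example scikit-learn's `PCA(n_components=0.95)`) would do the same job, and the toy parameter vectors here are invented.

```python
import numpy as np

def pca_keep_variance(X, threshold=0.95):
    """Project rows of X onto the fewest principal components whose
    cumulative explained variance reaches `threshold`."""
    Xc = X - X.mean(axis=0)                      # center each parameter
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]               # sort by descending variance
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(ratio, threshold)) + 1
    return Xc @ vecs[:, :k], k

# Toy "parameter vectors": the real variance lives in the first two dimensions.
X = np.zeros((100, 6))
X[:, 0] = np.repeat([1.0, -1.0], 50)
X[:, 1] = np.tile([1.0, -1.0], 50)
X[:, 2] = 0.001 * np.arange(100)                 # near-constant dimension
Z, k = pca_keep_variance(X)                      # two components survive
```

The reduced matrix `Z` is what would feed the MicaData set; its column count `k` depends only on where the cumulative variance crosses the threshold.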
In the embodiment, a BP neural network is adopted, and as shown in fig. 4, the invention is based on an empirical formula of the number of hidden layer nodes:
h = √(m + n) + a
where h is the number of hidden-layer nodes, m the number of output-layer nodes, n the number of input-layer nodes, and a a constant. The scheme uses a three-layer network: the input is the dependent link delay distribution, with 150 input-layer neurons; the hidden layer has 16 neurons; the output is the steady-state throughput value, with 1 output-layer neuron. The training method uses the Levenberg-Marquardt (LM) algorithm. A logsig transfer function is used between the input and hidden layers and a purelin transfer function between the hidden and output layers, and the weights between the nodes of each layer are adjusted with trainscg (scaled conjugate gradient).
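Plugging in the stated sizes, h = √(150 + 1) + a ≈ 12.3 + a, so a constant of about 4 yields the 16 hidden neurons used here. A minimal forward pass with the stated transfer functions is sketched below; the random weights and the dummy input vector are purely illustrative (MATLAB's logsig and purelin correspond to the sigmoid and identity functions used here).

```python
import math, random

def logsig(x):
    """MATLAB-style logsig: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

n, m, a = 150, 1, 4                        # input nodes, output nodes, constant
h = round(math.sqrt(n + m)) + a            # empirical hidden-layer size: 16

random.seed(0)                             # illustrative, untrained weights
W1 = [[random.gauss(0, 0.1) for _ in range(n)] for _ in range(h)]
W2 = [random.gauss(0, 0.1) for _ in range(h)]

def predict(x):
    """logsig hidden layer, purelin (identity) output layer."""
    hidden = [logsig(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * hi for w, hi in zip(W2, hidden))   # purelin: no squashing

ipc = predict([0.01] * n)                  # dummy delay-distribution vector
```

Training (LM or trainscg) would adjust `W1` and `W2` to fit the simulated throughput targets; only the topology and transfer functions are taken from the text.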
(4) Extract feature vectors from the statistical-stage segments retained for the target thread. First, a self-organizing map (SOM) divides the data set containing all statistical-stage feature vectors into 200 large classes (the number of classes can be adjusted according to the classification results); then each large class (say with N feature vectors) is divided into N × 30% subclasses with the k-means clustering algorithm (the ratio can be tuned according to experimental results); finally, the point closest to each subclass centroid is selected as a representative, clearly characterized feature vector.
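The subclass step, k-means followed by picking the point nearest each centroid, can be sketched as follows. This is a toy re-implementation: the SOM coarse pass and the 200-class/30% settings from the text are omitted, and the sample points are invented.

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on tuples of floats."""
    random.seed(seed)
    cents = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dist2(p, cents[j]))
            clusters[j].append(p)
        cents = [mean(c) if c else cents[j] for j, c in enumerate(clusters)]
    return cents, clusters

def representatives(points, k):
    """One feature vector per subclass: the point nearest its centroid."""
    cents, clusters = kmeans(points, k)
    return [min(c, key=lambda p: dist2(p, ct)) for ct, c in zip(cents, clusters) if c]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.1)]
reps = representatives(pts, 2)             # one representative per cluster
```

The selected representatives are actual data points, not centroids, which is what makes them usable as training samples with known simulated throughput targets.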
(5) All feature points selected by the clustering algorithm are used as the input of a BP neural network, whose output is the steady-state average throughput rate of the target thread obtained in step 2); the input and output are fitted, and a BP neural network model of the pipeline steady-state instruction throughput rate under the current hardware architecture is obtained by training, adjusting the iteration count, network topology, transfer functions, and preset training precision of the BP network.
The data produced by the dimensionality-reduction algorithm are classified by the clustering algorithm so that representative samples can be selected as the neural network's training set. The purpose of selecting the training set by clustering is to shrink the training set as far as possible while preserving the main information of the original data.
(6) The model obtained in step 5) can be used to predict the pipeline steady-state instruction throughput of other software under the given hardware architecture. Run the target program on an instruction set simulator with software instrumentation added, collect the statistics related to the dependent link delay distribution, preprocess the data, and feed it into the model from step 5); the instruction throughput of the target thread in the steady state of the out-of-order processor under the current hardware architecture can then be predicted quickly and accurately.
FIG. 5 shows the inputs and target outputs during neural network model training and application. Full-function cycle-accurate simulation provides the parameter input (micro-architecture independent parameters) and the target output (steady-state instruction throughput) used to train the model, yielding a model of higher accuracy. At prediction time (when the model is applied), the steady-state average throughput can be predicted rapidly by obtaining the relevant parameters of the target application from an instruction-level simulator, or any Trace generator faster than a full-function cycle-accurate simulator, and feeding them into the model. In the figure, the solid lines show the training flow and the dashed lines the prediction flow.
The above examples are only for illustrating the technical idea and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the content of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (5)

1. A method for modeling steady state instruction throughput of a superscalar out-of-order processor, comprising the steps of:
s01: acquiring micro-architecture independent parameters related to steady-state average throughput rate in each statistical stage, wherein the micro-architecture independent parameters at least comprise dependent link delay distribution;
the obtaining of the dependent link delay profile in step S01 includes:
s11: determining the dependency relationship between each instruction and other instructions in the instruction window when the instruction enters the instruction window by defining a structural body of the instruction dependent link, and counting the length of the dependent link;
s12: recording the type of each instruction through a structural body defining the type of the instruction, and acquiring the time required by instruction execution according to the type of the instruction;
s13: counting to obtain dependent link delay distribution;
s02: classifying by using a clustering algorithm, and selecting a training set of the neural network;
s03: using the micro-architecture independent parameters of the selected training set as the input of the neural network, and the thread steady-state instruction throughput rates of the corresponding training set, obtained by cycle-accurate simulation, as its output; fitting input to output, and training a steady-state instruction throughput neural network model for the given hardware by adjusting the iteration count, network topology, transfer functions, and preset training precision of the network.
2. The method of superscalar out-of-order processor steady state instruction throughput modeling, as recited in claim 1, wherein said microarchitecture independent parameters further comprise dynamic instruction stream blend ratio, total time of execution of a target thread, and total number of instructions executed.
3. The method according to claim 1, further comprising, prior to step S01, selecting a fixed time-slice length, cutting the program execution flow (Trace) at every time slice so that the entire target program execution flow is divided into a plurality of segments, collecting a corresponding data set for each program segment, and using each program segment as a statistical stage.
4. The method of modeling steady state instruction throughput of a superscalar out-of-order processor of claim 1, further comprising, prior to step S02, preprocessing the relevant micro-architecture independent parameters of each segment to form a micro-architecture independent parameter vector for the corresponding segment; and applying a data dimensionality-reduction algorithm to the parameter vectors for dimensionality reduction and denoising, to form a micro-architecture independent data set of the corresponding segment.
5. The method of superscalar out-of-order processor steady state instruction throughput modeling as recited in claim 4 further comprising,
dividing a data set containing all the characteristic vectors in the statistical stage into a plurality of large classes;
for each major class, dividing the major class into minor classes in a certain proportion by using a k-means clustering algorithm;
and selecting the point closest to the central point in each subclass as a feature vector.
CN201810229640.8A 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method Active CN108519906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810229640.8A CN108519906B (en) 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method


Publications (2)

Publication Number Publication Date
CN108519906A (en) 2018-09-11
CN108519906B (en) 2022-03-22

Family

ID=63434021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810229640.8A Active CN108519906B (en) 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method

Country Status (1)

Country Link
CN (1) CN108519906B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102652304A (en) * 2009-12-22 2012-08-29 国际商业机器公司 Predicting and avoiding operand-store-compare hazards in out-of-order microprocessors
CN103577159A (en) * 2012-08-07 2014-02-12 想象力科技有限公司 Multi-stage register renaming using dependency removal
CN105630458A (en) * 2015-12-29 2016-06-01 东南大学—无锡集成电路技术研究所 Prediction method of out-of-order processor steady-state average throughput rate based on artificial neural network




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant