CN108519906B - Superscalar out-of-order processor steady state instruction throughput rate modeling method - Google Patents


Info

Publication number
CN108519906B
CN108519906B (application CN201810229640.8A)
Authority
CN
China
Prior art keywords
instruction
neural network
micro
throughput rate
steady
Prior art date
Legal status
Active
Application number
CN201810229640.8A
Other languages
Chinese (zh)
Other versions
CN108519906A (en)
Inventor
凌明
季柯丞
张凌峰
李宽
时龙兴
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN201810229640.8A
Publication of CN108519906A
Application granted
Publication of CN108519906B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45554Instruction set architectures of guest OS and hypervisor or native processor differ, e.g. Bochs or VirtualPC on PowerPC MacOS

Abstract

The invention discloses a method for modeling the steady-state instruction throughput rate of a superscalar out-of-order processor. The method obtains, in each statistical stage, micro-architecture independent parameters related to the steady-state average throughput rate, at least including the dependent link delay distribution; classifies them with a clustering algorithm and selects a training set for a neural network; uses the micro-architecture independent parameters of the selected training set as the network input and the thread steady-state instruction throughput rates of the corresponding training set, obtained by cycle-accurate simulation, as the network output; and fits input to output, training a steady-state instruction throughput neural network model for the given hardware by adjusting the network's iteration count, topology, transfer functions, and preset training precision. Because the dependent link delay distribution is independent of the micro-architecture, the method can quickly and accurately predict the steady-state instruction throughput rate of a superscalar out-of-order processor with a given micro-architecture.

Description

Superscalar out-of-order processor steady state instruction throughput rate modeling method
Technical Field
The invention belongs to the technical field of computer architecture and modeling, and in particular relates to an instruction throughput rate modeling method, based on an artificial neural network, for the steady state of a superscalar out-of-order processor pipeline.
Background
Architecture evaluation and design space exploration based on hardware behavior modeling can provide guidance for chip design and shorten design iteration cycles. For a given processor running a given program, the steady-state instruction throughput rate of the out-of-order pipeline represents the performance limit of the processor when no miss event (such as a cache miss or a branch misprediction) occurs, and to some extent reflects how well the application is matched to the hardware. Accurately predicting instruction throughput in the steady state is the basis for analyzing and modeling the overall performance of an out-of-order processor.
The steady-state average instruction throughput of an out-of-order processor is the average number of instructions issued per clock cycle when no miss event occurs. Early estimates of this quantity were simple: the width of the front-end issue stage was taken directly as the steady-state average throughput, on the assumption that, absent miss events, the processor can process as many instructions per cycle as the front-end issue width allows. This ignores instruction dependences, the number and types of functional units, instruction latencies, the distribution of serializing instructions, and other factors, and is therefore a highly idealized assumption with large error.
Recent research has observed an exponential relationship between steady-state instruction throughput and instruction window size, with coefficients fitted from experimental measurements. This approach has two drawbacks. First, the steady-state throughput it produces is a constant that reflects only a long-run average and lacks dynamics. Second, the result is independent of the characteristics of the specific software load, cannot reflect differences between programs, and therefore carries large error.
The steady-state average throughput is not determined by each influencing factor acting alone: coupling effects between the factors also shape it, which makes analysis from mechanism first principles difficult. At the same time, full-function cycle-accurate simulation is prohibitively expensive in time. The present invention was made against this background.
Disclosure of Invention
In view of the above technical problems, the invention aims to provide a modeling method for the steady-state instruction throughput rate of a superscalar out-of-order processor that rapidly and accurately predicts the steady-state throughput of a given micro-architecture from the micro-architecture independent characteristic of the dependent link delay distribution, with both high precision and high speed.
The technical scheme of the invention is as follows:
a superscalar out-of-order processor steady state instruction throughput modeling method, comprising the steps of:
s01: acquiring micro-architecture independent parameters related to steady-state average throughput rate in each statistical stage, wherein the micro-architecture independent parameters at least comprise dependent link delay distribution;
s02: classifying by using a clustering algorithm, and selecting a training set of the neural network;
s03: using the micro-architecture independent parameters of the selected training set as the input of the neural network, and the thread steady-state instruction throughput rates of the corresponding training set, obtained by cycle-accurate simulation, as its output; fitting input to output, and training a steady-state instruction throughput neural network model for the given hardware by adjusting the iteration count, network topology, transfer functions, and preset training precision of the network.
Preferably, the obtaining of the dependent link delay distribution in step S01 includes:
s11: determining the dependency relationship between each instruction and other instructions in the instruction window when the instruction enters the instruction window by defining a structural body of the instruction dependent link, and counting the length of the dependent link;
s12: recording the type of each instruction by defining a structural body of the instruction type, and acquiring the time required by instruction execution according to the type of the instruction;
s13: and counting to obtain the dependent link delay distribution.
Preferably, the micro-architecture independent parameters further include a dynamic instruction stream blend ratio, a total time of execution of the target thread, and a total number of instructions executed.
Preferably, before step S01 the method further includes selecting a suitable fixed time-slice length, cutting the program execution flow (Trace) at every time slice so that the entire target program execution flow is divided into a plurality of segments, collecting a corresponding data set for each program segment, and using each program segment as a statistical stage.
Preferably, before step S02 the method further includes preprocessing the relevant micro-architecture independent parameters of each segment to form a micro-architecture independent parameter vector for that segment, and applying a data dimensionality-reduction algorithm to each such vector for dimensionality reduction and denoising, forming the micro-architecture independent data set of the corresponding segment.
Preferably, the method further comprises the following steps,
dividing a data set containing all the characteristic vectors in the statistical stage into a plurality of large classes;
for each major class, dividing the major class into minor classes in a certain proportion by using a k-means clustering algorithm;
and selecting the point closest to the central point in each subclass as a feature vector.
Compared with the existing prediction method of the steady-state average throughput rate, the method has the advantages that:
the invention utilizes the proposed dependent link delay distribution to better cover a plurality of microarchitectural independent parameters influencing the steady-state average throughput rate, which comprises the following steps: the dynamic instruction mixing ratio depends on the link delay distribution, and a more accurate steady-state throughput rate model can be established.
In addition, by using a neural network to predict the steady-state average instruction throughput rate, the coupling between the micro-architecture independent parameters is fully taken into account, and the steady-state average throughput value can be predicted accurately and quickly with the trained model.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a flow chart of a superscalar out-of-order processor steady state instruction throughput modeling method of the present invention;
FIG. 2 is a detailed flow chart of the method for training an artificial neural network model according to the present invention;
FIG. 3 is a flow chart of the dependent link delay distribution statistics method;
FIG. 4 is a neural network hierarchy diagram;
FIG. 5 is a block diagram of inputs and target outputs for neural network model training and testing.
Detailed Description
The above-described scheme is further illustrated below with reference to specific examples. It should be understood that these examples are for illustrative purposes and are not intended to limit the scope of the present invention. The conditions used in the examples may be further adjusted according to the conditions of the particular manufacturer, and the conditions not specified are generally the conditions in routine experiments.
Example (b):
as shown in FIGS. 1 and 2, the modeling method for the steady state instruction throughput rate of the superscalar out-of-order processor of the invention comprises the following steps:
(1) Select a suitable fixed time-slice length for program execution, and cut the program execution stream at that length while the instruction set simulator runs, so that the whole target program is divided into a number of segments; collect a corresponding data set for each segment. In this embodiment, thread switch intervals serve as segment boundaries: the interval from when the target thread is scheduled for execution until the operating system switches it out is taken as one statistical stage (Profiling Interval).
(2) Use the instruction set simulator to acquire, in each statistical stage, the micro-architecture independent parameters related to the steady-state average throughput rate, chiefly the dependent link delay distribution and the total instruction count of the target program.
Define a structure for the instruction dependent link to determine, as each instruction enters the instruction window, its dependences on the other instructions already in the window, and count the dependent link length; define a structure for the instruction type to record each instruction's type and obtain its execution latency from that type. Combining the two structures yields the dependent link delay distribution. At the same time, monitor the processor's issue stage and count the number of instructions issued and the number of clock cycles spent over a period of time or within a segment of the instruction stream.
For each statistical stage, the relevant micro-architecture independent parameters of the segment can then be obtained, including the dynamic instruction stream mix ratio (counts of floating-point, fixed-point, SIMD, load/store instructions, etc.), the dependent link delay distribution within each instruction window in the stage, the total run time of the target thread, and the total number of instructions executed.
Fig. 3 shows the detailed flow of computing the dependent link delay distribution. When an instruction enters the window, its type (multi-cycle or single-cycle instruction) and its dependences on instructions already in the window are detected, the dependent link length is calculated and recorded, and after the statistical stage completes, the delay values of dependent links of different lengths are tallied to produce the stage's dependent link delay distribution graph.
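The bookkeeping described above can be sketched as follows. This is an illustrative re-implementation, not the patent's code: the `Instr` class stands in for the two structures (dependent link and instruction type), and the per-type latencies are invented example values. Each instruction entering the window extends its longest producer chain; a histogram keyed by link length collects the accumulated delays.

```python
from collections import defaultdict

# Illustrative per-type latencies (single-cycle vs multi-cycle instructions).
LATENCY = {"alu": 1, "mul": 3, "load": 4}

class Instr:
    """Stand-in for the patent's dependent-link and instruction-type structures."""
    def __init__(self, itype, deps=()):
        self.latency = LATENCY[itype]
        self.deps = deps  # producer instructions already in the window
        # Extend the producer chain with the largest accumulated delay.
        longest = max(deps, key=lambda d: d.link_delay, default=None)
        self.link_len = 1 + (longest.link_len if longest else 0)
        self.link_delay = self.latency + (longest.link_delay if longest else 0)

def delay_distribution(instrs):
    """Histogram of accumulated link delays, keyed by dependent link length."""
    dist = defaultdict(list)
    for i in instrs:
        dist[i.link_len].append(i.link_delay)
    return dict(dist)

a = Instr("load")          # chain a            -> length 1, delay 4
b = Instr("mul", (a,))     # chain a -> b       -> length 2, delay 7
c = Instr("alu", (b,))     # chain a -> b -> c  -> length 3, delay 8
d = Instr("alu")           # independent        -> length 1, delay 1
dist = delay_distribution([a, b, c, d])
```

After a statistical stage, `dist` is the raw material for the delay distribution graph of Fig. 3: one delay population per link length.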
(3) To meet the input requirements of the artificial neural network, first preprocess the relevant micro-architecture independent parameters of each statistical stage's code segment into a micro-architecture independent parameter vector for that segment; then apply principal component analysis to each vector for dimensionality reduction and denoising (keeping the principal components that explain more than 95% of the variance of the original data), forming the segment's MicaData set (micro-architecture independent data set). Other dimensionality-reduction algorithms, such as LDA, may of course be used instead.
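The 95%-variance PCA step can be sketched generically as below. This is not the patent's implementation; in practice a library routine (for example scikit-learn's `PCA(n_components=0.95)`) would do the same job, and the toy parameter vectors here are invented.

```python
import numpy as np

def pca_keep_variance(X, threshold=0.95):
    """Project rows of X onto the fewest principal components whose
    cumulative explained variance reaches `threshold`."""
    Xc = X - X.mean(axis=0)                      # center each parameter
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]               # sort by descending variance
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(ratio, threshold)) + 1
    return Xc @ vecs[:, :k], k

# Toy "parameter vectors": the real variance lives in the first two dimensions.
X = np.zeros((100, 6))
X[:, 0] = np.repeat([1.0, -1.0], 50)
X[:, 1] = np.tile([1.0, -1.0], 50)
X[:, 2] = 0.001 * np.arange(100)                 # near-constant dimension
Z, k = pca_keep_variance(X)                      # two components survive
```

The reduced matrix `Z` is what would feed the MicaData set; its column count `k` depends only on where the cumulative variance crosses the threshold.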
In the embodiment, a BP neural network is adopted, and as shown in fig. 4, the invention is based on an empirical formula of the number of hidden layer nodes:
h = √(m + n) + a
where h is the number of hidden-layer nodes, m the number of output-layer nodes, n the number of input-layer nodes, and a a constant. The scheme uses a three-layer network: the input is the dependent link delay distribution, with 150 input-layer neurons; the hidden layer has 16 neurons; the output is the steady-state throughput value, with 1 output-layer neuron. The training method uses the Levenberg-Marquardt (LM) algorithm. A logsig transfer function is used between the input and hidden layers and a purelin transfer function between the hidden and output layers, and the weights between the nodes of each layer are adjusted with trainscg (scaled conjugate gradient).
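Plugging in the stated sizes, h = √(150 + 1) + a ≈ 12.3 + a, so a constant of about 4 yields the 16 hidden neurons used here. A minimal forward pass with the stated transfer functions is sketched below; the random weights and the dummy input vector are purely illustrative (MATLAB's logsig and purelin correspond to the sigmoid and identity functions used here).

```python
import math, random

def logsig(x):
    """MATLAB-style logsig: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

n, m, a = 150, 1, 4                        # input nodes, output nodes, constant
h = round(math.sqrt(n + m)) + a            # empirical hidden-layer size: 16

random.seed(0)                             # illustrative, untrained weights
W1 = [[random.gauss(0, 0.1) for _ in range(n)] for _ in range(h)]
W2 = [random.gauss(0, 0.1) for _ in range(h)]

def predict(x):
    """logsig hidden layer, purelin (identity) output layer."""
    hidden = [logsig(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * hi for w, hi in zip(W2, hidden))   # purelin: no squashing

ipc = predict([0.01] * n)                  # dummy delay-distribution vector
```

Training (LM or trainscg) would adjust `W1` and `W2` to fit the simulated throughput targets; only the topology and transfer functions are taken from the text.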
(4) Extract feature vectors from the statistical-stage segments retained for the target thread. First, a self-organizing map (SOM) divides the data set containing all statistical-stage feature vectors into 200 large classes (the number of classes can be adjusted according to the classification results); then each large class (say with N feature vectors) is divided into N × 30% subclasses with the k-means clustering algorithm (the ratio can be tuned according to experimental results); finally, the point closest to each subclass centroid is selected as a representative, clearly characterized feature vector.
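The subclass step, k-means followed by picking the point nearest each centroid, can be sketched as follows. This is a toy re-implementation: the SOM coarse pass and the 200-class/30% settings from the text are omitted, and the sample points are invented.

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on tuples of floats."""
    random.seed(seed)
    cents = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dist2(p, cents[j]))
            clusters[j].append(p)
        cents = [mean(c) if c else cents[j] for j, c in enumerate(clusters)]
    return cents, clusters

def representatives(points, k):
    """One feature vector per subclass: the point nearest its centroid."""
    cents, clusters = kmeans(points, k)
    return [min(c, key=lambda p: dist2(p, ct)) for ct, c in zip(cents, clusters) if c]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.1)]
reps = representatives(pts, 2)             # one representative per cluster
```

The selected representatives are actual data points, not centroids, which is what makes them usable as training samples with known simulated throughput targets.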
(5) All feature points selected by the clustering algorithm are used as the input of a BP neural network, whose output is the steady-state average throughput rate of the target thread obtained in step 2); the input and output are fitted, and a BP neural network model of the pipeline steady-state instruction throughput rate under the current hardware architecture is obtained by training, adjusting the iteration count, network topology, transfer functions, and preset training precision of the BP network.
The data produced by the dimensionality-reduction algorithm are classified by the clustering algorithm so that representative samples can be selected as the neural network's training set. The purpose of selecting the training set by clustering is to shrink the training set as far as possible while preserving the main information of the original data.
(6) The model obtained in step 5) can be used to predict the pipeline steady-state instruction throughput of other software under the given hardware architecture. Run the target program on an instruction set simulator with software instrumentation added, collect the statistics related to the dependent link delay distribution, preprocess the data, and feed it into the model from step 5); the instruction throughput of the target thread in the steady state of the out-of-order processor under the current hardware architecture can then be predicted quickly and accurately.
FIG. 5 shows the inputs and target outputs during neural network model training and application. Full-function cycle-accurate simulation provides the parameter input (micro-architecture independent parameters) and the target output (steady-state instruction throughput) used to train the model, yielding a model of higher accuracy. At prediction time (when the model is applied), the steady-state average throughput can be predicted rapidly by obtaining the relevant parameters of the target application from an instruction-level simulator, or any Trace generator faster than a full-function cycle-accurate simulator, and feeding them into the model. In the figure, the solid lines show the training flow and the dashed lines the prediction flow.
The above examples are only for illustrating the technical idea and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the content of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (5)

1. A method for modeling steady state instruction throughput of a superscalar out-of-order processor, comprising the steps of:
s01: acquiring micro-architecture independent parameters related to steady-state average throughput rate in each statistical stage, wherein the micro-architecture independent parameters at least comprise dependent link delay distribution;
the obtaining of the dependent link delay profile in step S01 includes:
s11: determining the dependency relationship between each instruction and other instructions in the instruction window when the instruction enters the instruction window by defining a structural body of the instruction dependent link, and counting the length of the dependent link;
s12: recording the type of each instruction through a structural body defining the type of the instruction, and acquiring the time required by instruction execution according to the type of the instruction;
s13: counting to obtain dependent link delay distribution;
s02: classifying by using a clustering algorithm, and selecting a training set of the neural network;
s03: using the micro-architecture independent parameters of the selected training set as the input of the neural network, and the thread steady-state instruction throughput rates of the corresponding training set, obtained by cycle-accurate simulation, as its output; fitting input to output, and training a steady-state instruction throughput neural network model for the given hardware by adjusting the iteration count, network topology, transfer functions, and preset training precision of the network.
2. The method of superscalar out-of-order processor steady state instruction throughput modeling, as recited in claim 1, wherein said microarchitecture independent parameters further comprise dynamic instruction stream blend ratio, total time of execution of a target thread, and total number of instructions executed.
3. The method according to claim 1, further comprising, prior to step S01, selecting a fixed time-slice length, cutting the program execution flow (Trace) at every time slice so that the entire target program execution flow is divided into a plurality of segments, collecting a corresponding data set for each program segment, and using each program segment as a statistical stage.
4. The method of modeling steady state instruction throughput of a superscalar out-of-order processor of claim 1, further comprising, prior to step S02, preprocessing the relevant micro-architecture independent parameters of each segment to form a micro-architecture independent parameter vector for the corresponding segment; and applying a data dimensionality-reduction algorithm to the parameter vectors for dimensionality reduction and denoising, to form a micro-architecture independent data set of the corresponding segment.
5. The method of superscalar out-of-order processor steady state instruction throughput modeling as recited in claim 4 further comprising,
dividing a data set containing all the characteristic vectors in the statistical stage into a plurality of large classes;
for each major class, dividing the major class into minor classes in a certain proportion by using a k-means clustering algorithm;
and selecting the point closest to the central point in each subclass as a feature vector.
CN201810229640.8A 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method Active CN108519906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810229640.8A CN108519906B (en) 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method


Publications (2)

Publication Number Publication Date
CN108519906A (en) 2018-09-11
CN108519906B (en) 2022-03-22

Family

ID=63434021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810229640.8A Active CN108519906B (en) 2018-03-20 2018-03-20 Superscalar out-of-order processor steady state instruction throughput rate modeling method

Country Status (1)

Country Link
CN (1) CN108519906B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102652304A (en) * 2009-12-22 2012-08-29 国际商业机器公司 Predicting and avoiding operand-store-compare hazards in out-of-order microprocessors
CN103577159A (en) * 2012-08-07 2014-02-12 想象力科技有限公司 Multi-stage register renaming using dependency removal
CN105630458A (en) * 2015-12-29 2016-06-01 东南大学—无锡集成电路技术研究所 Prediction method of out-of-order processor steady-state average throughput rate based on artificial neural network




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant