CN111159011A - Instruction vulnerability prediction method and system based on deep random forest - Google Patents
- Publication number
- CN111159011A CN111159011A CN201911248246.XA CN201911248246A CN111159011A CN 111159011 A CN111159011 A CN 111159011A CN 201911248246 A CN201911248246 A CN 201911248246A CN 111159011 A CN111159011 A CN 111159011A
- Authority
- CN
- China
- Prior art keywords
- instruction
- vulnerability
- forest
- sample
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
- G06F11/3608—Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses a method and a system for predicting instruction vulnerability based on a deep random forest. The method comprises the following steps: extracting, for each program instruction, instruction feature information related to instruction vulnerability, and generating an instruction feature vector characterizing that vulnerability; performing fault injection on a training program to obtain the vulnerability value of each program instruction; combining the instruction feature vectors and instruction vulnerability values to generate an instruction vulnerability sample data set; performing sliding sampling on the sample data set through a sliding window to generate an expanded sample data set; constructing and training a deep-random-forest-based instruction vulnerability prediction model; and extracting the instruction feature vectors of a target program to be predicted and, in combination with the prediction model, predicting the instruction vulnerability of that program. The system implements the method. The method achieves high prediction accuracy with a small sample set and little manual parameter tuning, and can be effectively applied to predicting instruction vulnerability after a program is affected by transient faults.
Description
Technical Field
The invention belongs to the field of software reinforcement and software reliability, and particularly relates to a method and a system for predicting instruction vulnerability based on deep random forest.
Background
With the rapid development of semiconductor manufacturing processes, the feature size of computer chips keeps shrinking, so their sensitivity to space radiation is greatly increased. In a space radiation environment, the single event upset effect produced by high-energy particle irradiation or electromagnetic interference on integrated circuit chips fabricated at advanced process nodes is one of the main causes of computer system failure. The single event upset (SEU) effect refers to the phenomenon in which a stored bit is disturbed and its logic state flips; such an upset is generally referred to as a soft error.
Soft errors generally fall into three categories: (1) the soft error has no effect on the normal operation of the program; (2) the soft error causes the program to crash or hang; (3) the soft error causes an implicit error, i.e., the program appears to run normally but its result is wrong; this last kind is generally referred to as SDC (Silent Data Corruption). The first category does not affect program operation, and the second is serious but easy to detect. Compared with these two categories, SDC causes more serious program problems because of its implicit propagation.
For software SDC errors, the conventional redundant-instruction error detection method duplicates all instructions in a program, which incurs a huge performance overhead; current research on redundancy techniques therefore focuses on selecting the fragile instructions in a program for partial redundancy, so as to reduce this overhead. Existing selection methods fall into three categories: (1) fault-injection-based selection; (2) program-analysis-based selection; (3) machine-learning-based selection. Fault-injection-based selection injects faults into program instructions one by one, then screens out highly vulnerable instructions for redundancy reinforcement by observing the injection results. Program-analysis-based selection determines instruction vulnerability through program analysis; for example, the article "Error-flow model: Modeling and analysis of software propagating hardware faults" establishes an error propagation model that computes instruction SDC vulnerability by analyzing the propagation paths of transient faults in a program. Machine-learning-based selection combines the advantages of the former two, avoiding the complex propagation calculation while reducing fault injection overhead. In recent years, the prediction of fragile program instructions by methods such as support vector machines and neural networks has been studied, but such methods require large prediction data sets and complicated manual parameter tuning to achieve high accuracy.
Disclosure of Invention
The invention aims to provide a method and a system for predicting instruction vulnerability that reduce the data set size and parameter tuning complexity required for model training, and that can be applied to large-scale programs or complex environments.
The technical solution for realizing the purpose of the invention is as follows: an instruction vulnerability prediction method based on a deep random forest comprises the following steps:
Step 1, performing static analysis on a training program, extracting, for each program instruction, instruction feature information related to instruction vulnerability, and generating the instruction feature vector V_features characterizing the instruction vulnerability of that program instruction;
Step 2, performing fault injection on the training program to obtain the vulnerability value P_SDC(I_i) of each program instruction;
Step 3, combining the instruction feature vectors V_features and instruction vulnerability values P_SDC(I_i) to generate an instruction vulnerability sample data set D, wherein each sample S in the data set comprises the instruction feature vector V_features of a program instruction and its vulnerability value P_SDC(I_i);
Step 4, performing sliding sampling on the instruction vulnerability sample data set D through a sliding window model to obtain instruction-sequence expansion features of the sample data and generate an expanded sample data set;
Step 5, constructing and training a deep-random-forest-based instruction vulnerability prediction model on the expanded sample data set;
Step 6, extracting the instruction feature vectors of a target program to be predicted according to the process of step 1, and predicting the instruction vulnerability of that program in combination with the prediction model obtained in step 5.
Further, the instruction feature vector V_features characterizing instruction vulnerability in step 1 is the following 7-tuple:
V_features = <V_tran_bran, V_comp, V_addr, V_mask, V_loop, V_arith, V_block>
where V_tran_bran denotes branch- and transfer-related instruction features, comprising the branch feature f_is_branch, the function-call feature f_is_call, and the return-instruction feature f_is_return; V_comp denotes comparison-instruction features, comprising the integer-compare feature f_is_int_cmp and the floating-point-compare feature f_is_float_cmp; V_addr denotes address-related features, comprising the used-in-address feature f_is_used_in_addr, the destination-operand width feature f_dest_op_width, and the store-instruction feature f_is_used_store; V_mask denotes fault-masking features, comprising the logical-AND feature f_is_and, the logical-OR feature f_is_or, and the logical-shift feature f_is_sh; V_loop denotes loop-related features, comprising the in-loop position feature f_is_loop and the loop-depth feature f_loop_d; V_arith denotes arithmetic-operation features, comprising the add/subtract feature f_is_add/sub and the multiply/divide feature f_is_mul/div; V_block denotes basic-block features, comprising the basic-block length f_bb_length, the number of instructions remaining to execute in the block f_bb_remain_ins_num, the number of predecessor blocks f_pred_bb_num, and the number of successor blocks f_suc_bb_num.
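The 19 features of the 7-tuple above can be laid out as a flat record. The sketch below (an illustrative assumption, not part of the patent) shows them as a Python dict with placeholder default values; how each value is extracted from real intermediate code is left abstract:

```python
def make_feature_vector():
    """One instruction's V_features as a flat dict (names follow the patent)."""
    return {
        # V_tran_bran: branch/transfer-related features
        "f_is_branch": 0, "f_is_call": 0, "f_is_return": 0,
        # V_comp: comparison-instruction features
        "f_is_int_cmp": 0, "f_is_float_cmp": 0,
        # V_addr: address-related features
        "f_is_used_in_addr": 0, "f_dest_op_width": 32, "f_is_used_store": 0,
        # V_mask: fault-masking features
        "f_is_and": 0, "f_is_or": 0, "f_is_sh": 0,
        # V_loop: loop-related features
        "f_is_loop": 0, "f_loop_d": 0,
        # V_arith: arithmetic-operation features
        "f_is_add/sub": 0, "f_is_mul/div": 0,
        # V_block: basic-block features
        "f_bb_length": 0, "f_bb_remain_ins_num": 0,
        "f_pred_bb_num": 0, "f_suc_bb_num": 0,
    }
```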
Further, in step 2, the vulnerability value P_SDC(I_i) of each program instruction is computed as:
P_SDC(I_i) = (1/w) × Σ_{j=1}^{w} (M_j / F_j)
where I_i denotes the i-th program instruction, P_SDC(I_i) its SDC vulnerability value, w the bit width of the instruction's destination register, M_j the number of SDC failures observed after fault injection at the j-th bit of I_i, and F_j the total number of fault injections performed at the j-th bit of I_i.
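The per-bit averaging above can be sketched directly; the function name and list-based interface are illustrative assumptions:

```python
def sdc_vulnerability(sdc_counts, injection_counts):
    """P_SDC(I_i) for one instruction.

    sdc_counts[j] = M_j (SDC failures at bit j),
    injection_counts[j] = F_j (injections at bit j),
    len(injection_counts) = w (destination register bit width).
    """
    w = len(injection_counts)
    return sum(m / f for m, f in zip(sdc_counts, injection_counts)) / w

# e.g. a 4-bit destination register with 10 injections per bit:
# sdc_vulnerability([2, 0, 5, 1], [10, 10, 10, 10]) -> (0.2+0.0+0.5+0.1)/4 = 0.2
```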
Further, in step 4, performing sliding sampling on the instruction vulnerability sample data set D through the sliding window model, obtaining the instruction-sequence expansion features of the sample data, and generating the expanded sample data set specifically comprises:
Initialization: let m = 2, with m ∈ N*, 2 ≤ m ≤ p, p ∈ N*, the value of p being set by the user;
Step 4-1, constructing the sliding window model:
W_m = m × n
where W_m is the width of the sliding window and n, the number of features of each sample S in the instruction vulnerability sample data set D, is also the sliding step of the window;
Step 4-2, splicing M samples of the instruction vulnerability sample data set D into a new sample E_i, whose number of features is M × n;
Step 4-3, performing sliding sampling on E_i with the sliding window model to obtain M+1−m samples of size W_m;
Step 4-4, training two random forest regression models on the M+1−m samples of size W_m to obtain 2(M+1−m) regression values as expansion features; the label used for a sample during training is the label of a sample chosen at random among the M spliced samples, that label being the vulnerability value P_SDC(I_i);
Step 4-5, incrementing m by 1 and judging whether m is larger than p; if not, returning to step 4-1; otherwise, outputting the (p−1)×(2M−p)-dimensional expansion features accumulated over the whole loop;
Step 4-6, splicing the original n-dimensional features of each sample S with the (p−1)×(2M−p)-dimensional expansion features to obtain an expanded sample with (p−1)×(2M−p)+n features for S, thereby generating the expanded sample data set.
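Steps 4-1 to 4-6 above can be sketched as follows. This is a simplified rendering under several assumptions (one expanded sample per group of M consecutive originals, the last sample's label used for training, small illustrative forest sizes), with scikit-learn's `RandomForestRegressor` standing in for the patent's random forest regression models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def sliding_expand(D, labels, M=10, p=5):
    """Sliding-window feature expansion: D is a (num_samples, n) matrix,
    labels holds the per-sample P_SDC values.  Returns expanded samples
    with (p-1)*(2*M - p) + n features each."""
    num, n = D.shape
    out = []
    for start in range(0, num - M + 1, M):
        E = D[start:start + M].reshape(-1)      # step 4-2: splice M samples -> M*n features
        label = labels[start + M - 1]           # assumed: label of the last spliced sample
        feats = []
        for m in range(2, p + 1):               # steps 4-1 / 4-5: window sizes m = 2..p
            Wm = m * n                          # window width W_m = m*n, stride n
            windows = np.stack([E[j * n: j * n + Wm]
                                for j in range(M + 1 - m)])   # step 4-3
            y = np.full(len(windows), label)
            for seed in (0, 1):                 # step 4-4: two random forest regressors
                rf = RandomForestRegressor(n_estimators=10, random_state=seed)
                rf.fit(windows, y)
                feats.extend(rf.predict(windows))   # M+1-m regression values each
        out.append(np.concatenate([D[start + M - 1], feats]))  # step 4-6
    return np.array(out)
```

For p = 5 and M = 10 the inner loop contributes 18 + 16 + 14 + 12 = 60 expansion features, matching (p−1)×(2M−p).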
Further, step 5, constructing and training the deep-random-forest-based instruction vulnerability prediction model on the expanded sample data set, specifically comprises:
Step 5-1, constructing the first layer of the cascade regression forest, which comprises N random forests, and taking the expanded sample data set obtained in step 4 as the initial input vector of the deep random forest regression, thereby producing an output vector comprising N enhanced features;
Step 5-2, constructing the next layer of the cascade regression forest: splicing the output vector v_enhanced of the previous layer with the input vector v_input into [v_input, v_enhanced] as the input of this layer, then evaluating the accuracy of the whole cascade forest up to this layer by cross validation, i.e. computing the mean square error between the regression results of all random forests on this layer and the true values;
Step 5-3, judging whether the accuracy obtained in step 5-2 improves on that of the previous layer of the cascade; if so, returning to step 5-2; otherwise, judging that the accuracy has reached its threshold, adding no further layers, and ending the construction and training process, yielding the deep-random-forest-based instruction vulnerability prediction model, whose prediction is the mean of the regressions of all random forests in the last layer of the cascade.
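Steps 5-1 to 5-3 resemble a gcForest-style cascade. The sketch below is a simplified rendering under several assumptions (N = 4 forests per layer, 3-fold cross validation, out-of-fold predictions used as enhanced features), not the patent's exact procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

def fit_cascade(X, y, max_layers=10):
    """Grow cascade layers while cross-validated MSE keeps improving (step 5-3)."""
    layers, v_input, best_mse = [], X, np.inf
    for _ in range(max_layers):
        # one layer: N = 4 forests (two ordinary, two extremely randomized)
        forests = ([RandomForestRegressor(n_estimators=50, random_state=i) for i in range(2)]
                   + [ExtraTreesRegressor(n_estimators=50, random_state=i) for i in range(2)])
        # out-of-fold predictions serve as the layer's enhanced features
        v_enhanced = np.column_stack(
            [cross_val_predict(f, v_input, y, cv=3) for f in forests])
        mse = mean_squared_error(y, v_enhanced.mean(axis=1))
        if mse >= best_mse:            # accuracy stopped improving: stop growing
            break
        best_mse = mse
        for f in forests:
            f.fit(v_input, y)
        layers.append(forests)
        v_input = np.column_stack([X, v_enhanced])   # step 5-2: [v_input, v_enhanced]
    return layers

def cascade_predict(layers, X):
    """Prediction = mean of the last layer's forest regressions (step 5-3)."""
    v = X
    for forests in layers:
        preds = np.column_stack([f.predict(v) for f in forests])
        v = np.column_stack([X, preds])
    return preds.mean(axis=1)
```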
A deep-random-forest-based instruction vulnerability prediction system, the system comprising:
a first feature extraction module for performing static analysis on a training program, extracting, for each program instruction, instruction feature information related to instruction vulnerability, and generating the instruction feature vector V_features characterizing the instruction vulnerability of that program instruction;
a second feature extraction module for performing fault injection on the training program to obtain the vulnerability value P_SDC(I_i) of each program instruction;
a first sample data set construction module for combining the instruction feature vectors V_features and instruction vulnerability values P_SDC(I_i) to generate an instruction vulnerability sample data set D, wherein each sample S in the data set comprises the instruction feature vector V_features of a program instruction and its vulnerability value P_SDC(I_i);
a second sample data set construction module for performing sliding sampling on the instruction vulnerability sample data set D through a sliding window model to obtain instruction-sequence expansion features of the sample data and generate an expanded sample data set;
a prediction model construction module for constructing and training a deep-random-forest-based instruction vulnerability prediction model on the expanded sample data set;
and a prediction module for extracting the instruction feature vectors of a target program to be predicted according to the working process of the first feature extraction module and predicting the instruction vulnerability of that program in combination with the prediction model.
Further, the second sample data set construction module comprises, executed in sequence:
a parameter initialization unit for initializing m = 2, with m ∈ N*, 2 ≤ m ≤ p, p ∈ N*;
a sliding window model construction unit for constructing the sliding window model W_m = m × n, where W_m is the width of the sliding window and n, the number of features of each sample S in the instruction vulnerability sample data set D, is also the sliding step of the window;
a sample splicing unit for splicing M samples of the instruction vulnerability sample data set D into a new sample E_i with M × n features;
a sliding sampling unit for performing sliding sampling on E_i with the sliding window model to obtain M+1−m samples of size W_m;
a training unit for training two random forest regression models on the M+1−m samples of size W_m to obtain 2(M+1−m) regression values as expansion features, the label used for a sample during training being the label of a sample chosen at random among the M spliced samples, that label being the vulnerability value P_SDC(I_i);
a first judging unit for incrementing m by 1 and judging whether m is larger than p, returning to the sliding window model construction unit if not, and otherwise outputting the (p−1)×(2M−p)-dimensional expansion features accumulated over the whole loop;
and an expanded sample data set construction unit for splicing the original n-dimensional features of each sample S with the (p−1)×(2M−p)-dimensional expansion features to obtain an expanded sample with (p−1)×(2M−p)+n features for S, thereby generating the expanded sample data set.
Further, the prediction model construction module comprises, executed in sequence:
a first cascade regression forest construction unit for constructing the first layer of the cascade regression forest, which comprises N random forests, and taking the expanded sample data set obtained by the second sample data set construction module as the initial input vector of the deep random forest regression, thereby producing an output vector comprising N enhanced features;
a second cascade regression forest construction unit for constructing the next layer of the cascade regression forest by splicing the output vector v_enhanced of the previous layer with the input vector v_input into [v_input, v_enhanced] as the input of this layer, and then evaluating the accuracy of the whole cascade forest up to this layer by cross validation, i.e. computing the mean square error between the regression results of all random forests on this layer and the true values;
and a second judging unit for judging whether the accuracy obtained by the second cascade regression forest construction unit improves on that of the previous layer, returning to the second cascade regression forest construction unit if so, and otherwise judging that the accuracy has reached its threshold, adding no further layers, and ending the construction and training process, yielding the deep-random-forest-based instruction vulnerability prediction model, whose prediction is the mean of the regressions of all random forests in the last layer of the cascade.
Compared with the prior art, the invention has the following remarkable advantages: 1) the deep random forest model achieves high prediction accuracy on small-scale samples, so the prediction model requires only a small amount of training data collection and has low complexity; 2) the deep random forest model automatically adjusts its cascade depth according to the training accuracy, reducing the difficulty of parameter tuning while keeping prediction accuracy high; 3) sequence features between instruction samples are extracted by sliding-window scanning, so that the feature space reflects instruction SDC vulnerability more accurately, improving prediction accuracy.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
FIG. 1 is a flowchart of an instruction vulnerability prediction method based on a deep random forest according to the present invention.
FIG. 2 is a comparison graph of accuracy of prediction results in an embodiment of the present invention.
FIG. 3 is a comparison graph of the mean square error of the prediction results of the present invention and of other prediction methods in an embodiment of the present invention.
Detailed Description
With reference to fig. 1, the present invention provides a method for predicting instruction vulnerability based on a deep random forest, which includes the following steps:
Step 1, performing static analysis on a training program, extracting, for each program instruction, instruction feature information related to instruction vulnerability, and generating the instruction feature vector V_features characterizing the instruction vulnerability of that program instruction, where V_features is the following 7-tuple:
V_features = <V_tran_bran, V_comp, V_addr, V_mask, V_loop, V_arith, V_block>
where V_tran_bran denotes branch- and transfer-related instruction features, comprising the branch feature f_is_branch, the function-call feature f_is_call, and the return-instruction feature f_is_return; V_comp denotes comparison-instruction features, comprising the integer-compare feature f_is_int_cmp and the floating-point-compare feature f_is_float_cmp; V_addr denotes address-related features, comprising the used-in-address feature f_is_used_in_addr, the destination-operand width feature f_dest_op_width, and the store-instruction feature f_is_used_store; V_mask denotes fault-masking features, comprising the logical-AND feature f_is_and, the logical-OR feature f_is_or, and the logical-shift feature f_is_sh; V_loop denotes loop-related features, comprising the in-loop position feature f_is_loop and the loop-depth feature f_loop_d; V_arith denotes arithmetic-operation features, comprising the add/subtract feature f_is_add/sub and the multiply/divide feature f_is_mul/div; V_block denotes basic-block features, comprising the basic-block length f_bb_length, the number of instructions remaining to execute in the block f_bb_remain_ins_num, the number of predecessor blocks f_pred_bb_num, and the number of successor blocks f_suc_bb_num.
Step 2, performing fault injection on the training program to obtain the vulnerability value P_SDC(I_i) of each program instruction, computed as:
P_SDC(I_i) = (1/w) × Σ_{j=1}^{w} (M_j / F_j)
where I_i denotes the i-th program instruction, P_SDC(I_i) its SDC vulnerability value, w the bit width of the instruction's destination register, M_j the number of SDC failures observed after fault injection at the j-th bit of I_i, and F_j the total number of fault injections performed at the j-th bit of I_i.
Step 3, combining the instruction feature vectors V_features and instruction vulnerability values P_SDC(I_i) to generate an instruction vulnerability sample data set D, wherein each sample S in the data set comprises the instruction feature vector V_features of a program instruction and its vulnerability value P_SDC(I_i).
Step 4, performing sliding sampling on the instruction vulnerability sample data set D through the sliding window model to obtain the instruction-sequence expansion features of the sample data and generate an expanded sample data set, specifically comprising:
Initialization: let m = 2, with m ∈ N*, 2 ≤ m ≤ p, p ∈ N*, the value of p being set by the user;
Step 4-1, constructing the sliding window model:
W_m = m × n
where W_m is the width of the sliding window and n, the number of features of each sample S in the instruction vulnerability sample data set D, is also the sliding step of the window;
Step 4-2, splicing M samples of the instruction vulnerability sample data set D into a new sample E_i, whose number of features is M × n;
Step 4-3, performing sliding sampling on E_i with the sliding window model to obtain M+1−m samples of size W_m;
Step 4-4, training two random forest regression models on the M+1−m samples of size W_m to obtain 2(M+1−m) regression values as expansion features; the label used for a sample during training is the label of a sample chosen at random among the M spliced samples, that label being the vulnerability value P_SDC(I_i);
Step 4-5, incrementing m by 1 and judging whether m is larger than p; if not, returning to step 4-1; otherwise, outputting the (p−1)×(2M−p)-dimensional expansion features accumulated over the whole loop;
Step 4-6, splicing the original n-dimensional features of each sample S with the (p−1)×(2M−p)-dimensional expansion features to obtain an expanded sample with (p−1)×(2M−p)+n features for S, thereby generating the expanded sample data set.
As a further preference, take p = 5 and M = 10; step 4 then specifically comprises:
Initialization: let m = 2, with m ∈ N*, 2 ≤ m ≤ 5;
Step 4-1, constructing the sliding window model:
W_m = m × n
where W_m is the width of the sliding window and n, the number of features of each sample S in the instruction vulnerability sample data set D, is also the sliding step of the window;
Step 4-2, splicing 10 samples of the instruction vulnerability sample data set D into a new sample E_i, whose number of features is 10 × n;
Step 4-3, performing sliding sampling on E_i with the sliding window model to obtain 11−m samples of size W_m;
Step 4-4, training two random forest regression models on the 11−m samples of size W_m to obtain 2(11−m) regression values as expansion features; the label used for a sample during training is the label of the 10th sample, that label being the vulnerability value P_SDC(I_i);
Step 4-5, incrementing m by 1 and judging whether m is larger than 5; if not, returning to step 4-1; otherwise, outputting the 60-dimensional expansion features accumulated over the whole loop;
Step 4-6, splicing the original n-dimensional features of each sample S with the 60-dimensional expansion features to obtain an expanded sample with 60+n features for S, thereby generating the expanded sample data set.
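The 60-dimensional figure quoted for p = 5 and M = 10 follows from summing the 2(11−m) regression values over m = 2..5, which matches the closed form (p−1)×(2M−p):

```python
# Verify the dimension arithmetic of the p = 5, M = 10 embodiment.
M, p = 10, 5
per_m = [2 * (M + 1 - m) for m in range(2, p + 1)]  # 2(M+1-m) values per window size m
assert per_m == [18, 16, 14, 12]
assert sum(per_m) == (p - 1) * (2 * M - p) == 60
```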
Step 5, constructing and training the deep-random-forest-based instruction vulnerability prediction model on the expanded sample data set, specifically comprising:
Step 5-1, constructing the first layer of the cascade regression forest, which comprises N random forests, and taking the expanded sample data set obtained in step 4 as the initial input vector of the deep random forest regression, thereby producing an output vector comprising N enhanced features;
Step 5-2, constructing the next layer of the cascade regression forest: splicing the output vector v_enhanced of the previous layer with the input vector v_input into [v_input, v_enhanced] as the input of this layer, then evaluating the accuracy of the whole cascade forest up to this layer by cross validation, i.e. computing the mean square error between the regression results of all random forests on this layer and the true values;
Step 5-3, judging whether the accuracy obtained in step 5-2 improves on that of the previous layer of the cascade; if so, returning to step 5-2; otherwise, judging that the accuracy has reached its threshold, adding no further layers, and ending the construction and training process, yielding the deep-random-forest-based instruction vulnerability prediction model, whose prediction is the mean of the regressions of all random forests in the last layer of the cascade.
As a further preference, the cascade regression forest layer in step 5-1 comprises N = 4 random forests, namely two random forest regression models f_normal and two extremely randomized forest regression models f_extremely.
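Under the common reading that f_normal maps to ordinary random forests and f_extremely to extremely randomized (extra-trees) forests, one layer's N = 4 forests might be instantiated as below; the hyperparameters are illustrative assumptions, not taken from the patent:

```python
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

def make_layer_forests(n_estimators=100):
    """One cascade layer: two f_normal and two f_extremely regressors."""
    f_normal = [RandomForestRegressor(n_estimators=n_estimators, random_state=i)
                for i in range(2)]
    f_extremely = [ExtraTreesRegressor(n_estimators=n_estimators, random_state=i)
                   for i in range(2)]
    return f_normal + f_extremely
```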
Step 6, extracting the instruction feature vectors of the target program to be predicted according to the process of step 1, and predicting the instruction vulnerability of that program in combination with the prediction model obtained in step 5.
The invention provides an instruction vulnerability prediction system based on a deep random forest, which comprises:
a first feature extraction module for performing static analysis on the training program, extracting, for each program instruction, instruction feature information related to instruction vulnerability, and generating the instruction feature vector V_features characterizing the instruction vulnerability of that program instruction;
a second feature extraction module for performing fault injection on the training program to obtain the vulnerability value P_SDC(I_i) of each program instruction;
a first sample data set construction module for combining the instruction feature vectors V_features and instruction vulnerability values P_SDC(I_i) to generate an instruction vulnerability sample data set D, wherein each sample S in the data set comprises the instruction feature vector V_features of a program instruction and its vulnerability value P_SDC(I_i).
The second sample data set construction module is used for performing sliding sampling on the instruction vulnerability sample data set D through a sliding window model, obtaining instruction-sequence expansion features of the sample data, and generating an expansion sample data set. The module specifically comprises:
a parameter initialization unit for initializing m = 2, where m ∈ N*, 2 ≤ m ≤ p, p ∈ N*, and the value of p is user-defined;
a sliding window model construction unit, configured to construct the sliding window model as follows:
W_m = m × n
where W_m is the width of the sliding window, n is the number of features of each sample S in the instruction vulnerability sample data set D, and n is also the sliding step length of the sliding window;
a sample splicing unit for splicing M samples in the instruction vulnerability sample data set D into a new sample E_i, the feature number of E_i being M × n;
a sliding sampling unit for performing sliding sampling on E_i with the sliding window model to obtain M + 1 − m samples of size W_m;
a training unit for training two random forest regression models on the M + 1 − m samples of size W_m to obtain 2(M + 1 − m) regression values as expansion features, wherein the label value used during training is the label value of a sample randomly selected from the M samples, that label value being the vulnerability value P_SDC(I_i);
a first judging unit for increasing m by 1 and judging whether m is larger than p; if not, execution returns to the sliding window model construction unit; otherwise, the (p − 1) × (2M − p)-dimensional expansion features obtained over the whole loop are output;
an expansion sample data set construction unit for splicing the original n-dimensional features of each sample S with the (p − 1) × (2M − p)-dimensional expansion features to obtain an expansion sample with (p − 1) × (2M − p) + n dimensional features corresponding to the sample S, thereby generating the expansion sample data set.
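As a concrete illustration, the sliding-window expansion performed by this module can be sketched in Python as follows. This is a minimal sketch assuming NumPy and scikit-learn are available; the function name `expand_window_features` and the forest hyperparameters are our own choices, not specified by the patent.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def expand_window_features(X, y, M=10, p=5, seed=0):
    """Sketch of the sliding-window expansion: splice M samples into one
    long vector E_i, slide a window of width W_m = m*n with step n for
    m = 2..p, and use two random forest regressors per window width to
    produce (p-1)*(2M-p) regression values as expansion features."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    E = X[:M].reshape(-1)                      # new sample E_i with M*n features
    expansion = []
    for m in range(2, p + 1):
        Wm = m * n                             # window width
        # sliding with step n yields M+1-m windows of size W_m
        windows = np.stack([E[i * n:i * n + Wm] for i in range(M + 1 - m)])
        # label: vulnerability value of a randomly chosen sample among the M
        labels = np.full(len(windows), y[rng.integers(0, M)])
        for rs in (0, 1):                      # two random forest regressors
            rf = RandomForestRegressor(n_estimators=10, random_state=rs)
            rf.fit(windows, labels)
            expansion.extend(rf.predict(windows))  # M+1-m values per forest
    return np.array(expansion)                 # (p-1)*(2M-p) expansion features

# toy data: 12 samples with n = 3 features -> 4 * (20 - 5) = 60 expansion values
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))
y = rng.random(12)
ext = expand_window_features(X, y)
extended_sample = np.concatenate([X[0], ext])  # n + 60 = 63-dim expansion sample
```

With the embodiment's values (n = 21, M = 10, p = 5) the same loop yields the 60-dimensional expansion and 81-dimensional expansion samples described below.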
The prediction model construction module is used for constructing and training a deep-random-forest-based instruction vulnerability prediction model on the expansion sample data set. The module comprises the following units, executed in order:
a first cascade regression forest construction unit for constructing the first layer of the cascade regression forest, which comprises N random forests, and taking the expansion sample data set obtained by the second sample data set construction module as the initial input vector of the deep random forest regression, thereby outputting an output vector comprising N enhanced features;
a second cascade regression forest construction unit for constructing the next layer of the cascade regression forest, splicing the output vector v_enhanced of the previous layer and the input vector v_input into [v_input, v_enhanced] as the input of this layer, and then evaluating the accuracy of the whole cascade forest up to this layer by cross validation, i.e., calculating the mean square error between the regression results of all random forests on this layer and the true values;
a second judging unit for judging whether the accuracy obtained by the second cascade regression forest construction unit improves on that of the previous layer; if so, execution returns to the second cascade regression forest construction unit; otherwise, the accuracy is judged to have reached its threshold, the number of layers of the deep random forest is no longer increased, and the construction and training process ends, yielding the deep-random-forest-based instruction vulnerability prediction model, whose prediction result is the average of the regression outputs of all random forests in the last layer of the cascade regression forest.
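The layer-growing loop described by these units can be sketched as follows. This is an illustrative sketch with scikit-learn, using gcForest-style out-of-fold predictions as the enhanced features, N = 4 forests per layer (two random forests and two extra-trees forests, as in claim 7), and the MSE-based stopping rule; all names and hyperparameters are our own assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

def train_cascade(X, y, max_layers=5):
    """Grow cascade layers; stop when cross-validated MSE stops improving."""
    layers, best_mse, X_in = [], np.inf, X
    for _ in range(max_layers):
        forests = [RandomForestRegressor(n_estimators=20, random_state=k)
                   for k in (0, 1)]
        forests += [ExtraTreesRegressor(n_estimators=20, random_state=k)
                    for k in (0, 1)]
        # out-of-fold predictions serve as the N enhanced features
        enhanced = np.column_stack(
            [cross_val_predict(f, X_in, y, cv=3) for f in forests])
        mse = mean_squared_error(y, enhanced.mean(axis=1))
        if mse >= best_mse:                    # accuracy no longer improves
            break
        best_mse = mse
        for f in forests:                      # refit on the full layer input
            f.fit(X_in, y)
        layers.append(forests)
        X_in = np.column_stack([X, enhanced])  # [v_input, v_enhanced]
    return layers

def predict_cascade(layers, X):
    X_in, out = X, None
    for forests in layers:
        out = np.column_stack([f.predict(X_in) for f in forests])
        X_in = np.column_stack([X, out])
    return out.mean(axis=1)                    # average of the last layer

# toy usage: one noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X[:, 0] + 0.1 * rng.normal(size=60)
layers = train_cascade(X, y)
pred = predict_cascade(layers, X)
```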
The prediction module is used for extracting the instruction feature vector of the target program to be predicted according to the working process of the first feature extraction module, and realizing instruction vulnerability prediction for the target program by combining it with the instruction vulnerability prediction model.
The present invention will be described in further detail with reference to examples.
Examples
Experimental environment configuration: Intel i7-8750H CPU, 16 GB memory, Ubuntu Linux 16.04 operating system. A portion of the test programs in the MiBench benchmark suite is randomly selected as the training set; instruction features of the source programs are extracted with an analysis pass based on the LLVM compiler to generate the instruction feature vectors x, and faults are injected into the training programs one by one with LLFI (an LLVM-based fault injection tool) to obtain the instruction SDC vulnerability values y. About 4300 samples are collected in total, with feature dimension n = 21.
Starting from the 10th sample, sliding sampling is performed sample by sample with the sliding window; two random forest regressors generate 60-dimensional expansion features, yielding expansion samples with 81-dimensional features and thus the expansion sample data set. The expansion sample data set is then used as the input vector of the deep random forest. Each layer uses four random forest regression models, identical in pairs, to generate a 4-dimensional enhancement vector, which is spliced with the initial 21 features into a 25-dimensional vector serving as the input of the next layer. After model training, accuracy is evaluated on the test set.
Isqrt (square root), FFT (Fourier transform), Dijkstra (shortest path planning), Bitstring (bit/string conversion), Qsort (quick sort), and Rad2deg (radian conversion) are selected from the MiBench suite as test programs. After feature extraction with LLVM, the trained prediction model performs SDC vulnerability prediction on each instruction of the test programs; the average vulnerability over all instructions of each program is computed and compared with the average predictions of other models, with the results shown in Figure 2. As the figure shows, the prediction of the present invention is closer to the true value on every test program, where Baseline denotes the instruction vulnerability reference value obtained by fault injection. Figure 3 compares the mean square error of the prediction results with other prediction methods; the method of the present invention achieves the smallest error on all test programs.
In conclusion, the method offers high prediction accuracy, low requirements on the sample set, and little manual tuning, and can be effectively applied to predicting the vulnerability of instructions to transient faults.
Claims (10)
1. An instruction vulnerability prediction method based on a deep random forest is characterized by comprising the following steps:
step 1, performing static analysis on the training program, extracting the instruction feature information related to the vulnerability of each program instruction, and generating the instruction feature vector V_features characterizing the instruction vulnerability of the corresponding program instruction;
step 2, performing fault injection on the training program to obtain the vulnerability value P_SDC(I_i) of each program instruction;
step 3, combining the instruction feature vectors V_features and the instruction vulnerability values P_SDC(I_i) to generate an instruction vulnerability sample data set D, wherein each sample S in the data set comprises the instruction feature vector V_features and the vulnerability value P_SDC(I_i) corresponding to a certain program instruction;
Step 4, sliding sampling is carried out on the instruction vulnerability sample data set D through a sliding window model, instruction sequence expansion characteristics of sample data are obtained, and an expansion sample data set is generated;
step 5, constructing and training an instruction vulnerability prediction model based on the deep random forest based on the extended sample data set;
and step 6, extracting the instruction feature vector of the target program to be predicted according to the process of step 1, and combining it with the instruction vulnerability prediction model obtained in step 5 to realize instruction vulnerability prediction for the target program.
2. The instruction vulnerability prediction method based on a deep random forest according to claim 1, wherein the instruction feature vector V_features characterizing the instruction vulnerability in step 1 is the following 7-tuple:
V_features = 〈V_tran_bran, V_comp, V_addr, V_mask, V_loop, V_arith, V_block〉
where V_tran_bran denotes branch- and transfer-related instruction features, including the branch-related feature f_is_branch, the function-call-related feature f_is_call, and the return instruction feature f_is_return; V_comp denotes compare-instruction-related features, including the integer compare instruction feature f_is_int_cmp and the floating-point compare instruction feature f_is_float_cmp; V_addr denotes address-instruction-related features, including the address-reference feature f_is_used_in_addr, the destination operand width feature f_dest_op_width, and the store instruction feature f_is_used_store; V_mask denotes fault-masking-related features, including the logical AND instruction feature f_is_and, the logical OR instruction feature f_is_or, and the logical shift instruction feature f_is_sh; V_loop denotes loop-related instruction features, including the loop position instruction feature f_is_loop and the loop depth feature f_loop_d; V_arith denotes arithmetic-operation-related features, including the addition/subtraction instruction feature f_is_add/sub and the multiplication/division instruction feature f_is_mul/div; and V_block denotes basic-block-related features, including the basic block length feature f_bb_length, the remaining-instructions-in-basic-block feature f_bb_remain_ins_num, the predecessor basic block count feature f_pred_bb_num, and the successor basic block count feature f_suc_bb_num.
3. The instruction vulnerability prediction method based on a deep random forest according to claim 1, wherein the vulnerability value P_SDC(I_i) of each program instruction in step 2 is obtained by the formula:
P_SDC(I_i) = (1/w) × Σ_{j=1..w} (M_j / F_j)
where I_i denotes the i-th program instruction, P_SDC(I_i) denotes the SDC vulnerability value of program instruction I_i, w denotes the bit width of the instruction's destination register, M_j denotes the number of SDC failures after fault injection at the j-th bit of I_i, and F_j denotes the total number of fault injections performed at the j-th bit of I_i.
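For illustration, the per-bit averaging in this formula can be computed as follows (a sketch; in practice the counts M_j and F_j come from a fault-injection campaign, e.g. with LLFI, and the function name is our own):

```python
def sdc_vulnerability(sdc_counts, injection_counts):
    """P_SDC(I_i) = (1/w) * sum over j of M_j / F_j, where w is the bit
    width of the destination register, M_j is the number of injections at
    bit j that produced an SDC, and F_j is the total injections at bit j."""
    w = len(sdc_counts)
    return sum(m / f for m, f in zip(sdc_counts, injection_counts)) / w

# a hypothetical 4-bit destination register with 10 injections per bit
p_sdc = sdc_vulnerability([2, 0, 5, 3], [10, 10, 10, 10])  # -> 0.25
```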
4. The method according to claim 1, wherein the step 4 is to perform sliding sampling on the instruction vulnerability sample data set D through a sliding window model to obtain an instruction sequence extension characteristic of sample data and generate an extension sample data set, and specifically includes:
let the initial value of m be 2, where m ∈ N*, 2 ≤ m ≤ p, p ∈ N*, and the value of p is user-defined;
step 4-1, constructing the sliding window model as follows:
W_m = m × n
where W_m is the width of the sliding window, n is the number of features of each sample S in the instruction vulnerability sample data set D, and n is also the sliding step length of the sliding window;
step 4-2, splicing M samples in the instruction vulnerability sample data set D into a new sample E_i, the feature number of E_i being M × n;
step 4-3, performing sliding sampling on E_i with the sliding window model to obtain M + 1 − m samples of size W_m;
step 4-4, training two random forest regression models on the M + 1 − m samples of size W_m to obtain 2(M + 1 − m) regression values as expansion features, wherein the label value used during training is the label value of a sample randomly selected from the M samples, that label value being the vulnerability value P_SDC(I_i);
step 4-5, increasing m by 1 and judging whether m is larger than p; if not, returning to step 4-1; otherwise, outputting the (p − 1) × (2M − p)-dimensional expansion features obtained over the whole loop;
and step 4-6, splicing the original n-dimensional features of each sample S with the (p − 1) × (2M − p)-dimensional expansion features to obtain an expansion sample with (p − 1) × (2M − p) + n dimensional features corresponding to sample S, thereby generating the expansion sample data set.
5. The method of claim 1, wherein p is 5 and M is 10.
6. The method for predicting the instruction vulnerability based on the deep random forest according to claim 1, wherein the step 5 of constructing and training the instruction vulnerability prediction model based on the deep random forest based on the extended sample data set specifically comprises:
step 5-1, constructing the first layer of the cascade regression forest, which comprises N random forests, and taking the expansion sample data set obtained in step 4 as the initial input vector of the deep random forest regression, thereby outputting an output vector comprising N enhanced features;
step 5-2, constructing the next layer of the cascade regression forest, splicing the output vector v_enhanced of the previous layer and the input vector v_input into [v_input, v_enhanced] as the input of this layer, and then evaluating the accuracy of the whole cascade forest up to this layer by cross validation, i.e., calculating the mean square error between the regression results of all random forests on this layer and the true values;
and step 5-3, judging whether the accuracy obtained in step 5-2 improves on that of the previous layer of the cascade regression forest; if so, returning to step 5-2; otherwise, judging that the accuracy has reached its threshold, no longer increasing the number of layers of the deep random forest, and ending the construction and training process to obtain the deep-random-forest-based instruction vulnerability prediction model, whose prediction result is the average of the regression outputs of all random forests in the last layer of the cascade regression forest.
7. The instruction vulnerability prediction method based on a deep random forest according to claim 6, wherein the first-layer cascade regression forest in step 5-1 comprises N random forests, specifically: the cascade regression forest of each layer comprises N = 4 random forests, namely two random forest regression models f_normal and two extreme random forest (extremely randomized trees) regression models f_extreme.
8. A system for instruction vulnerability prediction based on a deep random forest, the system comprising:
the first feature extraction module, used for performing static analysis on the training program, extracting the instruction feature information related to the vulnerability of each program instruction, and generating the instruction feature vector V_features characterizing the instruction vulnerability of the corresponding program instruction;
the second feature extraction module, used for performing fault injection on the training program to obtain the vulnerability value P_SDC(I_i) of each program instruction;
the first sample data set construction module, used for combining the instruction feature vectors V_features and the instruction vulnerability values P_SDC(I_i) to generate an instruction vulnerability sample data set D, wherein each sample S in the data set comprises the instruction feature vector V_features and the vulnerability value P_SDC(I_i) corresponding to a certain program instruction;
The second sample data set construction module is used for performing sliding sampling on the instruction vulnerability sample data set D through a sliding window model, obtaining the instruction sequence expansion characteristic of the sample data and generating an expansion sample data set;
the prediction model construction module is used for constructing and training an instruction vulnerability prediction model based on the deep random forest based on the extended sample data set;
and the prediction module is used for extracting the instruction feature vector of the target program to be predicted according to the working process of the first feature extraction module and realizing the instruction vulnerability prediction of the target program to be predicted by combining the instruction vulnerability prediction model.
9. The system of claim 8, wherein the second sample data set construction module comprises, executed in order:
a parameter initialization unit for initializing m = 2, where m ∈ N*, 2 ≤ m ≤ p, p ∈ N*, and the value of p is user-defined;
a sliding window model construction unit, configured to construct the sliding window model as follows:
W_m = m × n
where W_m is the width of the sliding window, n is the number of features of each sample S in the instruction vulnerability sample data set D, and n is also the sliding step length of the sliding window;
a sample splicing unit for splicing M samples in the instruction vulnerability sample data set D into a new sample E_i, the feature number of E_i being M × n;
a sliding sampling unit for performing sliding sampling on E_i with the sliding window model to obtain M + 1 − m samples of size W_m;
a training unit for training two random forest regression models on the M + 1 − m samples of size W_m to obtain 2(M + 1 − m) regression values as expansion features, wherein the label value used during training is the label value of a sample randomly selected from the M samples, that label value being the vulnerability value P_SDC(I_i);
a first judging unit for increasing m by 1 and judging whether m is larger than p; if not, execution returns to the sliding window model construction unit; otherwise, the (p − 1) × (2M − p)-dimensional expansion features obtained over the whole loop are output;
and an expansion sample data set construction unit for splicing the original n-dimensional features of each sample S with the (p − 1) × (2M − p)-dimensional expansion features to obtain an expansion sample with (p − 1) × (2M − p) + n dimensional features corresponding to the sample S, thereby generating the expansion sample data set.
10. The system of claim 8, wherein the prediction model building module comprises, performed in sequence:
a first cascade regression forest construction unit for constructing the first layer of the cascade regression forest, which comprises N random forests, and taking the expansion sample data set obtained by the second sample data set construction module as the initial input vector of the deep random forest regression, thereby outputting an output vector comprising N enhanced features;
a second cascade regression forest construction unit for constructing the next layer of the cascade regression forest, splicing the output vector v_enhanced of the previous layer and the input vector v_input into [v_input, v_enhanced] as the input of this layer, and then evaluating the accuracy of the whole cascade forest up to this layer by cross validation, i.e., calculating the mean square error between the regression results of all random forests on this layer and the true values;
and a second judging unit for judging whether the accuracy obtained by the second cascade regression forest construction unit improves on that of the previous layer; if so, execution returns to the second cascade regression forest construction unit; otherwise, the accuracy is judged to have reached its threshold, the number of layers of the deep random forest is no longer increased, and the construction and training process ends, yielding the deep-random-forest-based instruction vulnerability prediction model, whose prediction result is the average of the regression outputs of all random forests in the last layer of the cascade regression forest.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911248246.XA CN111159011B (en) | 2019-12-09 | 2019-12-09 | Instruction vulnerability prediction method and system based on deep random forest |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911248246.XA CN111159011B (en) | 2019-12-09 | 2019-12-09 | Instruction vulnerability prediction method and system based on deep random forest |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111159011A true CN111159011A (en) | 2020-05-15 |
CN111159011B CN111159011B (en) | 2022-05-20 |
Family
ID=70555803
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911248246.XA Active CN111159011B (en) | 2019-12-09 | 2019-12-09 | Instruction vulnerability prediction method and system based on deep random forest |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111159011B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610154A (en) * | 2021-08-06 | 2021-11-05 | 吉林大学 | GPGPU program SDC error detection method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124497A1 (en) * | 2015-10-28 | 2017-05-04 | Fractal Industries, Inc. | System for automated capture and analysis of business information for reliable business venture outcome prediction |
CN108334903A (en) * | 2018-02-06 | 2018-07-27 | 南京航空航天大学 | A kind of instruction SDC fragility prediction techniques based on support vector regression |
CN108491317A (en) * | 2018-02-06 | 2018-09-04 | 南京航空航天大学 | A kind of SDC error-detecting methods of vulnerability analysis based on instruction |
CN109063775A (en) * | 2018-08-03 | 2018-12-21 | 南京航空航天大学 | Instruction SDC fragility prediction technique based on shot and long term memory network |
US20190258807A1 (en) * | 2017-09-26 | 2019-08-22 | Mcs2, Llc | Automated adjusting of devices |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124497A1 (en) * | 2015-10-28 | 2017-05-04 | Fractal Industries, Inc. | System for automated capture and analysis of business information for reliable business venture outcome prediction |
US20190258807A1 (en) * | 2017-09-26 | 2019-08-22 | Mcs2, Llc | Automated adjusting of devices |
CN108334903A (en) * | 2018-02-06 | 2018-07-27 | 南京航空航天大学 | A kind of instruction SDC fragility prediction techniques based on support vector regression |
CN108491317A (en) * | 2018-02-06 | 2018-09-04 | 南京航空航天大学 | A kind of SDC error-detecting methods of vulnerability analysis based on instruction |
CN109063775A (en) * | 2018-08-03 | 2018-12-21 | 南京航空航天大学 | Instruction SDC fragility prediction technique based on shot and long term memory network |
Non-Patent Citations (1)
Title |
---|
Zhang Qianwen et al.: "Instruction SDC vulnerability analysis method based on machine learning", Journal of Chinese Computer Systems *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610154A (en) * | 2021-08-06 | 2021-11-05 | 吉林大学 | GPGPU program SDC error detection method and device |
CN113610154B (en) * | 2021-08-06 | 2023-12-29 | 吉林大学 | GPGPU program SDC error detection method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111159011B (en) | 2022-05-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6473884B1 (en) | Method and system for equivalence-checking combinatorial circuits using interative binary-decision-diagram sweeping and structural satisfiability analysis | |
Gong et al. | Automatic detection of infeasible paths in software testing | |
US7216318B1 (en) | Method and system for false path analysis | |
US10936474B2 (en) | Software test program generation | |
US8230382B2 (en) | Model based simulation of electronic discharge and optimization methodology for design checking | |
US11734480B2 (en) | Performance modeling and analysis of microprocessors using dependency graphs | |
US20190243930A1 (en) | Methods and Apparatus for Transforming the Function of an Integrated Circuit | |
US11409916B2 (en) | Methods and apparatus for removing functional bugs and hardware trojans for integrated circuits implemented by field programmable gate array (FPGA) | |
JP4750665B2 (en) | Timing analysis method and apparatus | |
CN111159011B (en) | Instruction vulnerability prediction method and system based on deep random forest | |
Rejimon et al. | An accurate probabilistic model for error detection | |
US6792581B2 (en) | Method and apparatus for cut-point frontier selection and for counter-example generation in formal equivalence verification | |
Ritter et al. | Formal verification of designs with complex control by symbolic simulation | |
JP5625297B2 (en) | Delay test apparatus, delay test method, and delay test program | |
US6760894B1 (en) | Method and mechanism for performing improved timing analysis on virtual component blocks | |
Ganai et al. | Completeness in SMT-based BMC for software programs | |
JP2001052043A (en) | Error diagnosis method and error site proving method for combinational verification | |
CN112162932B (en) | Symbol execution optimization method and device based on linear programming prediction | |
US10852354B1 (en) | System and method for accelerating real X detection in gate-level logic simulation | |
Liu et al. | Tbem: Testing-based gpu-memory consumption estimation for deep learning | |
CN113901479A (en) | Security assessment framework and method for transient execution attack dynamic attack link | |
Chockler et al. | Efficient automatic STE refinement using responsibility | |
US8527922B1 (en) | Method and system for optimal counterexample-guided proof-based abstraction | |
Wang et al. | Fast and accurate statistical static timing analysis | |
Oyeniran et al. | High-Level Fault Diagnosis in RISC Processors with Implementation-Independent Functional Test |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||