CN115878498A - Key byte extraction method for predicting program behavior based on machine learning - Google Patents


Info

Publication number
CN115878498A
Authority
CN
China
Prior art keywords: program, neural network, target program, input, test set
Prior art date
Legal status
Pending
Application number
CN202310195368.7A
Other languages
Chinese (zh)
Inventor
毛得明 (Mao Deming)
唐娜 (Tang Na)
吴春明 (Wu Chunming)
李芒 (Li Mang)
Current Assignee
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 30 Research Institute filed Critical CETC 30 Research Institute
Priority to CN202310195368.7A priority Critical patent/CN115878498A/en
Publication of CN115878498A publication Critical patent/CN115878498A/en
Pending legal-status Critical Current

Landscapes

  • Stored Programmes (AREA)

Abstract

The invention discloses a method for extracting key bytes by predicting program behavior with machine learning, which comprises the following steps: running an input seed file on a target program and generating a large test set X through fuzz-test mutation algorithms; performing code instrumentation on the target program to obtain a program PI, and running test set X on program PI to obtain a test set Y; taking the test set X-Y data pairs as training data and training a neural network model until the trained model can predict the behavior of the target program; and constructing a saliency map based on the trained neural network model and extracting key bytes from the saliency map, wherein a key byte is an input byte that affects target-program behavior. The invention effectively reduces the time overhead and performance overhead of program-behavior tracking.

Description

Key byte extraction method for predicting program behavior based on machine learning
Technical Field
The invention relates to the technical field of computer information security, and in particular to a method for extracting key bytes by predicting program behavior with machine learning.
Background
Key bytes are the input bytes that affect the behavior of the target program: given a set of inputs, the program behavior is observed and the bytes of the input that affect that behavior are deduced, i.e., the key bytes are extracted. Extracted key bytes can be widely applied to fields such as detection of system privacy-data leakage, vulnerability detection, and guided fuzz testing. To track program data flow, observe program behavior, and extract key bytes, taint-analysis techniques are generally used: they detect security problems by marking sensitive data in the system and tracking the propagation of the marked data through the program. However, as the program scale grows, the time cost of taint analysis multiplies exponentially, because the information flow from every taint source to every taint sink in the program must be tracked.
At present, most program-behavior tracking tools are built on taint-analysis tools such as Valgrind, Pin, and QEMU. James Newsome published TaintCheck, developed on Valgrind, which detects buffer-overflow vulnerabilities but ignores control-flow tracking. Wang Jiang proposed a QEMU-based offline dynamic taint-analysis method for binary programs, which extracts the running trace of a binary by modifying QEMU's decoding and execution mechanism, marks program inputs with a HOOK technique, builds a vulnerability model, and, while virtually replaying the program, completes offline trace analysis and vulnerability detection according to the propagation strategy and security-check strategy generated by the vulnerability model. However, all of the above methods consume excessive time.
Machine learning is a current research focus, and there is strong interest in introducing machine-learning methods into different fields to improve the state of the art. TaintInduce proposes learning the information-propagation rules of a specific platform from instruction inputs and outputs. It learns the propagation rules from a template and uses an algorithm to reduce the task to the prerequisite of learning only different input sets and information-propagation labels. By learning propagation rules with machine learning, TaintInduce improves the accuracy of individual propagation rules, but because of its propagation-based design it still suffers from high false alarms and high overhead when tracking program behavior.
Disclosure of Invention
In view of this, the present invention provides a method for extracting key bytes by predicting program behavior with machine learning, so as to solve the above technical problems.
The invention discloses a method for extracting key bytes by predicting program behavior with machine learning, which comprises the following steps:
step 1: running an input seed file on a target program and generating a large test set X through fuzz-test mutation algorithms; performing code instrumentation on the target program to obtain a program PI, and running test set X on program PI to obtain a test set Y;
step 2: taking the test set X-Y data pairs as training data and training a neural network model until the trained model can predict the behavior of the target program; wherein test set X is the input data of the neural network, test set Y is the label data of the neural network, and the test set X-Y data pairs consist of test set X and test set Y;
step 3: constructing a saliency map based on the trained neural network model and extracting key bytes from the saliency map; wherein a key byte is an input byte that affects target-program behavior.
Further, the step 1 comprises:
step 11: the fuzz test takes the provided seed file as input, runs a large number of mutated inputs on the target program, and checks whether relevant results occur after the runs; wherein the relevant results include the target program crashing and a new execution path being found;
step 12: performing basic-block-level code instrumentation on the target program to obtain a program PI, and running test set X on program PI to obtain the execution paths of the target program, namely test set Y.
Further, in step 11, to ensure that the length of the inputs in test set X is unchanged, the three fuzz-test mutation algorithms bitflip, arithmetic, and interest are selected to generate a large test set X.
Further, the step 12 includes:
step 121: defining a function IFunc for insertion, wherein the function IFunc is inserted before each basic block, and when the basic block is reached during execution of the target program, the function IFunc is first called to output the number of the basic block and the function in which it is located;
step 122: acquiring a target program;
step 123: initializing the number value num of the basic block, wherein the number value starts from 1;
step 124: traversing a function F of the target program;
step 125: traversing each basic block in the function F;
step 126: calling the inserted function IFunc;
step 127: adding 1 to the number value num of the basic block;
step 128: judging whether the traversal of the function F is finished, if not, executing the step 125, and if so, executing the step 129;
step 129: executing test set X with program PI, wherein program PI outputs the position, number, and file name of each executed basic block, thereby obtaining the execution paths of the target program, namely test set Y.
Further, in the step 2:
the neural network model learns the data-flow propagation of different inputs in the target program by observing a large number of test set X-Y data pairs from the execution traces of the target program, thereby simulating the processing logic of the target program; the neural network model takes the program input as the model input and predicts the execution path of the target program.
Further, given a test set X-Y data pair consisting of an input $x_i$ and the corresponding target-program execution path $y_i$, the output of the neural network model is $\hat{y}_i$:

$h = \mathrm{ReLU}(W_k x_i + b_k)$  (1)

$\hat{y}_i = F(x_i;\theta) = \sigma(W_{k+1} h + b_{k+1})$  (2)

wherein $x_i$ denotes the $i$-th item of test set X, $y_i$ denotes the $i$-th item of test set Y, $h$ denotes the output vector of the hidden layer of the neural network, $\mathrm{ReLU}(\cdot)$ denotes the ReLU activation function, $W_k$ and $b_k$ denote the trainable parameters of each layer, $k$ denotes the layer index, $F(x_i)$ denotes the neural network model applied to the test data $x_i$, $\theta$ denotes the trainable weight parameters of the neural network model, and $\sigma(\cdot)$ denotes the sigmoid function.
Further, the neural network model comprises an input layer, a hidden layer and an output layer; the input layer is connected with the output layer through the hidden layer.
Further, the step 3 comprises:
step 31: calculating partial derivatives of the execution path predicted by the trained neural network model relative to the test set X;
step 32: constructing a saliency map based on the partial derivatives;
step 33: key bytes are extracted from the saliency map.
Further, the step 31 specifically includes:
let $F(x)$ denote the computation of the output-variable values of the neural network for an input $x$; the partial derivative of $F$ with respect to the input $x$ is defined as:

$J_F(x) = \dfrac{\partial F(x)}{\partial x} = \left[\dfrac{\partial F_j(x)}{\partial x_n}\right]$  (3)

wherein $X$ denotes the input data of the neural network, i.e. test set X, $x_n$ denotes the $n$-th byte of the input $x$, and the partial derivatives $\partial F_j(x)/\partial x_n$ form the Jacobian matrix of the neural network function, each element of which is the gradient of an output $F_j(x)$ with respect to the $n$-th byte of the input $x$.
Further, the step 32 specifically includes:
the saliency map $S(x)$ is defined as:

$S(x)_n = \sum_{j} \dfrac{\partial F_j(x)}{\partial x_n}$  (4)

wherein $F_j(x)$ is the $j$-th prediction output of $F(x)$; the sum of the derivatives of all outputs with respect to the $n$-th byte of the input represents the influence of the $n$-th byte on the behavior of the currently executed target program, and the larger its value, the larger the influence;
the step 33 specifically includes:

$K = \arg\,\mathrm{top\_k}\big(S(x)\big)$  (5)

wherein $K$ is the set of key bytes, i.e. the important bytes in the input field that affect the execution path of the target program; $\mathrm{top\_k}$ denotes the function that selects the $k$ largest elements from a vector, and $\arg$ denotes the function that returns the indices of the selected elements.
Owing to the adoption of the above technical scheme, the invention has the following advantages:
the invention predicts program behavior by letting a machine-learning model simulate and learn the different behaviors of the program, realizing lightweight and accurate end-to-end information-flow tracking. Compared with traditional program tracking using taint-analysis tools, the method effectively reduces the time overhead and performance overhead of program-behavior tracking. Using the trained model to guide subsequent work such as fuzz testing and vulnerability mining can greatly improve working efficiency and save analysis time.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below cover only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a method for extracting key bytes based on machine learning prediction program behavior according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the instrumentation logic according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the instrumentation flow of an embodiment of the present invention;
FIG. 4 is a diagram of a neural network model architecture according to an embodiment of the present invention;
FIG. 5 is a key byte diagram according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and examples. It should be understood that the described examples are only some of the possible embodiments and are not intended to limit the invention to the embodiments described herein. All other embodiments obtainable by those of ordinary skill in the art are intended to fall within the scope of the present invention.
The technical problems to be solved by the invention are as follows:
(1) Accuracy problem of key byte extraction
When key bytes are extracted with taint analysis, variables that have no data or control dependency on program behavior are often marked as tainted, producing false positives. For example, with s = a + b and t = s - b, if b is marked as tainted, then rule-based propagation also marks s and t as tainted; however, b cannot actually affect t (t equals a), so too many input bytes are classified as key bytes, causing false positives. Conversely, variables that do have data or control dependencies on program behavior may not be marked as tainted, producing false negatives. For example, with "if b > 1 then a = 5", if b is marked as tainted and its value is greater than 1, a is not marked because a has no direct data flow from b, whereas in fact the value of a depends on b; input bytes that should be key bytes are thus ignored, causing false negatives. Both conditions reduce the accuracy of key-byte extraction.
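To make the two failure modes concrete, the following minimal Python sketch (purely illustrative, not part of the claimed method; the variable names and propagation rule are hypothetical) shows how rule-based propagation over-taints t in the first example and misses the implicit dependency of a on b in the second:

```python
# Minimal sketch of the two failure modes of rule-based taint propagation
# described above (illustrative only).

tainted = {"b"}  # b is marked as a taint source

def propagate_assign(dst, srcs):
    """Rule-based propagation: taint dst if any source operand is tainted."""
    if any(v in tainted for v in srcs):
        tainted.add(dst)

# Case 1: over-tainting (false positive).
# s = a + b and t = s - b both receive the taint, although t == a and does
# not actually depend on b.
propagate_assign("s", ["a", "b"])   # s = a + b  -> tainted
propagate_assign("t", ["s", "b"])   # t = s - b  -> tainted, yet t equals a
print("over-taint:", "t" in tainted)    # True  (false positive)

# Case 2: under-tainting (false negative).
# The implicit flow "if b > 1: a = 5" assigns a constant, so no tainted
# operand appears on the right-hand side and a is never marked, although
# the value of a is controlled by b.
propagate_assign("a", [])           # a = 5 inside "if b > 1"
print("under-taint:", "a" in tainted)   # False (false negative)
```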
(2) System resource and time overhead is excessive
When taint analysis is used to extract key bytes, code that collects information is inserted without breaking the original logic of the target program, so as to obtain information about the program run, and a shadow memory is added alongside the original data to represent the taint state of registers and memory. This approach obtains detailed information about program execution through instrumentation and has high analysis precision, but the frequent instrumentation operations and the shadow-memory design occupy a large amount of system resources and increase the time overhead, and this overhead grows exponentially as the program scale expands.
Referring to fig. 1, the present invention provides an embodiment of a method for extracting key bytes by predicting program behavior with machine learning. The embodiment uses a neural network as the machine-learning model. Taint-analysis techniques track the propagation of marked data through a program at a large cost in time and resources, whereas a neural network can predict program behavior by learning the program's different behaviors; the influence of taint sources on taint sinks in the program is computed by gradient analysis, thereby achieving lightweight end-to-end information-flow tracking.
The overall framework logic is built around the model and can be divided into 3 steps.
Step 1: running an input seed file on a target program and generating a large test set X through fuzz-test mutation algorithms; performing code instrumentation on the target program to obtain a program PI, and running test set X on program PI to obtain a test set Y.
the fuzzy test takes the provided seed file as input, carries out a large amount of variation operations, checks whether the running results cause the crash of the target program, discovers a new execution path and the like. The mutation operation of the fuzz test generally comprises the following 6 types:
Table 1. Fuzz-test mutation operations
No.  Name        Description                                                              Length change
1    bitflip     Flip individual bits (1 to 0, 0 to 1)                                    Unchanged
2    arithmetic  Integer add/subtract operations                                          Unchanged
3    interest    Replace contents of the original file with special values               Unchanged
4    dictionary  Replace/insert automatically generated or user-provided tokens           Changed
5    havoc       Apply a large number of random mutations to the original file            Changed
6    splice      Splice two files to obtain a new file                                    Changed
To ensure that the length of the inputs in test set X is unchanged, the three fuzz-test mutation algorithms bitflip, arithmetic, and interest are selected to generate a large test set; code instrumentation is then performed on the target program, test set X is run on the instrumented program to obtain the program's path-execution information, and test set Y is collected.
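As a rough illustration of this step, the following Python sketch generates a length-preserving test set from a seed file using the three mutation operators named above; the operator parameters, the set of "interesting" values, and the file name are assumptions, not the patent's exact implementation:

```python
import random

# Sketch of length-preserving mutations (bitflip, arithmetic, interest) used to
# generate test set X from a seed file; all parameter choices are illustrative.

INTERESTING_8 = [0x00, 0x01, 0x7F, 0x80, 0xFF]  # assumed "interesting" byte values

def bitflip(data: bytearray) -> bytearray:
    out = bytearray(data)
    bit = random.randrange(len(out) * 8)
    out[bit // 8] ^= 1 << (bit % 8)          # flip one bit: 1 <-> 0
    return out

def arithmetic(data: bytearray) -> bytearray:
    out = bytearray(data)
    i = random.randrange(len(out))
    out[i] = (out[i] + random.choice([-1, 1]) * random.randint(1, 35)) % 256
    return out

def interest(data: bytearray) -> bytearray:
    out = bytearray(data)
    out[random.randrange(len(out))] = random.choice(INTERESTING_8)
    return out

def generate_test_set(seed: bytes, n: int) -> list:
    ops = [bitflip, arithmetic, interest]
    return [bytes(random.choice(ops)(bytearray(seed))) for _ in range(n)]

# Example (file name is hypothetical):
# X = generate_test_set(open("seed.pdf", "rb").read(), 10000)
```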
In code instrumentation, logic code that performs certain functions is inserted before and after the instructions to be observed or processed in the assembly code of the target program, as shown in fig. 2. Code instrumentation can typically be performed at 3 granularities: instruction level, basic-block level, and function level. This patent chooses basic-block-level code instrumentation of the target program.
A basic block is a sequence of program statements with only one entry and one exit; a function is generally divided into several basic blocks by compare-and-jump instruction sequences (e.g., a "CMP" followed by a conditional jump), and if the first instruction of a basic block is executed, the remaining instructions of that block are executed as well. Compared with instruction-level instrumentation, instrumenting basic blocks saves time and reduces program size; compared with function-level instrumentation, instrumenting basic blocks improves accuracy. The instrumentation process for basic blocks is shown in fig. 3 and comprises the following steps:
1) Define a function IFunc for insertion; IFunc is an output function inserted before each basic block, and when the basic block is reached during program execution, IFunc is first called to output the number of the basic block and the function in which it is located;
2) Acquiring a target program;
3) Initializing the number value num of the basic block, wherein the number value starts from 1;
4) Function F of traversing target program
5) Traversing each basic block in the function F;
6) Calling the inserted function IFunc;
7) The number value num of the basic block is added with 1;
8) And after the traversal of the functions is finished, outputting the basic block quantity information corresponding to each function of the program.
Test set X is then executed with the instrumented program, which outputs the position and information of each executed basic block; this yields a test set Y containing the program execution-path information and provides a test set of X-Y data pairs for model training.
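A minimal Python sketch of this collection step is given below; it assumes the instrumented program PI reads its input from stdin and that IFunc prints one line per executed basic block ending with the block number, both of which are illustrative assumptions rather than details stated in the text:

```python
import subprocess

# Sketch of collecting execution-path information: run the instrumented program
# PI on one input from X and record which basic blocks were executed.

def run_instrumented(pi_path: str, input_bytes: bytes) -> set:
    # Assumption: PI reads its input from stdin; adapt if it takes a file path.
    proc = subprocess.run([pi_path], input=input_bytes,
                          capture_output=True, timeout=10)
    executed = set()
    for line in proc.stdout.decode(errors="ignore").splitlines():
        parts = line.split()
        # Assumption: IFunc prints "<function_name> <basic_block_number>".
        if len(parts) >= 2 and parts[-1].isdigit():
            executed.add(int(parts[-1]))
    return executed

# Example (paths are hypothetical):
# paths = [run_instrumented("./program_pi", x) for x in X]
```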
Step 2: taking the test set X-Y data pairs as training data and training the neural network model until the trained model can predict the behavior of the target program; wherein test set X is the input data of the neural network, test set Y is the label data of the neural network, and the test set X-Y data pairs consist of test set X and test set Y.
the test set X is the input of the target program, the test set Y is the execution path of the target program, the input of the target program is usually user input, files or user privacy character strings, and in order to facilitate the understanding and the recognition of the model, the method converts the byte sequence into a bounded numerical value vector with the range of [0,255 ]. The method processes the information, normalizes the execution path variable by binary data, indicates that the basic block is executed by 1, indicates that the basic block is not executed by 0, and uniformly normalizes the test set Y into 01 character strings with the same length so as to ensure the rapid convergence of the model.
The method uses a neural network to build the training model, which consists of 3 fully connected layers: an input layer, a hidden layer, and an output layer. The hidden layer uses ReLU as the activation function for its 4096 hidden units, and the output layer uses sigmoid as the activation function to predict the output variables.
The model learns the propagation process of the information flow by observing a large number of X-Y pairs from the program execution traces. The detailed architecture of the model, which takes the program input as the model input and predicts the program execution path, is shown in fig. 4. Given a specific input $x_i$ of a particular program and the corresponding program execution path $y_i$, the execution path predicted by the model is $\hat{y}_i$, computed as:

$h = \mathrm{ReLU}(W_k x_i + b_k)$  (1)

$\hat{y}_i = F(x_i;\theta) = \sigma(W_{k+1} h + b_{k+1})$  (2)

wherein $x_i$ denotes the $i$-th item of test set X, $y_i$ denotes the $i$-th item of test set Y, $h$ denotes the output vector of the hidden layer of the neural network, $\mathrm{ReLU}(\cdot)$ denotes the ReLU activation function, $W_k$ and $b_k$ denote the trainable parameters of each layer, $k$ denotes the layer index, $F(x_i)$ denotes the neural network model applied to the test data $x_i$, $\theta$ denotes the trainable weight parameters of the neural network model, and $\sigma(\cdot)$ denotes the sigmoid function.
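A minimal PyTorch sketch of the model described by equations (1)-(2), together with a training loop, is given below; the loss function, optimizer, and training schedule are assumptions, since the text only specifies the layer sizes and activation functions:

```python
import torch
import torch.nn as nn

# Sketch of the fully connected model of equations (1)-(2): one hidden layer of
# 4096 ReLU units and a sigmoid output per basic block.

class PathPredictor(nn.Module):
    def __init__(self, input_len: int, num_blocks: int, hidden: int = 4096):
        super().__init__()
        self.hidden = nn.Linear(input_len, hidden)   # W_k, b_k in eq. (1)
        self.out = nn.Linear(hidden, num_blocks)     # output-layer weights in eq. (2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(x))               # eq. (1)
        return torch.sigmoid(self.out(h))            # eq. (2)

# Assumed training loop: binary cross-entropy over the 0/1 path labels.
def train(model, X, Y, epochs: int = 50, lr: float = 1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), Y)   # X, Y are float tensors built from the encodings
        loss.backward()
        opt.step()
    return model
```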
After the neural network model is trained, the method analyzes the information flow in the target program by constructing a saliency map, as detailed in step 3.
Step 3: constructing a saliency map based on the trained neural network model and extracting key bytes from the saliency map; wherein a key byte is an input byte that affects target-program behavior.
A key byte is an input byte that affects the behavior of the target program. As shown in fig. 5, the red portion is a key-byte diagram of a pdf file; assuming the input x of the target program has length m, the key bytes are the part of these m bytes that affects the program execution path. This patent uses gradient analysis and a saliency map to compute the key bytes in the taint data. The saliency map is a gradient-based attribution method; compared with other gradient-based methods, it focuses on the sensitivity of the neural network output to each feature, i.e., how the output changes with respect to a minute change of the input. The saliency-map method is chosen here because the goal is to infer from the neural network which bytes of the input affect the execution path of the target program, i.e., produce the greatest sensitivity in the network's output.
To extract the key bytes, the partial derivatives of the execution path predicted by the trained neural network model with respect to test set X are first computed. Let $F(x)$ denote the computation of the output-variable values of the neural network for an input $x$ during execution of the target program. The partial derivative of $F$ with respect to a given input $x$ is defined as:

$J_F(x) = \dfrac{\partial F(x)}{\partial x} = \left[\dfrac{\partial F_j(x)}{\partial x_n}\right]$  (3)

wherein $X$ denotes the input data of the neural network, i.e. test set X, $x_n$ denotes the $n$-th byte of the input $x$, and the partial derivatives $\partial F_j(x)/\partial x_n$ form the Jacobian matrix of the neural network function, each element of which is the gradient of an output $F_j(x)$ with respect to the $n$-th byte of the input $x$. A saliency map is then constructed from these partial derivatives. The saliency map $S(x)$ is a vector defined as:

$S(x)_n = \sum_{j} \dfrac{\partial F_j(x)}{\partial x_n}$  (4)

wherein $F_j(x)$ is the $j$-th prediction output of the neural network model; the sum of the derivatives of all outputs with respect to the $n$-th byte of the input measures the influence of the $n$-th byte on the behavior of the currently executed program, and the larger this value, the larger the influence. The flow of program-execution information can be analyzed with the saliency map. After the saliency map is generated, the key bytes are finally extracted. Let $\mathrm{top\_k}$ denote the function that selects the $k$ largest elements of a vector and $\arg$ the function that returns the indices of the selected elements:

$K = \arg\,\mathrm{top\_k}\big(S(x)\big)$  (5)

wherein $K$ is the set of key bytes, i.e. the important bytes in the input field that affect the execution path of the target program.
Most program behaviors of parser programs are determined by bytes at fixed input positions, namely the fixed positions of the file-format header, rather than by the file content. After analyzing several file formats, the total number of key bytes of a file-parsing program was found to lie between 250 and 500, about 5% of the total input bytes. In practice, a threshold of 5% of the input length can therefore be selected as the value of $k$ used to compute the key bytes, and this value can be modified according to the actual situation.
Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which are to be covered by the claims.

Claims (10)

1. A method for extracting key bytes by predicting program behavior with machine learning, characterized by comprising the following steps:
step 1: running an input seed file on a target program and generating a large test set X through fuzz-test mutation algorithms; performing code instrumentation on the target program to obtain a program PI, and running test set X on program PI to obtain a test set Y;
step 2: taking the test set X-Y data pairs as training data and training a neural network model until the trained model can predict the behavior of the target program; wherein test set X is the input data of the neural network, test set Y is the label data of the neural network, and the test set X-Y data pairs consist of test set X and test set Y;
step 3: constructing a saliency map based on the trained neural network model and extracting key bytes from the saliency map; wherein a key byte is an input byte that affects target-program behavior.
2. The method of claim 1, wherein step 1 comprises:
step 11: the fuzz test takes the provided seed file as input, runs a large number of mutated inputs on the target program, and checks whether relevant results occur after the runs; wherein the relevant results include the target program crashing and a new execution path being found;
step 12: performing basic-block-level code instrumentation on the target program to obtain a program PI, and running test set X on program PI to obtain the execution paths of the target program, namely test set Y.
3. The method according to claim 2, wherein, in step 11, in order to ensure that the length of the inputs in test set X is unchanged, the three fuzz-test mutation algorithms bitflip, arithmetic, and interest are selected to generate a large test set X.
4. The method of claim 2, wherein step 12 comprises:
step 121: defining a function IFunc for insertion, wherein the function IFunc is inserted before each basic block, and when the basic block is reached during execution of the target program, the function IFunc is first called to output the number of the basic block and the function in which it is located;
step 122: acquiring a target program;
step 123: initializing the number value num of the basic block, wherein the number value starts from 1;
step 124: traversing a function F of the target program;
step 125: traversing each basic block in the function F;
step 126: calling the inserted function IFunc;
step 127: the number value num of the basic block is added with 1;
step 128: judging whether the traversal of the function F is finished, if not, executing the step 125, and if so, executing the step 129;
step 129: executing test set X with program PI, wherein program PI outputs the position, number, and file name of each executed basic block, thereby obtaining the execution paths of the target program, namely test set Y.
5. The method according to claim 1, characterized in that in step 2:
the neural network model learns the data-flow propagation of different inputs in the target program by observing a large number of test set X-Y data pairs from the execution traces of the target program, thereby simulating the processing logic of the target program; the neural network model takes the program input as the model input and predicts the execution path of the target program.
6. The method of claim 5, wherein, given a test set X-Y data pair consisting of an input $x_i$ and the corresponding target-program execution path $y_i$, the output of the neural network model is $\hat{y}_i$:

$h = \mathrm{ReLU}(W_k x_i + b_k)$  (1)

$\hat{y}_i = F(x_i;\theta) = \sigma(W_{k+1} h + b_{k+1})$  (2)

wherein $x_i$ denotes the $i$-th item of test set X, $y_i$ denotes the $i$-th item of test set Y, $h$ denotes the output vector of the hidden layer of the neural network, $\mathrm{ReLU}(\cdot)$ denotes the ReLU activation function, $W_k$ and $b_k$ denote the trainable parameters of each layer, $k$ denotes the layer index, $F(x_i)$ denotes the neural network model applied to the test data $x_i$, $\theta$ denotes the trainable weight parameters of the neural network model, and $\sigma(\cdot)$ denotes the sigmoid function.
7. The method of any one of claims 1-6, wherein the neural network model comprises an input layer, a hidden layer, and an output layer; the input layer is connected with the output layer through the hidden layer.
8. The method of claim 6, wherein step 3 comprises:
step 31: calculating partial derivatives of the execution path predicted by the trained neural network model relative to the test set X;
step 32: constructing a saliency map based on the partial derivatives;
step 33: key bytes are extracted from the saliency map.
9. The method according to claim 8, wherein the step 31 specifically comprises:
let $F(x)$ denote the computation of the output-variable values of the neural network for an input $x$; the partial derivative of $F$ with respect to the input $x$ is defined as:

$J_F(x) = \dfrac{\partial F(x)}{\partial x} = \left[\dfrac{\partial F_j(x)}{\partial x_n}\right]$  (3)

wherein $X$ denotes the input data of the neural network, i.e. test set X, $x_n$ denotes the $n$-th byte of the input $x$, and the partial derivatives $\partial F_j(x)/\partial x_n$ form the Jacobian matrix of the neural network function, each element of which is the gradient of an output $F_j(x)$ with respect to the $n$-th byte of the input $x$.
10. The method according to claim 9, wherein the step 32 specifically comprises:
the saliency map $S(x)$ is defined as:

$S(x)_n = \sum_{j} \dfrac{\partial F_j(x)}{\partial x_n}$  (4)

wherein $F_j(x)$ is the $j$-th prediction output of $F(x)$; the sum of the derivatives of all outputs with respect to the $n$-th byte of the input represents the influence of the $n$-th byte on the behavior of the currently executed target program, and the larger its value, the larger the influence;
the step 33 specifically comprises:

$K = \arg\,\mathrm{top\_k}\big(S(x)\big)$  (5)

wherein $K$ is the set of key bytes, i.e. the important bytes in the input field that affect the execution path of the target program; $\mathrm{top\_k}$ denotes the function that selects the $k$ largest elements from a vector, and $\arg$ denotes the function that returns the indices of the selected elements.
CN202310195368.7A 2023-03-03 2023-03-03 Key byte extraction method for predicting program behavior based on machine learning Pending CN115878498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310195368.7A CN115878498A (en) 2023-03-03 2023-03-03 Key byte extraction method for predicting program behavior based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310195368.7A CN115878498A (en) 2023-03-03 2023-03-03 Key byte extraction method for predicting program behavior based on machine learning

Publications (1)

Publication Number Publication Date
CN115878498A true CN115878498A (en) 2023-03-31

Family

ID=85761904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310195368.7A Pending CN115878498A (en) 2023-03-03 2023-03-03 Key byte extraction method for predicting program behavior based on machine learning

Country Status (1)

Country Link
CN (1) CN115878498A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775127A (en) * 2023-05-25 2023-09-19 哈尔滨工业大学 Static symbol execution pile inserting method based on RetroWrite framework

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440201A (en) * 2013-09-05 2013-12-11 北京邮电大学 Dynamic taint analysis device and application thereof to document format reverse analysis
CN112463638A (en) * 2020-12-11 2021-03-09 清华大学深圳国际研究生院 Fuzzy test method based on neural network and computer readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440201A (en) * 2013-09-05 2013-12-11 北京邮电大学 Dynamic taint analysis device and application thereof to document format reverse analysis
CN112463638A (en) * 2020-12-11 2021-03-09 清华大学深圳国际研究生院 Fuzzy test method based on neural network and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DONGDONG SHE等: "Neutaint: Efficient Dynamic Taint Analysis with Neural Networks", 《2020 IEEE SYMPOSIUM ON SECURITY AND PRIVACY (SP)》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775127A (en) * 2023-05-25 2023-09-19 哈尔滨工业大学 Static symbol execution pile inserting method based on RetroWrite framework
CN116775127B (en) * 2023-05-25 2024-05-28 哈尔滨工业大学 Static symbol execution pile inserting method based on RetroWrite frames

Similar Documents

Publication Publication Date Title
CN112733137B (en) Binary code similarity analysis method for vulnerability detection
CN111125716B (en) Method and device for detecting Ethernet intelligent contract vulnerability
US7340475B2 (en) Evaluating dynamic expressions in a modeling application
CN109977682A (en) A kind of block chain intelligence contract leak detection method and device based on deep learning
CN107169358A (en) Code homology detection method and its device based on code fingerprint
CN105808438B (en) A kind of Reuse of Test Cases method based on function call path
CN114297654A (en) Intelligent contract vulnerability detection method and system for source code hierarchy
CN116049831A (en) Software vulnerability detection method based on static analysis and dynamic analysis
CN110162972B (en) UAF vulnerability detection method based on statement joint coding deep neural network
CN110096439A (en) A kind of method for generating test case towards solidity language
CN115033895B (en) Binary program supply chain safety detection method and device
CN113326187A (en) Data-driven intelligent detection method and system for memory leakage
CN112364352A (en) Interpretable software vulnerability detection and recommendation method and system
CN115878498A (en) Key byte extraction method for predicting program behavior based on machine learning
CN115455382A (en) Semantic comparison method and device for binary function codes
CN116150757A (en) Intelligent contract unknown vulnerability detection method based on CNN-LSTM multi-classification model
EP1025492A1 (en) Method for the generation of isa simulators and assemblers from a machine description
CN113127933A (en) Intelligent contract Pompe fraudster detection method and system based on graph matching network
CN115576840A (en) Static program pile insertion detection method and device based on machine learning
CN117573142B (en) JAVA code anti-obfuscator based on simulation execution
Zhao et al. Suzzer: A vulnerability-guided fuzzer based on deep learning
CN117725592A (en) Intelligent contract vulnerability detection method based on directed graph annotation network
CN113886832A (en) Intelligent contract vulnerability detection method, system, computer equipment and storage medium
CN117591913A (en) Statement level software defect prediction method based on improved R-transducer
CN110955892B (en) Hardware Trojan horse detection method based on machine learning and circuit behavior level characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230331