CN115576840A - Static program instrumentation detection method and device based on machine learning

Info

Publication number: CN115576840A
Application number: CN202211357366.5A
Authority: CN
Inventors: 刘昱玮; 余媛萍; 贾相堃; 苏璞睿
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2022-11-01
Filing date: 2022-11-01
Publication date: 2023-01-06
Anticipated expiration: 2042-11-01
Also published as: CN115576840B

Abstract

The invention discloses a program instrumentation detection method and device based on machine learning, and belongs to the technical field of network security. The method comprises the following steps: acquiring a target program; performing instrumentation on the target program to obtain a binary file; converting the binary file into an intermediate language representation; calculating the feature vector of each basic block in the binary file based on the intermediate language representation, and sending the feature vector to a machine learning model to identify the actual instrumentation result of each basic block; generating a code attribute graph based on the target program and the intermediate language representation, learning the code attribute graph by using a graph neural network, and performing linear transformation on the learned node feature vectors to judge the expected instrumentation result of each basic block according to the score of each node; and obtaining the instrumentation detection result of the target program according to the actual instrumentation result and the expected instrumentation result of each basic block. The invention can evaluate the accuracy of existing static program instrumentation methods.

Description

Static program instrumentation detection method and device based on machine learning

Technical Field

The invention belongs to the technical field of network security, and particularly relates to a program instrumentation detection method and device based on machine learning.

Background

Program instrumentation refers to the process of inserting, into a target program, code segments that feed back specific information or perform specific functions without breaking the integrity of the target program's original running logic. Program instrumentation can provide information support for program analysis and vulnerability mining methods, including taint analysis, symbolic execution, and fuzz testing. Among instrumentation approaches, static program instrumentation is currently the most popular because of its efficiency advantages.

Existing improvements to static program instrumentation focus on increasing the amount of information acquired through instrumentation and on improving the execution efficiency of instrumentation. For example, Angora, developed by Chen Peng of ShanghaiTech University, and TortoiseFuzz, developed by Wang Yanhao of the University of Chinese Academy of Sciences, improve vulnerability mining capability by acquiring more information about a program while it runs, including the program running context, the number of program memory accesses, and the like; SelectiveTaint, developed by Chen Sanchuan et al. of Ohio State University, improves the execution efficiency of taint analysis by inserting taint analysis instrumentation code only for certain instructions.

While the above improvements increase the capability and efficiency of program instrumentation, they also place higher accuracy requirements on it, and to date no research has addressed the accuracy of static program instrumentation.

Disclosure of Invention

Aiming at the problem that existing research on static program instrumentation does not address its accuracy, the invention aims to provide a static program instrumentation error detection method based on machine learning. The method extracts the code features of the basic blocks of a target program through static analysis, inputs the extracted basic block features into word2vec to identify the instrumentation situation in each basic block, and converts the program into a code attribute graph that is input into a graph neural network to identify instrumentation errors.

The technical content of the invention comprises:

A program instrumentation detection method based on machine learning, the method comprising:

acquiring a target program, and performing instrumentation on the target program to obtain a binary file;

converting the binary file into an intermediate language representation;

calculating a feature vector of each basic block in the binary file based on the intermediate language representation, and sending the feature vector to a machine learning model to identify an actual instrumentation result of each basic block;

generating a code attribute graph based on the target program and the intermediate language representation, learning the code attribute graph by using a graph neural network, and performing linear transformation on the learned node feature vectors to judge the expected instrumentation result of each basic block according to the score of each node; the nodes in the code attribute graph are basic blocks, and edges in the code attribute graph are constructed based on the dependency relationship between control flows and data flows among the basic blocks;

and obtaining the instrumentation detection result of the target program according to the actual instrumentation result and the expected instrumentation result of each basic block.

Further, the intermediate language representation includes: LLVM IR intermediate language representation.

Further, the calculating a feature vector of each basic block in the binary file based on the intermediate language includes:

selecting features in each basic block based on the intermediate language; the features include: instruction operation codes, instruction operands, and instruction sequences; the instruction sequence is a sequence in which the instruction operation codes are ordered according to the appearance sequence of the instructions in the basic block;

acquiring a serial number of the instruction operation code according to an operation code coding table, and acquiring a feature vector of the instruction operation code based on the serial number; the operation code coding table is constructed based on the occurrence frequency and the alphabetical order of each instruction operation code in the training set;

acquiring a feature vector of the instruction operand according to the operand length of the instruction operand;

calculating a feature vector of the instruction sequence according to the feature vector of each instruction operation code in the instruction sequence;

and obtaining the feature vector of the basic block based on the feature vector of the instruction operation code, the feature vector of the instruction operand and the feature vector of the instruction sequence.

Further, the operation code includes: alloca, store, load, and icmp.

Further, the operands include: immediate and variable.

Further, the generating a code attribute graph of the binary file based on the target program and the intermediate language includes:

generating an abstract syntax tree, a control flow graph and a program dependency graph of the binary file based on the target program and the intermediate language; wherein, the basic block information in the control flow graph includes the out degree and the in degree of the basic block, the number of instructions and whether instrumentation codes are included, and the information in the program dependency graph includes: data dependency and control dependency information of the basic block;

and merging the abstract syntax tree, the control flow graph and the program dependency graph into a code attribute graph.

Further, the obtaining an instrumentation detection result of the target program according to the actual instrumentation result and the expected instrumentation result of each basic block includes:

comparing the actual instrumentation result with the expected instrumentation result one by one, basic block by basic block;

if the comparison result of every basic block is consistent, the instrumentation detection result of the target program is that the instrumentation is correct;

and if the comparison result of at least one basic block is not consistent, the instrumentation detection result of the target program is an instrumentation error.

A program instrumentation detection apparatus based on machine learning, the apparatus comprising:

the file acquisition module is used for acquiring a target program and performing instrumentation on the target program to obtain a binary file;

the file conversion module is used for converting the binary file into an intermediate language representation;

the first detection module is used for calculating the feature vector of each basic block in the binary file based on the intermediate language representation and sending the feature vector to a machine learning model so as to identify the actual instrumentation result of each basic block;

the second detection module is used for generating a code attribute graph of the binary file based on the target program and the intermediate language representation, learning the code attribute graph by using a graph neural network, and performing linear transformation on the learned node feature vectors so as to judge the expected instrumentation result of each basic block according to the score of each node; the nodes in the code attribute graph are basic blocks, and edges in the code attribute graph are constructed based on the dependency relationship between control flow and data flow among the basic blocks;

and the result generation module is used for obtaining the instrumentation detection result of the target program according to the actual instrumentation result and the expected instrumentation result of each basic block.

A storage medium having a computer program stored therein, wherein the computer program is arranged to perform any of the above methods when executed.

An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform any of the methods described above.

Compared with the prior art, the invention has the following advantages and positive effects:

1. The invention automatically identifies instrumentation code by extracting the code features of the target program and inputting them into a machine learning model, thereby achieving a general capability to identify instrumentation code across various static program instrumentation tools.

2. The invention extracts and codes LLVM IR code characteristics of a basic block of a target program, and converts the target program into an intermediate language irrelevant to the architecture so as to realize cross-architecture instrumentation code identification.

3. By generating the abstract syntax tree, the control flow graph and the program dependency graph of the target program and merging them into a code attribute graph, the invention allows the features input into the model to represent the target program more comprehensively, which improves the accuracy of instrumentation error identification.

Drawings

FIG. 1 is a flowchart of an embodiment of a static program instrumentation detection method based on machine learning according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The overall flow of the static program instrumentation detection method based on machine learning of the present invention is shown in FIG. 1.

As shown in FIG. 1, the static program instrumentation detection method based on machine learning includes the following steps:

1. Selecting suitable experimental samples and constructing a binary file data set containing cross-architecture static instrumentation

Target program samples are collected from Google FuzzBench and compiled for the x86, x86_64 and AArch64 architectures using the instrumentation tool in AFL, yielding statically instrumented binary files for the different architectures.

2. Extracting feature information from the experimental samples

The binary files in the data set are converted to the LLVM IR intermediate language by the McSema tool. The operation codes of the instructions in each basic block, the operands of the instructions, and the instruction sequence are selected as the extracted features. The operation codes include alloca, store, load, icmp and the like; the operands include immediates and variables; the instruction sequence is the sequence of operation codes ordered by the appearance order of the instructions in the basic block.
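As an illustration of this step, the following is a minimal sketch of extracting operation codes, operands, and the instruction sequence from the textual LLVM IR of one basic block. The regular expressions and the function names `parse_instruction` and `extract_block_features` are hypothetical and deliberately simplified; they are not taken from the patent or from McSema, and real LLVM IR contains many operand forms (labels, aggregate types, constants) that this sketch does not distinguish.

```python
import re

# Hypothetical, simplified patterns: real LLVM IR has many more operand forms.
ASSIGNMENT = re.compile(r"^[%@][\w.]+\s*=\s*(.*)$")
ALIGN_SUFFIX = re.compile(r",\s*align\s+\d+")
OPERAND = re.compile(
    r"(?P<var>[%@][\w.]+)"                    # named or numbered value
    r"|(?P<imm>(?<![\w%@.])-?\d+(?![\w.]))"   # bare integer literal
)


def parse_instruction(line):
    """Return (opcode, operands) for one IR line, or None if it is not an instruction."""
    text = line.split(";", 1)[0].strip()      # drop trailing comments
    if not text or text.endswith(":"):        # skip blank lines and block labels
        return None
    match = ASSIGNMENT.match(text)
    if match:                                 # "%x = opcode ..." form
        text = match.group(1)
    text = ALIGN_SUFFIX.sub("", text)         # alignment is not treated as an operand
    opcode, _, rest = text.partition(" ")
    operands = [(m.lastgroup, m.group()) for m in OPERAND.finditer(rest)]
    return opcode, operands


def extract_block_features(block_text):
    """Collect the opcode sequence and per-instruction operands of one basic block."""
    parsed = [p for p in map(parse_instruction, block_text.splitlines()) if p]
    return {
        "sequence": [opcode for opcode, _ in parsed],
        "operands": [operands for _, operands in parsed],
    }


if __name__ == "__main__":
    block = """
    entry:
      %1 = alloca i32, align 4
      store i32 0, i32* %1, align 4
      %2 = load i32, i32* %1, align 4
      %3 = icmp eq i32 %2, 0
    """
    print(extract_block_features(block))
```

Each operand is tagged `"imm"` or `"var"`, which matches the immediate/variable distinction described above; the opcode list in `"sequence"` preserves instruction order within the block.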

3. Encoding the feature information extracted from the data set

Among the extracted features, the instruction operation codes are sorted by their occurrence frequency in the data set, with operation codes of equal frequency sorted alphabetically, and each operation code is encoded as its serial number in that order. Each instruction operand is encoded according to its length: an immediate operand is assigned 0, and a variable operand is encoded as the byte length of the variable. The instruction sequence is encoded as the vector of operation code encodings, in the order in which the instructions appear in the basic block.
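The encoding rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say whether the frequency sort is ascending or descending, or at which number the serial numbers start, so the sketch assumes descending frequency and numbering from 1, and reserves 0 for operation codes not seen in the training set. All function names are hypothetical.

```python
from collections import Counter


def build_opcode_table(blocks):
    """Map each opcode to a serial number.

    Assumptions (not stated in the source): more frequent opcodes get smaller
    numbers, numbering starts at 1, and ties are broken alphabetically.
    """
    counts = Counter(op for block in blocks for op in block)
    ordered = sorted(counts, key=lambda op: (-counts[op], op))
    return {op: number for number, op in enumerate(ordered, start=1)}


def encode_operand(kind, byte_length=0):
    """Immediates are encoded as 0; variables as their byte length."""
    return 0 if kind == "imm" else byte_length


def encode_sequence(opcodes, table, unknown=0):
    """Encode an instruction sequence as the vector of opcode serial numbers.

    Using 0 for unseen opcodes is an assumption made for this sketch.
    """
    return [table.get(op, unknown) for op in opcodes]


if __name__ == "__main__":
    training_blocks = [
        ["alloca", "store", "load", "icmp"],
        ["load", "store", "load"],
    ]
    table = build_opcode_table(training_blocks)
    print(table)
    print(encode_sequence(["alloca", "store", "load", "icmp"], table))
    print(encode_operand("imm"), encode_operand("var", byte_length=4))
```

In the example, `load` occurs three times and `store` twice, so they receive the first two serial numbers; `alloca` and `icmp` occur once each and are ordered alphabetically.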

4. Features are fed into a machine learning model for training to identify actual instrumentation results in the basic blocks

The feature encodings obtained above are used to train a model with the word2vec machine learning method, and the training effect of each model on the data set is evaluated by accuracy, AUC, recall, and F1 score. The processed feature data are divided evenly into ten parts; in turn, one part is selected as the test set and the other nine parts serve as the training set, and the model judges whether instrumentation code exists in each basic block.
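The ten-part split and the evaluation metrics described above can be sketched in plain Python as follows. This is only an illustration: the interleaved assignment of samples to folds is an assumption (the text only says the data are divided evenly), the function names are hypothetical, and AUC is omitted because it requires ranked scores rather than hard labels.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs, using each fold as the test set once.

    Samples are assigned to folds in an interleaved way; this is an assumption
    of the sketch, and shuffling beforehand would be an equally valid choice.
    """
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels where 1 means 'instrumented'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    for train, test in k_fold_indices(20, k=10):
        print(len(train), test)
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```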

5. Constructing a code attribute graph for an experimental sample

The corresponding abstract syntax tree, control flow graph and program dependency graph are generated based on the target binary program and the LLVM IR intermediate language generated in the second step. The basic block information in the control flow graph includes the out-degree and in-degree of the basic block, the number of instructions it contains, whether it contains instrumentation code, and the like. The information in the program dependency graph includes the data dependency and control dependency information of the basic blocks. Finally, the generated abstract syntax tree, control flow graph and program dependency graph are merged into a code attribute graph, in which the basic blocks are the graph nodes, the initial node is the entry basic block of the target program, and the control flow and data flow dependency relationships among the basic blocks are the edges between the graph nodes.
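A minimal sketch of the merged graph structure is given below, assuming the control flow edges and dependency edges between basic blocks have already been computed by some other analysis. The `BlockNode` class, the edge labels, and the function name are hypothetical, and the abstract syntax tree is left out for brevity, so this shows only the block-level node and edge layout rather than a full code attribute graph.

```python
from dataclasses import dataclass


@dataclass
class BlockNode:
    """Per-basic-block attributes named in the text above."""
    name: str
    num_instructions: int
    has_instrumentation: bool
    in_degree: int = 0
    out_degree: int = 0


def build_code_attribute_graph(blocks, cfg_edges, pdg_edges):
    """Merge control flow and dependency edges over basic-block nodes.

    blocks:    {name: (num_instructions, has_instrumentation)}
    cfg_edges: iterable of (src, dst)
    pdg_edges: iterable of (src, dst, kind), kind being e.g. "data_dep" or "control_dep"
    The labels are illustrative; AST information is omitted in this sketch.
    """
    nodes = {
        name: BlockNode(name, count, instrumented)
        for name, (count, instrumented) in blocks.items()
    }
    edges = set()
    for src, dst in sorted(set(cfg_edges)):   # de-duplicate before counting degrees
        edges.add((src, dst, "control_flow"))
        nodes[src].out_degree += 1
        nodes[dst].in_degree += 1
    for src, dst, kind in pdg_edges:
        edges.add((src, dst, kind))
    return nodes, sorted(edges)


if __name__ == "__main__":
    blocks = {"entry": (4, True), "then": (2, False), "exit": (1, True)}
    cfg = [("entry", "then"), ("entry", "exit"), ("then", "exit")]
    pdg = [("entry", "exit", "data_dep")]
    nodes, edges = build_code_attribute_graph(blocks, cfg, pdg)
    print(nodes["entry"])
    print(edges)
```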

6. Inputting the constructed code attribute graph into a machine learning model for training, and realizing detection of an expected instrumentation result

A graph neural network machine learning method is selected to train on the data set using the features of the code attribute graph obtained above. After training is finished, a linear transformation is applied to the vector representation of each node to obtain the score of each basic block, which is used to judge whether instrumentation is expected to exist in that basic block. Accuracy, AUC, recall, and F1 score are used to evaluate the training effect of each model on the data set. The processed feature data are divided evenly into ten parts; in turn, one part is selected as the test set and the other nine parts serve as the training set, and the model judges whether instrumentation is expected to exist.
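The scoring step can be illustrated with a small sketch: one round of neighbor aggregation stands in for the learned graph neural network layers, followed by a linear transformation that turns each node vector into a score. Everything here is an assumption made for illustration: the mean aggregation, the hand-picked weights, the zero threshold, and the function names are not taken from the patent, which does not specify the network architecture, and in practice the weights would be learned during training.

```python
def aggregate(features, edges):
    """One round of mean aggregation over each node and its predecessors.

    This stands in for a trained graph neural network layer; a real model
    would apply learned weights and non-linearities.
    """
    sources = {node: [node] for node in features}
    for src, dst in edges:
        sources[dst].append(src)
    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[s][d] for s in srcs) / len(srcs) for d in range(dim)]
        for node, srcs in sources.items()
    }


def score_nodes(hidden, weights, bias):
    """Linear transformation of each node vector into a scalar score."""
    return {
        node: sum(w * x for w, x in zip(weights, vec)) + bias
        for node, vec in hidden.items()
    }


def expected_instrumentation(scores, threshold=0.0):
    """Judge instrumentation as expected when the score exceeds the threshold."""
    return {node: score > threshold for node, score in scores.items()}


if __name__ == "__main__":
    features = {"entry": [1.0, 0.0], "then": [0.0, 1.0], "exit": [0.0, 1.0]}
    edges = [("entry", "then"), ("entry", "exit"), ("then", "exit")]
    hidden = aggregate(features, edges)
    scores = score_nodes(hidden, weights=[1.0, -1.0], bias=0.0)  # illustrative weights
    print(expected_instrumentation(scores))
```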

7. Detecting instrumentation errors based on the actual instrumentation results and the expected instrumentation results

The actual instrumentation result is compared with the expected instrumentation result basic block by basic block: if the two are consistent for a basic block, no instrumentation error exists in that basic block; if they are inconsistent, an instrumentation error exists in that basic block. Finally, an instrumentation error detection report of the target program is output.
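The comparison rule above amounts to a per-block equality check, sketched below. The report format and the function name are hypothetical; treating a basic block that appears in only one of the two results as a mismatch is an assumption of this sketch, since the text does not cover that case.

```python
def detect_instrumentation_errors(actual, expected):
    """Compare actual and expected instrumentation results block by block.

    actual, expected: {block_name: bool}, True meaning instrumentation is
    present (actual) or should be present (expected).
    Blocks missing from either mapping are reported as errors (assumption).
    """
    all_blocks = set(actual) | set(expected)
    error_blocks = sorted(
        block for block in all_blocks if actual.get(block) != expected.get(block)
    )
    return {"correct": not error_blocks, "error_blocks": error_blocks}


if __name__ == "__main__":
    actual = {"entry": True, "then": False, "exit": True}
    expected = {"entry": True, "then": True, "exit": True}
    print(detect_instrumentation_errors(actual, expected))
```

The target program is reported as correctly instrumented only when the list of mismatching blocks is empty, which matches the rule that a single inconsistent basic block is enough to mark the result as an instrumentation error.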

In summary, the static program instrumentation error detection method based on machine learning provided by the invention fills the gap in static program instrumentation error detection. This cross-architecture, automatic error detection method can be applied to multiple static program instrumentation tools and can be used to evaluate the accuracy of existing static program instrumentation methods.

The invention also discloses a program instrumentation detection device based on machine learning, which can be a computer device or can be arranged in a computer device. The device includes a file acquisition module, a file conversion module, a first detection module, a second detection module and a result generation module.

the second detection module is used for generating a code attribute graph of the binary file based on the target program and the intermediate language representation, learning the code attribute graph by using a graph neural network, and performing linear transformation on the learned node feature vectors so as to judge an expected instrumentation result of each basic block according to the score of each node; the nodes in the code attribute graph are basic blocks, and edges in the code attribute graph are constructed based on the dependency relationship between control flows and data flows among the basic blocks;

For the explanation of the specific execution process, beneficial effects, etc. of the device module, please refer to the description of the above method embodiment, which is not described herein again.

In an exemplary embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the computer program being loaded and executed by the processor to implement the above-mentioned machine learning-based program instrumentation detection method.

In an exemplary embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which, when being executed by a processor, implements the machine learning-based program instrumentation detection method as described above.

In an exemplary embodiment, a computer program product is also provided, which, when run on a computer device, causes the computer device to perform the machine learning based program instrumentation detection method as described above.

Although specific embodiments of the invention and the accompanying drawings have been disclosed for illustrative purposes, to provide a further understanding of the invention, those skilled in the art will appreciate that various substitutions, alterations, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what is disclosed in the preferred embodiments and the drawings; the scope of protection is defined by the appended claims.

Claims

1. A program instrumentation detection method based on machine learning, the method comprising:

converting the binary file into an intermediate language representation;

generating a code attribute graph based on the target program and the intermediate language representation, learning the code attribute graph by using a graph neural network, and performing linear transformation on the learned node feature vectors to judge an expected instrumentation result of each basic block according to the score of each node; the nodes in the code attribute graph are basic blocks, and edges in the code attribute graph are constructed based on the dependency relationship between control flows and data flows among the basic blocks;

2. The method of claim 1, wherein the intermediate language representation comprises: LLVM IR intermediate language representation.

3. The method of claim 1, wherein said computing a feature vector for each basic block in the binary file based on the intermediate language comprises:

4. The method of claim 3, wherein the opcode comprises: alloca, store, load, and icmp.

5. The method of claim 3, wherein the operands comprise: immediate and variable.

6. The method of claim 1, wherein generating the code attribute graph for the binary file based on the target program and the intermediate language comprises:

7. The method of claim 1, wherein obtaining instrumentation detection results for the target program based on the actual instrumentation results and the expected instrumentation results for each basic block comprises:

8. A program instrumentation detection apparatus based on machine learning, the apparatus comprising:

the second detection module is used for generating a code attribute graph of the binary file based on the target program and the intermediate language representation, learning the code attribute graph by using a graph neural network, and performing linear transformation on the learned node feature vectors so as to judge the expected instrumentation result of each basic block according to the score of each node; the nodes in the code attribute graph are basic blocks, and edges in the code attribute graph are constructed based on the dependency relationship between control flows and data flows among the basic blocks;

9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when executed, perform the method of any of claims 1-7.

10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to execute the computer program to perform the method according to any of claims 1-7.