CN115576840A - Static program instrumentation detection method and device based on machine learning

Info

Publication number
CN115576840A
Authority
CN
China
Prior art keywords
basic block
instrumentation
result
program
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211357366.5A
Other languages
Chinese (zh)
Other versions
CN115576840B (en)
Inventor
刘昱玮
余媛萍
贾相堃
苏璞睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN202211357366.5A
Publication of CN115576840A
Application granted
Publication of CN115576840B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Prevention of errors by analysis, debugging or testing of software
    • G06F11/362 Debugging of software
    • G06F11/3644 Debugging of software by instrumenting at runtime
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Prevention of errors by analysis, debugging or testing of software
    • G06F11/3604 Analysis of software for verifying properties of programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Prevention of errors by analysis, debugging or testing of software
    • G06F11/362 Debugging of software
    • G06F11/3636 Debugging of software by tracing the execution of the program
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a machine-learning-based program instrumentation detection method and device, and belongs to the technical field of network security. The method comprises: acquiring a target program; instrumenting the target program to obtain a binary file; converting the binary file into an intermediate language representation; computing a feature vector for each basic block in the binary file based on the intermediate language representation, and feeding the feature vectors into a machine learning model to identify the actual instrumentation result of each basic block; generating a code property graph based on the target program and the intermediate language representation, learning the code property graph with a graph neural network, and applying a linear transformation to the learned node feature vectors so that the expected instrumentation result of each basic block can be judged from the score of each node; and obtaining the instrumentation detection result of the target program from the actual and expected instrumentation results of each basic block. The invention can be used to evaluate the accuracy of existing static program instrumentation methods.

Description

Static program instrumentation detection method and device based on machine learning

Technical Field

The invention belongs to the technical field of network security, and in particular relates to a machine-learning-based program instrumentation detection method and device.

Background

Program instrumentation is the process of inserting code fragments that report specific information or perform specific functions into a target program without breaking the integrity of its original execution logic. Instrumentation supplies the information that program analysis and vulnerability discovery techniques, including taint analysis, symbolic execution, and fuzz testing, depend on. Among instrumentation approaches, static program instrumentation has become popular because of its efficiency advantage.

Existing improvements to static program instrumentation focus on increasing the amount of information that instrumentation collects and on improving its execution efficiency. For example, Angora, developed by Peng Chen et al. of ShanghaiTech University, and TortoiseFuzz, developed by Yanhao Wang et al. of the University of Chinese Academy of Sciences, improve vulnerability discovery by collecting more runtime information, such as the program's execution context and its number of memory accesses. SelectiveTaint, developed by Sanchuan Chen et al. of Ohio State University, improves the execution efficiency of taint analysis by inserting taint-analysis instrumentation code only for specific instructions.

Although these improvements raise the capability and efficiency of program instrumentation, they also place higher demands on its precision, and no research on the accuracy of static program instrumentation has been proposed so far.

Summary of the Invention

Existing research on static program instrumentation pays no attention to its accuracy. To address this, the purpose of the present invention is to provide a machine-learning-based method for detecting static program instrumentation errors. The method extracts code features of the target program's basic blocks through static analysis, feeds the extracted basic block features into word2vec to identify the instrumentation status of each basic block, and converts the program into a code property graph that is fed into a graph neural network to identify instrumentation errors.

The technical content of the present invention includes:

A machine-learning-based program instrumentation detection method, the method comprising:

acquiring a target program, and instrumenting the target program to obtain a binary file;

converting the binary file into an intermediate language representation;

computing, based on the intermediate language representation, a feature vector for each basic block in the binary file, and feeding the feature vectors into a machine learning model to identify the actual instrumentation result of each basic block;

generating a code property graph based on the target program and the intermediate language representation, learning the code property graph with a graph neural network, and applying a linear transformation to the learned node feature vectors so as to judge the expected instrumentation result of each basic block from the score of each node; wherein the nodes of the code property graph are basic blocks, and the edges of the code property graph are built from the control flow and data flow dependencies between basic blocks;

obtaining the instrumentation detection result of the target program from the actual instrumentation result and the expected instrumentation result of each basic block.

Further, the intermediate language representation includes an LLVM IR intermediate language representation.

Further, computing the feature vector of each basic block in the binary file based on the intermediate language comprises:

selecting the features of each basic block based on the intermediate language; the features include instruction opcodes, instruction operands, and an instruction sequence; wherein the instruction sequence is the sequence of instruction opcodes ordered by the order in which the instructions appear in the basic block;

obtaining the serial number of each instruction opcode from an opcode encoding table, and obtaining the feature vector of the instruction opcode from that serial number; wherein the opcode encoding table is built from the frequency of occurrence of each instruction opcode in the training set and from alphabetical order;

obtaining the feature vector of each instruction operand from the operand length of the instruction operand;

computing the feature vector of the instruction sequence from the feature vectors of each basic block in the instruction sequence;

obtaining the feature vector of the basic block from the feature vector of the instruction opcodes, the feature vector of the instruction operands, and the feature vector of the instruction sequence.

Further, the opcodes include alloca, store, load, and icmp.

Further, the operands include immediates and variables.

Further, generating the code property graph of the binary file based on the target program and the intermediate language comprises:

generating an abstract syntax tree, a control flow graph, and a program dependency graph of the binary file based on the target program and the intermediate language; wherein the basic block information in the control flow graph includes the out-degree and in-degree of the basic block, its number of instructions, and whether it contains instrumentation code, and the information in the program dependency graph includes the data dependency and control dependency information of the basic blocks;

merging the abstract syntax tree, the control flow graph, and the program dependency graph into the code property graph.

Further, obtaining the instrumentation detection result of the target program from the actual instrumentation result and the expected instrumentation result of each basic block comprises:

comparing the actual instrumentation result with the expected instrumentation result basic block by basic block;

if the comparison result of every basic block is a match, the instrumentation detection result is that the instrumentation is correct;

if the comparison result of at least one basic block is a mismatch, the instrumentation detection result of the target program is an instrumentation error.

A machine-learning-based program instrumentation detection device, the device comprising:

a file acquisition module, configured to acquire a target program and instrument the target program to obtain a binary file;

a file conversion module, configured to convert the binary file into an intermediate language representation;

a first detection module, configured to compute, based on the intermediate language representation, a feature vector for each basic block in the binary file, and to feed the feature vectors into a machine learning model to identify the actual instrumentation result of each basic block;

a second detection module, configured to generate a code property graph of the binary file based on the target program and the intermediate language representation, learn the code property graph with a graph neural network, and apply a linear transformation to the learned node feature vectors so as to judge the expected instrumentation result of each basic block from the score of each node; wherein the nodes of the code property graph are basic blocks, and the edges of the code property graph are built from the control flow and data flow dependencies between basic blocks;

a result generation module, configured to obtain the instrumentation detection result of the target program from the actual instrumentation result and the expected instrumentation result of each basic block.

A storage medium storing a computer program, wherein the computer program is configured to perform any one of the above methods when run.

An electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform any one of the above methods.

Compared with the prior art, the advantages and positive effects of the present invention are as follows:

1. The invention identifies instrumentation code automatically by extracting code features of the target program and feeding them into a machine learning model, which gives it a general ability to recognize the instrumentation code produced by a variety of static program instrumentation tools.

2. The invention extracts and encodes LLVM IR code features of the target program's basic blocks, converting the target program into an architecture-independent intermediate language and thereby enabling cross-architecture recognition of instrumentation code.

3. The invention generates the abstract syntax tree, control flow graph, and program dependency graph of the target program and merges them into a code property graph, so that the features fed into the model represent the target program more completely, which improves the accuracy of instrumentation error recognition.

Brief Description of the Drawings

FIG. 1 is a flowchart of one embodiment of the machine-learning-based static program instrumentation detection method of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The overall flow of the machine-learning-based static program instrumentation detection method of the present invention is shown in FIG. 1.

As shown in FIG. 1, the machine-learning-based static program instrumentation detection method includes the following steps:

1. Select suitable experimental samples and build a cross-architecture dataset of statically instrumented binary files

Target program samples are collected from the Google FuzzBench database and compiled with the instrumentation tool in AFL for the x86, x86_64, and AArch64 architectures, yielding statically instrumented binary files for each architecture.

2. Extract feature information from the experimental samples

The binary files in the dataset are converted into the LLVM IR intermediate language with the McSema tool. The extracted features are the opcodes of the instructions in each basic block, the operands of those instructions, and the instruction sequence. Opcodes include alloca, store, load, icmp, and so on; operands include immediates and variables; the instruction sequence is the sequence of opcodes ordered by the order in which the instructions appear in the basic block.
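The feature extraction in step 2 can be illustrated with a minimal Python sketch. It assumes the LLVM IR is available as text and handles only the body of a single function; the helper name `opcode_sequences`, the regular expressions, and the `SAMPLE` snippet are hypothetical and are not taken from the patent or from McSema.

```python
import re

# Hypothetical helper, not part of the patent: splits the textual body of one
# LLVM IR function into basic blocks and records each instruction's opcode
# in order of appearance (the "instruction sequence" feature).
_ASSIGN = re.compile(r"^%[\w.]+\s*=\s*")  # strips a leading "%name = "
_LABEL = re.compile(r"^([\w.]+):")        # matches a block label such as "then:"

def opcode_sequences(ir_body: str) -> dict[str, list[str]]:
    blocks: dict[str, list[str]] = {}
    current = "entry"  # assumed name for an unlabeled first block
    for raw in ir_body.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        label = _LABEL.match(line)
        if label:
            current = label.group(1)
            blocks.setdefault(current, [])
            continue
        line = _ASSIGN.sub("", line)
        blocks.setdefault(current, []).append(line.split()[0])
    return blocks

# Illustrative IR fragment written for this sketch.
SAMPLE = """
  %x = alloca i32
  store i32 0, i32* %x
  %v = load i32, i32* %x
  %c = icmp eq i32 %v, 0
  br i1 %c, label %then, label %done
then:
  store i32 1, i32* %x
  br label %done
done:
  ret void
"""

if __name__ == "__main__":
    for name, ops in opcode_sequences(SAMPLE).items():
        print(name, ops)
```

A production pipeline would more likely walk the IR through LLVM's own APIs than parse text; the text-based approach is used here only to keep the sketch self-contained.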

3. Encode the feature information extracted from the dataset

Among the extracted features, instruction opcodes are sorted by their frequency of occurrence in the dataset, with opcodes of equal frequency sorted alphabetically, and each opcode is encoded as its serial number in that ordering. Instruction operands are encoded by operand length as the corresponding byte length: an immediate operand is assigned 0, and a variable operand is encoded as the byte length of the variable. The instruction sequence is encoded as the vector of opcode encodings of the corresponding basic block.
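The encoding rules of step 3 can be sketched as follows. Several details are assumptions made for illustration because the text does not fix them: opcodes are ranked from most to least frequent, serial numbers start at 1, and 0 is reserved for opcodes that never appeared in the training data. All function names are hypothetical.

```python
from collections import Counter

def build_opcode_table(training_opcodes: list[str]) -> dict[str, int]:
    """Rank opcodes by descending frequency, breaking ties alphabetically.

    Descending order and 1-based serial numbers are assumptions.
    """
    counts = Counter(training_opcodes)
    ranked = sorted(counts, key=lambda op: (-counts[op], op))
    return {op: index for index, op in enumerate(ranked, start=1)}

def encode_operand(kind: str, byte_length: int) -> int:
    """Immediates encode to 0; variables encode to their length in bytes."""
    return 0 if kind == "immediate" else byte_length

def encode_sequence(opcodes: list[str], table: dict[str, int]) -> list[int]:
    """Encode a block's opcode sequence; 0 marks an opcode unseen in training."""
    return [table.get(op, 0) for op in opcodes]

if __name__ == "__main__":
    table = build_opcode_table(["store", "load", "store", "alloca", "load", "icmp"])
    print(table)
    print(encode_sequence(["alloca", "store", "br"], table))
```

Sorting on the key `(-count, opcode)` applies both ordering rules in a single pass, so the table is deterministic for a given training set.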

4. Feed the features into a machine learning model and train it to identify the actual instrumentation result of each basic block

The word2vec machine learning method is used to train on the dataset using the feature encodings obtained above, and accuracy, AUC, recall, and F1 score are used to evaluate how well each model trains on the dataset. The processed features are divided evenly into ten parts; in turn, one part is used as the test set and the remaining nine as the training set, and the model judges whether instrumentation code is present in each basic block.
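The ten-part rotation described above is ten-fold cross-validation. A minimal sketch of the split is shown below; assigning samples to folds by striding is an assumption, since the text only says the data is divided evenly, and `ten_fold_splits` is a hypothetical name.

```python
def ten_fold_splits(samples: list, folds: int = 10):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    # Striding keeps fold sizes within one element of each other.
    parts = [samples[i::folds] for i in range(folds)]
    for held_out in range(folds):
        test = parts[held_out]
        train = [s for i, part in enumerate(parts) if i != held_out for s in part]
        yield train, test

if __name__ == "__main__":
    for train, test in ten_fold_splits(list(range(20))):
        print(len(train), test)
```

In practice the samples would usually be shuffled before splitting so that each fold mixes programs and architectures.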

5. Build code property graphs for the experimental samples

The abstract syntax tree, control flow graph, and program dependency graph are generated from the target binary program and the LLVM IR produced in step 2. The basic block information in the control flow graph includes the block's out-degree, its in-degree, the number of instructions it contains, and whether it contains instrumentation code. The program dependency graph holds the data dependency and control dependency information of the basic blocks. Finally, the abstract syntax tree, control flow graph, and program dependency graph are merged into a code property graph. In the merged graph the nodes are basic blocks, the initial node is the entry basic block of the target program, and the edges are the control flow and data flow dependencies between basic blocks.
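One possible in-memory shape for the merged graph is sketched below. It assumes that control flow and data dependency edges have already been computed elsewhere and, for brevity, leaves out the abstract syntax tree; the class and function names (`BlockNode`, `CodePropertyGraph`, `merge_graphs`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class BlockNode:
    """Per-block attributes that the text lists for the control flow graph."""
    name: str
    instruction_count: int
    instrumented: bool
    in_degree: int = 0
    out_degree: int = 0

@dataclass
class CodePropertyGraph:
    entry: str  # name of the entry basic block (the initial node)
    nodes: dict[str, BlockNode] = field(default_factory=dict)
    # Each edge is (source block, destination block, kind).
    edges: set[tuple[str, str, str]] = field(default_factory=set)

def merge_graphs(entry, blocks, cfg_edges, data_dep_edges):
    """Combine control flow and data dependency edges over the same block nodes."""
    graph = CodePropertyGraph(entry=entry, nodes={b.name: b for b in blocks})
    for src, dst in cfg_edges:
        graph.edges.add((src, dst, "control"))
        graph.nodes[src].out_degree += 1  # degrees are counted on control flow edges
        graph.nodes[dst].in_degree += 1
    for src, dst in data_dep_edges:
        graph.edges.add((src, dst, "data"))
    return graph
```

Tagging each edge with its kind lets a downstream model treat control flow and data flow edges differently, or ignore the distinction.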

6. Feed the code property graphs into a machine learning model and train it to detect the expected instrumentation result

A graph neural network is trained on the code property graph features obtained above. After training, the vector representation of each node is passed through a linear transformation to obtain a score for each basic block, and the score is used to judge whether instrumentation is expected in that block. Accuracy, AUC, recall, and F1 score are used to evaluate how well each model trains on the dataset. The processed features are divided evenly into ten parts; in turn, one part is used as the test set and the remaining nine as the training set, and the model judges whether instrumentation is expected.
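The scoring stage of step 6 can be illustrated with a deliberately small sketch. The text does not specify the graph neural network architecture, so the single round of mean aggregation over predecessors, the fixed weights, and the decision threshold below are all assumptions chosen for illustration; a real model would learn its parameters during training.

```python
def aggregate(features, edges):
    """One message-passing round: average each node with its predecessors."""
    incoming = {node: [node] for node in features}
    for src, dst in edges:
        incoming[dst].append(src)
    updated = {}
    for node, sources in incoming.items():
        width = len(features[node])
        updated[node] = [
            sum(features[s][k] for s in sources) / len(sources) for k in range(width)
        ]
    return updated

def score_nodes(embeddings, weights, bias):
    """Linear transformation mapping each node embedding to a scalar score."""
    return {
        node: sum(w * x for w, x in zip(weights, vec)) + bias
        for node, vec in embeddings.items()
    }

def expected_instrumentation(scores, threshold=0.0):
    """A block is expected to be instrumented when its score exceeds the threshold."""
    return {node: score > threshold for node, score in scores.items()}

if __name__ == "__main__":
    features = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
    embeddings = aggregate(features, [("a", "b")])
    scores = score_nodes(embeddings, weights=[1.0, -1.0], bias=0.0)
    print(expected_instrumentation(scores))
```

The linear layer reduces each learned node vector to one number per basic block, which is what allows a per-block yes/no decision.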

7. Detect instrumentation errors from the actual and expected instrumentation results

The actual instrumentation result is compared with the expected instrumentation result basic block by basic block: if they match, the basic block has no instrumentation error; if they do not match, the basic block has an instrumentation error. Finally, an instrumentation error detection report for the target program is output.
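The comparison in step 7 amounts to a per-block equality check. The sketch below assumes both results are dictionaries keyed by basic block name and treats a block missing from the actual results as a mismatch; the function name and the report layout are hypothetical.

```python
def detect_instrumentation_errors(actual, expected):
    """Return whether instrumentation is correct and which blocks disagree."""
    # A block absent from `actual` yields None, which never equals True/False,
    # so it is reported as an error (an assumption of this sketch).
    mismatched = sorted(b for b in expected if actual.get(b) != expected[b])
    return {"correct": not mismatched, "error_blocks": mismatched}

if __name__ == "__main__":
    actual = {"entry": True, "then": False, "done": False}
    expected = {"entry": True, "then": True, "done": False}
    print(detect_instrumentation_errors(actual, expected))
```

The program-level verdict is correct only when the list of mismatched blocks is empty, which matches the rule that a single mismatching block marks the target program as an instrumentation error.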

In summary, the machine-learning-based static program instrumentation error detection method proposed by the present invention fills the gap in detecting static program instrumentation errors. It provides a cross-architecture, automated method for detecting instrumentation errors that works with a variety of static program instrumentation tools and evaluates the precision of existing static program instrumentation methods.

The invention also discloses a machine-learning-based program instrumentation detection device. The device may be a computer device or may be provided in a computer device. The device comprises a file acquisition module, a file conversion module, a first detection module, a second detection module, and a result generation module.

The file acquisition module is configured to acquire a target program and instrument the target program to obtain a binary file.

The file conversion module is configured to convert the binary file into an intermediate language representation.

The first detection module is configured to compute, based on the intermediate language representation, a feature vector for each basic block in the binary file, and to feed the feature vectors into a machine learning model to identify the actual instrumentation result of each basic block.

The second detection module is configured to generate a code property graph of the binary file based on the target program and the intermediate language representation, learn the code property graph with a graph neural network, and apply a linear transformation to the learned node feature vectors so as to judge the expected instrumentation result of each basic block from the score of each node; the nodes of the code property graph are basic blocks, and the edges of the code property graph are built from the control flow and data flow dependencies between basic blocks.

The result generation module is configured to obtain the instrumentation detection result of the target program from the actual instrumentation result and the expected instrumentation result of each basic block.

For the specific execution process and beneficial effects of the device modules, refer to the description of the method embodiment above; they are not repeated here.

In an exemplary embodiment, a computer device is also provided. The computer device includes a memory and a processor; the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the machine-learning-based program instrumentation detection method described above.

In an exemplary embodiment, a computer-readable storage medium is also provided, on which a computer program is stored; when executed by a processor, the computer program implements the machine-learning-based program instrumentation detection method described above.

In an exemplary embodiment, a computer program product is also provided; when the computer program product runs on a computer device, it causes the computer device to execute the machine-learning-based program instrumentation detection method described above.

Specific embodiments and drawings of the present invention are disclosed for illustration, to help readers understand the invention and implement it accordingly. Those skilled in the art will understand that various substitutions, changes, and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention is therefore not limited to what the preferred embodiments and drawings disclose, and its protection scope is defined by the claims.

Claims (10)

1. A program instrumentation detection method based on machine learning, the method comprising:
acquiring a target program, and performing instrumentation on the target program to obtain a binary file;
converting the binary file into an intermediate language representation;
calculating a feature vector of each basic block in the binary file based on the intermediate language representation, and sending the feature vector to a machine learning model to identify an actual instrumentation result of each basic block;
generating a code property graph based on the target program and the intermediate language representation, learning the code property graph by using a graph neural network, and performing a linear transformation on the learned node feature vectors to judge an expected instrumentation result of each basic block according to the score of each node; wherein the nodes in the code property graph are basic blocks, and the edges in the code property graph are constructed based on the control flow and data flow dependencies among the basic blocks;
and obtaining the instrumentation detection result of the target program according to the actual instrumentation result and the expected instrumentation result of each basic block.
2. The method of claim 1, wherein the intermediate language representation comprises: LLVM IR intermediate language representation.
3. The method of claim 1, wherein said computing a feature vector for each basic block in the binary file based on the intermediate language comprises:
selecting features in each basic block based on the intermediate language; the features include: instruction opcodes, instruction operands, and an instruction sequence; wherein the instruction sequence is a sequence in which the instruction opcodes are ordered according to the order in which the instructions appear in the basic block;
acquiring a serial number of the instruction opcode according to an opcode encoding table, and acquiring a feature vector of the instruction opcode based on the serial number; wherein the opcode encoding table is constructed based on the occurrence frequency and the alphabetical order of each instruction opcode in the training set;
acquiring a feature vector of the instruction operand according to the operand length of the instruction operand;
calculating a feature vector of the instruction sequence according to the feature vector of each instruction operation code in the instruction sequence;
and obtaining the feature vector of the basic block based on the feature vector of the instruction operation code, the feature vector of the instruction operand and the feature vector of the instruction sequence.
4. The method of claim 3, wherein the instruction operation codes comprise: alloca, store, load, and icmp.
5. The method of claim 3, wherein the instruction operands comprise: immediate values and variables.
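As an illustration only (not part of the claims), the feature extraction of claims 3 to 5 can be sketched as follows. The claims do not fix the concrete encodings, so everything beyond the frequency-then-alphabetical ordering of the coding table is an assumption: the opcode feature is a count vector indexed by serial number, the operand feature is the operand count and total operand length in characters, and the sequence feature is the list of serial numbers padded to a hypothetical `max_len`.

```python
from collections import Counter
from typing import Dict, List, Tuple

# An instruction is (opcode, operands); operands are plain strings here.
Instruction = Tuple[str, List[str]]


def build_opcode_table(training_opcodes: List[str]) -> Dict[str, int]:
    """Number opcodes by descending frequency, ties broken alphabetically."""
    counts = Counter(training_opcodes)
    ordered = sorted(counts, key=lambda op: (-counts[op], op))
    return {op: index + 1 for index, op in enumerate(ordered)}  # 0 = unknown opcode


def block_features(
    block: List[Instruction], table: Dict[str, int], max_len: int = 6
) -> List[float]:
    """Concatenate opcode, operand, and sequence features for one basic block."""
    # Opcode feature: how often each serial number occurs in the block.
    opcode_vec = [0.0] * (len(table) + 1)
    for opcode, _ in block:
        opcode_vec[table.get(opcode, 0)] += 1.0

    # Operand feature: number of operands and their total length.
    lengths = [len(operand) for _, operands in block for operand in operands]
    operand_vec = [float(len(lengths)), float(sum(lengths))]

    # Sequence feature: serial numbers in order of appearance, padded or truncated.
    serials = [float(table.get(opcode, 0)) for opcode, _ in block][:max_len]
    sequence_vec = serials + [0.0] * (max_len - len(serials))

    return opcode_vec + operand_vec + sequence_vec


training = ["load", "store", "load", "alloca", "icmp", "store", "load"]
table = build_opcode_table(training)

block = [
    ("alloca", ["%x"]),
    ("store", ["0", "%x"]),
    ("load", ["%x"]),
    ("icmp", ["%v", "10"]),
]
vector = block_features(block, table)
```

In this toy training set `alloca` and `icmp` both occur once, so the alphabetical tie-break places `alloca` before `icmp` in the coding table.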
6. The method of claim 1, wherein generating the code attribute graph of the binary file based on the target program and the intermediate language representation comprises:
generating an abstract syntax tree, a control flow graph and a program dependency graph of the binary file based on the target program and the intermediate language representation; wherein the basic block information in the control flow graph includes the out-degree and in-degree of the basic block, the number of instructions, and whether instrumentation code is included, and the information in the program dependency graph includes the data dependency and control dependency information of the basic block;
and merging the abstract syntax tree, the control flow graph and the program dependency graph into a code attribute graph.
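As an illustration only (not part of the claims), the merge in claim 6 can be sketched as a graph whose nodes are basic blocks and whose edges remember which source graph contributed them. The class names, field names, and edge labels are hypothetical, and the abstract syntax tree is left out of this sketch for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


@dataclass
class BlockInfo:
    instruction_count: int
    has_instrumentation: bool
    in_degree: int = 0
    out_degree: int = 0


@dataclass
class CodeAttributeGraph:
    nodes: Dict[str, BlockInfo]
    # Each edge maps to the set of source graphs that contributed it.
    edges: Dict[Edge, Set[str]] = field(default_factory=dict)


def merge_graphs(
    nodes: Dict[str, BlockInfo],
    cfg_edges: List[Edge],
    data_dep_edges: List[Edge],
    control_dep_edges: List[Edge],
) -> CodeAttributeGraph:
    """Merge control flow and dependency edges over the same basic blocks."""
    graph = CodeAttributeGraph(nodes=nodes)
    labelled = (
        ("cfg", cfg_edges),
        ("data_dep", data_dep_edges),
        ("control_dep", control_dep_edges),
    )
    for label, edge_list in labelled:
        for edge in edge_list:
            graph.edges.setdefault(edge, set()).add(label)

    # In-degree and out-degree are filled in from control flow edges only.
    for src, dst in cfg_edges:
        nodes[src].out_degree += 1
        nodes[dst].in_degree += 1
    return graph


nodes = {
    "bb0": BlockInfo(instruction_count=4, has_instrumentation=True),
    "bb1": BlockInfo(instruction_count=2, has_instrumentation=False),
    "bb2": BlockInfo(instruction_count=3, has_instrumentation=True),
}
graph = merge_graphs(
    nodes,
    cfg_edges=[("bb0", "bb1"), ("bb0", "bb2"), ("bb1", "bb2")],
    data_dep_edges=[("bb0", "bb2")],
    control_dep_edges=[("bb0", "bb1")],
)
```

Keeping one labelled edge set per node pair, rather than three separate graphs, lets a downstream model see control flow and dependency information in a single structure.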
7. The method of claim 1, wherein obtaining instrumentation detection results for the target program based on the actual instrumentation results and the expected instrumentation results for each basic block comprises:
comparing the actual instrumentation result with the expected instrumentation result of each basic block one by one;
if the comparison result of every basic block is consistent, the instrumentation detection result of the target program is that the instrumentation is correct;
and if the comparison result of at least one basic block is not consistent, the instrumentation detection result of the target program is an instrumentation error.
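As an illustration only (not part of the claims), the comparison in claim 7 reduces to a per-block equality check. The function name and the boolean encoding of results are hypothetical; treating a block that is missing from either input as a mismatch is also an assumption, since the claim does not cover that case.

```python
from typing import Dict, List, Tuple


def detect_instrumentation(
    actual: Dict[str, bool], expected: Dict[str, bool]
) -> Tuple[bool, List[str]]:
    """Return (is_correct, ids of blocks whose results disagree)."""
    mismatched = sorted(
        block
        for block in actual.keys() | expected.keys()
        if actual.get(block) != expected.get(block)
    )
    return (not mismatched, mismatched)


actual = {"bb0": True, "bb1": False, "bb2": False}
expected = {"bb0": True, "bb1": False, "bb2": True}

correct, mismatched = detect_instrumentation(actual, expected)
```

Returning the mismatched block ids alongside the verdict is a convenience for locating the faulty instrumentation; the claim itself only requires the overall correct or error result.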
8. A program instrumentation detection apparatus based on machine learning, the apparatus comprising:
a file acquisition module, configured to acquire a target program and instrument the target program to obtain a binary file;
a file conversion module, configured to convert the binary file into an intermediate language representation;
a first detection module, configured to calculate a feature vector of each basic block in the binary file based on the intermediate language representation, and send the feature vector to a machine learning model to identify an actual instrumentation result of each basic block;
a second detection module, configured to generate a code attribute graph of the binary file based on the target program and the intermediate language representation, learn the code attribute graph by using a graph neural network, and perform a linear transformation on the learned node feature vectors to judge an expected instrumentation result of each basic block according to the score of each node; wherein the nodes in the code attribute graph are basic blocks, and the edges in the code attribute graph are constructed based on the control-flow and data-flow dependency relationships among the basic blocks;
and a result generation module, configured to obtain the instrumentation detection result of the target program according to the actual instrumentation result and the expected instrumentation result of each basic block.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when executed, perform the method of any of claims 1-7.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to execute the computer program to perform the method according to any of claims 1-7.
CN202211357366.5A 2022-11-01 2022-11-01 Static program instrumentation detection method and device based on machine learning Active CN115576840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211357366.5A CN115576840B (en) 2022-11-01 2022-11-01 Static program instrumentation detection method and device based on machine learning

Publications (2)

Publication Number Publication Date
CN115576840A true CN115576840A (en) 2023-01-06
CN115576840B CN115576840B (en) 2023-04-18

Family

ID=84589190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211357366.5A Active CN115576840B (en) 2022-11-01 2022-11-01 Static program instrumentation detection method and device based on machine learning

Country Status (1)

Country Link
CN (1) CN115576840B (en)

Citations (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060869B1 (en) * 2007-06-08 2011-11-15 Oracle America, Inc. Method and system for detecting memory problems in user programs
CN104536898A (en) * 2015-01-19 2015-04-22 浙江大学 C-program parallel region detecting method
US20150161028A1 (en) * 2013-12-09 2015-06-11 International Business Machines Corporation System and method for determining test coverage
US20180060580A1 (en) * 2016-09-01 2018-03-01 Cylance Inc. Training a machine learning model for container file analysis
US20180204002A1 (en) * 2017-01-18 2018-07-19 New York University Determining an aspect of behavior of an embedded device such as, for example, detecting unauthorized modifications of the code and/or behavior of an embedded device
CN108416219A (en) * 2018-03-18 2018-08-17 西安电子科技大学 A kind of Android binary files leak detection method and system
CN108647520A (en) * 2018-05-15 2018-10-12 浙江大学 A kind of intelligent fuzzy test method and system based on fragile inquiry learning
US20180365139A1 (en) * 2017-06-15 2018-12-20 Microsoft Technology Licensing, Llc Machine learning for constrained mutation-based fuzz testing
CN109308415A (en) * 2018-09-21 2019-02-05 四川大学 A binary-oriented fuzzing method and system
CN110008710A (en) * 2019-04-15 2019-07-12 上海交通大学 Vulnerability detection method based on deep reinforcement learning and program path instrumentation
US20200184070A1 (en) * 2018-12-06 2020-06-11 Nec Laboratories America, Inc. Confidential machine learning with program compartmentalization
CN111460472A (en) * 2020-03-20 2020-07-28 西北大学 Encryption algorithm identification method based on deep learning graph network
US10762200B1 (en) * 2019-05-20 2020-09-01 Sentinel Labs Israel Ltd. Systems and methods for executable code detection, automatic feature extraction and position independent code detection
CN111639344A (en) * 2020-07-31 2020-09-08 中国人民解放军国防科技大学 Vulnerability detection method and device based on neural network
CN111859388A (en) * 2020-06-30 2020-10-30 广州大学 A Multi-level Hybrid Vulnerability Automatic Mining Method
CN112328505A (en) * 2021-01-04 2021-02-05 中国人民解放军国防科技大学 A method and system for improving the coverage of fuzz testing
EP3812886A1 (en) * 2019-10-24 2021-04-28 Eberhard Karls Universität Tübingen System and method for optimising programming codes
US20210157906A1 (en) * 2019-11-27 2021-05-27 Data Security Technologies LLC Systems and methods for proactive and reactive data security
CN113360915A (en) * 2021-06-09 2021-09-07 扬州大学 Intelligent contract multi-vulnerability detection method and system based on source code graph representation learning
CN113672908A (en) * 2021-07-31 2021-11-19 荣耀终端有限公司 Fixed point pile inserting method, related device and system
CN114064506A (en) * 2021-11-29 2022-02-18 电子科技大学 Binary program fuzzy test method and system based on deep neural network
CN114168454A (en) * 2021-11-23 2022-03-11 叶嵩 Asynchronous testing method based on dynamic pile inserting-pile pinning technology
US20220107793A1 (en) * 2021-12-14 2022-04-07 Intel Corporation Concept for Placing an Execution of a Computer Program
US20220121429A1 (en) * 2020-10-20 2022-04-21 Battelle Energy Alliance, Llc Systems and methods for architecture-independent binary code analysis
CN114579969A (en) * 2022-05-05 2022-06-03 北京邮电大学 Vulnerability detection method and device, electronic equipment and storage medium
CN115129591A (en) * 2022-06-28 2022-09-30 山东大学 Binary code-oriented reproduction vulnerability detection method and system
CN115202736A (en) * 2022-06-14 2022-10-18 北京理工大学 A control flow-oriented cross-platform binary function representation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
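苏璞睿 (Su Purui): "Survey of Research on Automatic Exploitation of Software Vulnerabilities" *
赵尚儒, 李学俊 (Zhao Shangru, Li Xuejun): "Survey of Automatic Exploitation of Security Vulnerabilities" *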

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361182A (en) * 2023-04-03 2023-06-30 南京航空航天大学 Symbol execution method for error state guidance
CN116361182B (en) * 2023-04-03 2023-12-05 南京航空航天大学 Symbol execution method for error state guidance
CN116578979A (en) * 2023-05-15 2023-08-11 软安科技有限公司 Cross-platform binary code matching method and system based on code features
CN116578979B (en) * 2023-05-15 2024-05-31 软安科技有限公司 Cross-platform binary code matching method and system based on code features
CN118426872A (en) * 2024-05-27 2024-08-02 中国科学技术大学 A cross-platform tensor program performance prediction method
CN118426872B (en) * 2024-05-27 2025-03-25 中国科学技术大学 A cross-platform tensor program performance prediction method

Also Published As

Publication number Publication date
CN115576840B (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant