CN117034218A

CN117034218A - APK obfuscation mechanism detection method based on graph attention network and NLP word embedding model

Info

Publication number: CN117034218A
Application number: CN202310897727.3A
Authority: CN
Inventors: 冯鹏斌; 吕旋; 盖乐; 李腾; 习宁; 卢笛; 马建峰
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2023-07-20
Filing date: 2023-07-20
Publication date: 2023-11-10

Abstract

The application provides an APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model. The method acquires the apk to be detected and performs static analysis on it to obtain an initial ICFG; normalizes each Jimple instruction of the apk and obtains the vector corresponding to each instruction; sums and averages the instruction vectors dimension by dimension to obtain the vector of each basic block, and combines these vectors with the structural information of the ICFG to obtain the final code attribute graph; and feeds the basic-block feature matrix and adjacency matrix of the final code attribute graph to the trained obfuscation detection model to obtain the obfuscation type of the apk. Because the word2vec model converts Jimple code blocks into vectors, the semantic information of the Jimple code is captured as fully as possible, and because the GAT model learns the structural information of the Jimple code blocks, the method achieves higher accuracy.

Description

APK obfuscation mechanism detection method based on graph attention network and NLP word embedding model

Technical field

The invention belongs to the field of network security technology and specifically relates to an APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model.

Background

With the development of code obfuscation technology, more and more malware authors use obfuscation to modify the malicious code in their software, so that the code is hard for automated malicious-code detection tools to find while it retains its malicious functionality. The use of obfuscation in malicious code also makes it harder for analysts to reverse engineer malware. Detecting the code obfuscation method therefore helps to quickly determine how malicious code was obfuscated, which facilitates both the detection of malicious software and the reverse analysis performed by security reverse engineers.

An existing technical solution discloses a function-level obfuscation detection method based on a graph convolutional neural network (application number 202011521368.4). That solution detects function-level obfuscation mechanisms and does not adapt well to complex obfuscation mechanisms. Moreover, its feature extraction selects statistical features of the decompiled instructions and does not extract code semantic information, so its recognition accuracy is low.

Summary of the invention

To solve the above problems in the prior art, the present invention provides an APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model. The technical problem to be solved by the present invention is addressed by the following technical solution:

The APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model provided by the present invention includes:

S100, obtain the apk to be detected, and perform flow analysis on it to obtain its initial interprocedural control flow graph (ICFG); the initial ICFG includes the call graph (CG) of the functions and the Jimple code block within each function, and each Jimple code block includes multiple instructions;

S200, normalize each instruction of the apk to be detected, and convert the processed instructions into vectors;

S300, sum the vectors corresponding to the instructions dimension by dimension and average them to obtain the detection vector of each Jimple code block, and attach the detection vectors to the initial ICFG to obtain the final code attribute graph of the apk to be detected;

S400, pass the basic-block feature matrix and adjacency matrix of the final code attribute graph through the trained obfuscation detection model, which performs information propagation, overall information extraction, and loss processing, to obtain the obfuscation category of the apk to be detected.

The present invention provides an APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model. The apk to be detected is obtained and statically analyzed to obtain its initial interprocedural control flow graph (ICFG). Each instruction of the apk is converted into a vector, and each vector is normalized to obtain the vector corresponding to each instruction. The vectors corresponding to the instructions are summed dimension by dimension and averaged to obtain the detection vector of each Jimple code block, and the detection vectors are attached to the initial ICFG to obtain the final code attribute graph of the apk. The basic-block feature matrix and adjacency matrix of the final code attribute graph are passed through the trained obfuscation detection model, which performs information propagation, overall information extraction, and loss processing, to obtain the obfuscation category of the apk. By obtaining the Jimple code blocks of the apk and converting them into vectors with the word2vec model, the present invention captures the semantic information of the Jimple code as fully as possible; by additionally using a GAT to learn the structural information of the Jimple code blocks, it achieves higher accuracy than other neural network methods.

The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

Description of the drawings

Figure 1 is a schematic flow chart of the APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model provided by the present invention;

Figure 2 is a schematic diagram of an example of the APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model provided by the present invention;

Figure 3 is a schematic diagram of training the obfuscation detection model provided by the present invention.

Detailed description

The present invention is described in further detail below with reference to specific embodiments, but the implementation of the present invention is not limited to them.

With reference to Figures 1 and 2, the APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model provided by the present invention includes:

S100, obtain the apk to be detected, and perform static analysis on it to obtain its initial interprocedural control flow graph (ICFG); the initial ICFG includes the call graph (CG) of the functions and the Jimple code block within each function, and each Jimple code block includes multiple instructions;

S200 of the present invention includes:

S210, use a preset word2vec model to convert each instruction into a vector;

Since several NLP word embedding models exist, the word2vec model of the present invention can be replaced by models such as GloVe, FastText, or Doc2Vec.

S220, normalize each vector to remove the custom variable names in the Jimple code block, and obtain the code text of the Jimple code block;

Note that a Jimple code block is the code obtained after processing the apk with FlowDroid. word2vec is a natural language processing model: it is trained on input text, and after training it takes a Jimple instruction as input and outputs the corresponding vector.

After FlowDroid processing, the present invention obtains the call graph of an apk, which shows the calling relationships between functions. This yields a graph structure. Next, the vector of every node in the graph must be determined, that is, the code block corresponding to each method is converted into a vector. All of these vectors must have the same dimension; otherwise the model cannot be trained.

The present invention normalizes each vector to reduce the code differences introduced by user-defined variables. The normalization converts all custom variable names into one and the same variable name, so that the custom variable names are removed from the Jimple code and only the Jimple code itself remains. The processed code text is then fed to the word2vec model for training, which gives the vector corresponding to each Jimple instruction.

S230, feed the code text to the word2vec model for training to obtain the vector corresponding to each instruction; all vectors have the same dimension.
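The normalization described in S220 can be illustrated with a minimal sketch. It assumes that locals follow Soot's default naming (for example `r0`, `$r1`, `i2`); the regular expression, the `VAR` placeholder, and the function name are hypothetical choices for illustration and are not taken from the patent.

```python
import re

# Assumption: locals use Soot-style default names such as r0, $r1, i2, $z3.
# The lookbehind avoids matching the tail of a longer identifier or a
# qualified name (e.g. "label1" or "pkg.a1").
_LOCAL = re.compile(r"(?<![\w.$])\$?[a-z]{1,2}\d+\b")

PLACEHOLDER = "VAR"  # hypothetical shared variable name


def normalize_instruction(instruction):
    """Replace every local variable name with one shared placeholder."""
    return _LOCAL.sub(PLACEHOLDER, instruction.strip())


if __name__ == "__main__":
    stmt = "$r1 = virtualinvoke r0.<com.example.Foo: java.lang.String bar(int)>($i0)"
    print(normalize_instruction(stmt))
    # prints: VAR = virtualinvoke VAR.<com.example.Foo: java.lang.String bar(int)>(VAR)
```

Two instructions that differ only in their local names normalize to the same text, so a word embedding model trained on the normalized text would assign them the same vector. Code that keeps original variable names, or obfuscated class names that look like locals, would need a more careful pattern.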

S300 of the present invention includes:

S310, sum the vectors corresponding to the instructions dimension by dimension and average them to obtain an average vector;

S320, take the average vector as the vector of the Jimple code block.

The present invention uses word2vec for the conversion into vectors. Code blocks generally contain different numbers of instructions, so to obtain a uniform representation the present invention uses word2vec to convert every instruction in each code block into a vector. Processing a code block therefore yields several vectors, which are then averaged into a single vector. For example, suppose a method has two instructions whose vectors are [[1, 2, 3], [5, 6, 7]]; averaging gives ([1, 2, 3] + [5, 6, 7]) / 2 = [3, 4, 5]. Every code block is processed in the same way. After that, the vectors of all functions have the same dimension, and combining these vectors with the call graph gives a complete graph structure. The call graph obtained earlier has no vector value for its nodes; once the vector values are computed and combined with the call graph, the complete graph structure can be used for neural network training or recognition.

After obtaining the vectors of the Jimple instructions, the present invention sums and averages the vectors within each code block to obtain the vector of each code block.
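The dimension-wise averaging of S310 and S320 can be sketched in a few lines. The function name `average_vectors` is a hypothetical helper; the input values reproduce the two-instruction example from the description.

```python
def average_vectors(vectors):
    """Average equal-length instruction vectors dimension by dimension."""
    if not vectors:
        raise ValueError("a code block needs at least one instruction vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all instruction vectors must have the same dimension")
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(dim)]


if __name__ == "__main__":
    print(average_vectors([[1, 2, 3], [5, 6, 7]]))  # prints: [3.0, 4.0, 5.0]
```

Because the result always has the dimension of the instruction vectors, code blocks with different instruction counts end up with vectors of identical length.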

S400 of the present invention includes:

S410, determine the basic-block feature matrix and adjacency matrix of the final code attribute graph;

S420, propagate the basic-block feature matrix and adjacency matrix through each layer of the trained obfuscation detection model, performing dimensionality reduction, concatenation, overall information extraction, and loss processing along the way, to obtain the obfuscation category of the apk to be detected.
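As a rough illustration of S410, the sketch below assembles a feature matrix and an adjacency matrix from per-node vectors and a list of edges. The node names, the directed-edge convention, and the helper `build_graph_matrices` are assumptions made for this example; the patent does not specify edge direction or node ordering.

```python
def build_graph_matrices(node_vectors, edges):
    """Return (nodes, X, A) for a graph.

    node_vectors: dict mapping node name -> feature vector (equal lengths).
    edges: iterable of (source, target) pairs, treated as directed (assumption).
    """
    nodes = sorted(node_vectors)  # fixed, reproducible node order
    index = {name: i for i, name in enumerate(nodes)}
    X = [list(node_vectors[name]) for name in nodes]  # N x M feature matrix
    n = len(nodes)
    A = [[0] * n for _ in range(n)]  # N x N adjacency matrix
    for source, target in edges:
        A[index[source]][index[target]] = 1
    return nodes, X, A


if __name__ == "__main__":
    vectors = {"main": [3.0, 4.0, 5.0], "helper": [1.0, 0.0, 2.0], "util": [0.0, 1.0, 1.0]}
    calls = [("main", "helper"), ("helper", "util")]
    nodes, X, A = build_graph_matrices(vectors, calls)
    print(nodes)  # prints: ['helper', 'main', 'util']
    print(A)      # prints: [[0, 0, 1], [1, 0, 0], [0, 0, 0]]
```

Row i of `X` and row i of `A` refer to the same node, which is the pairing a graph neural network layer relies on.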

S420 of the present invention includes:

S421, according to the attention score of each layer of the trained obfuscation detection model, propagate the basic-block feature matrix and adjacency matrix through each layer of the model, performing dimensionality reduction, concatenation, overall information extraction, and loss processing along the way, to obtain the probability distribution over the obfuscation categories of the apk to be detected;

S422, take each obfuscation category whose probability exceeds a preset probability threshold as an obfuscation category of the apk to be detected.
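The thresholding in S422 amounts to a simple filter over the predicted distribution. In this sketch the category names, the probabilities, and the default threshold of 0.5 are made-up illustrative values; the patent does not state the threshold.

```python
def categories_above_threshold(probabilities, threshold=0.5):
    """Return the categories whose probability is strictly above the threshold."""
    return [name for name, p in probabilities.items() if p > threshold]


if __name__ == "__main__":
    predicted = {"rename": 0.7, "encryption": 0.2, "reflection": 0.1}
    print(categories_above_threshold(predicted))        # prints: ['rename']
    print(categories_above_threshold(predicted, 0.15))  # prints: ['rename', 'encryption']
```

Since the probabilities of a distribution sum to 1, at most one category can exceed a threshold of 0.5; a lower threshold allows several categories to be reported at once.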

The basic-block feature matrix of the final code attribute graph is denoted X ∈ R^(N×M) and the adjacency matrix is denoted A ∈ R^(N×N), where N is the total number of nodes and M is the dimension of each node vector. The present invention uses a five-layer graph attention network (GAT), which propagates node features layer by layer.

In the propagation rule, h_i′ is the vector of the i-th node at the next layer. a_ij is the attention cross-correlation score, which represents the contribution of node j to node i. j ranges over the other nodes adjacent to node i, and N_i denotes the set of those adjacent nodes.

In the computation of a_ij, W ∈ R^(M′×M) is a trainable matrix whose main role is to transform the vector h_j and reduce its dimension. ⊕ denotes the concatenation of the vectors of two nodes, and LeakyReLU is a nonlinear activation function.
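The displayed formulas for the propagation rule and the attention score are not present in this text. For orientation, the standard single-head GAT formulation that matches the symbols defined here is sketched below. The attention vector \mathbf{a}, the outer nonlinearity \sigma, and the intermediate score e_{ij} come from that standard formulation and are assumptions, not symbols confirmed by the patent.

```latex
% Assumed standard GAT update (single attention head)
h_i' = \sigma\Big( \sum_{j \in \mathcal{N}_i} a_{ij}\, W h_j \Big)

% Assumed attention score: LeakyReLU over the concatenated, transformed
% vectors, normalized with softmax over the neighbors of node i
e_{ij} = \mathrm{LeakyReLU}\big( \mathbf{a}^{\top} [\, W h_i \oplus W h_j \,] \big),
\qquad
a_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}
```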

Summing the embeddings of all nodes in a CFG gives the embedding g of the entire graph, that is, g = Σ_i h_i over all N nodes.
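To make the attention step and the sum readout concrete, here is a minimal pure-Python sketch of one single-head graph attention layer followed by the graph embedding. It assumes the standard GAT formulation with a learnable attention vector `a`, adds a self-loop for every node, and omits the output nonlinearity; these choices and all function names are illustrative assumptions rather than the patented implementation.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (M' x M) by vector h (length M)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(X, A, W, a):
    """One attention layer: X is N x M, A is N x N, W is M' x M, a has length 2*M'."""
    Wh = [matvec(W, h) for h in X]
    n = len(X)
    out = []
    for i in range(n):
        # Neighbors of node i; the self-loop is an assumption of this sketch.
        neighbors = [j for j in range(n) if A[i][j] or j == i]
        # Score each neighbor from the concatenation [W h_i (+) W h_j].
        scores = [
            leaky_relu(sum(p * q for p, q in zip(a, Wh[i] + Wh[j])))
            for j in neighbors
        ]
        # Softmax over the neighborhood (shifted by the max for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of the transformed neighbor vectors.
        out.append([
            sum(alpha * Wh[j][d] for alpha, j in zip(alphas, neighbors))
            for d in range(len(W))
        ])
    return out


def graph_embedding(H):
    """Sum readout: add up all node embeddings dimension by dimension."""
    return [sum(column) for column in zip(*H)]


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    A = [[0, 1], [1, 0]]
    W = [[1.0, 0.0], [0.0, 1.0]]  # identity, so no dimensionality reduction here
    a = [0.0, 0.0, 0.0, 0.0]      # zero vector gives uniform attention
    H = gat_layer(X, A, W, a)
    print(H)                   # prints: [[0.5, 0.5], [0.5, 0.5]]
    print(graph_embedding(H))  # prints: [1.0, 1.0]
```

With a zero attention vector every neighbor receives equal weight, which makes the output easy to check by hand; a trained model would learn `W` and `a`, and a full model would stack several such layers.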

An LSTM processes the graph embedding to further extract overall information, and Dropout is added. Dropout is a commonly used regularization method that probabilistically disconnects some neurons during forward propagation and weight updates, which effectively mitigates overfitting. The output of the LSTM layer is s ∈ R^k, where k is the number of obfuscation technique categories to be detected:

s = Dropout(LSTM(g)) (7);

The output of the LSTM is processed by a softmax layer, whose output is y ∈ R^k.

y = softmax(s) (8);

In the formula, y is the probability distribution over all categories; the classification result for the obfuscation technique used by the function is obtained from y.
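Equation (8) can be illustrated with a small, numerically stable softmax. The helper `predict_category`, the label names, and the scores are hypothetical values used only for this example.

```python
import math


def softmax(scores):
    """Convert k raw scores into a probability distribution."""
    top = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(scores, labels):
    """Return the most probable label together with the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


if __name__ == "__main__":
    label, probs = predict_category([0.1, 2.0, -1.0], ["rename", "encryption", "reflection"])
    print(label)  # prints: encryption
```

The entries of the returned distribution are positive and sum to 1, so they can be compared directly against a probability threshold.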

Referring to Figure 2, the training process of the trained obfuscation detection model includes:

S010, obtain a collection of obfuscated apks from a database;

The apk collection includes: a trivial obfuscation set, a renaming obfuscation set, an encryption obfuscation set, a reflection obfuscation set, a code obfuscation set, and a set obfuscated with a mixture of several methods.

Trivial obfuscation: as the name suggests, this category includes simple operations that do not really obfuscate the apk but can fool some signature-based malware detectors. Examples include signing the apk file with a new signature, and disassembling and reassembling the bytecode contained in the classes.dex file to obtain a different version of the file.

Renaming obfuscation: in software development, identifier names (variable names, function names, and so on) should be meaningful so that the code is readable and maintainable. However, such explicit names may leak information about the functionality of the code. In addition, since the package name uniquely identifies an Android application, modifying it is equivalent to installing a new application on the system. Renaming techniques therefore replace every identifier with an obscure, meaningless one. Renaming methods and fields has no effect on the normal operation of the program, but renaming classes and packages is more complicated, because AndroidManifest.xml must be updated accordingly for the software to run properly.

Encryption obfuscation: some resources needed by the APK file can be requested at runtime, including native libraries and string-type text resources. These files can be encrypted and then decrypted at runtime. In this case a malware analyst must first obtain the encryption key before the resource files can be accessed, but this obfuscation method reduces the runtime efficiency of the software.

Reflection obfuscation: reflection is a feature of the Java language that allows a class to be inspected and modified at runtime; it is commonly used to invoke a method of a given object. The basic principle of this technique is to find the calls in the APK's executable code that target the APK's own classes rather than Android system classes, and then check whether those calls can be made through the reflection API. This hides the calling relationships within the APK without affecting the normal operation of the code.

Code obfuscation: this class of techniques mainly affects the content of the classes.dex file. Specific methods include removing information that is useful for debugging, inserting meaningless arithmetic branches, and adding new methods to implement indirect calls.

Mixed obfuscation: a combination of two or more of the above obfuscation methods.
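For training, the six categories listed above need numeric labels. The sketch below shows one possible encoding; the enum name, the member names, and the index order are arbitrary assumptions, since the patent does not define a label encoding.

```python
from enum import IntEnum


class ObfuscationCategory(IntEnum):
    """Hypothetical label indices for the six apk sets described above."""
    TRIVIAL = 0
    RENAMING = 1
    ENCRYPTION = 2
    REFLECTION = 3
    CODE = 4
    MIXED = 5


def one_hot(category):
    """Encode a category as a vector of length k (k = 6 here)."""
    return [1.0 if i == category else 0.0 for i in range(len(ObfuscationCategory))]


if __name__ == "__main__":
    print(one_hot(ObfuscationCategory.REFLECTION))
    # prints: [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

The length of the one-hot vector equals k, the number of categories, which matches the size of the model's output distribution.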

S020, perform static analysis on each apk in the apk collection to obtain the interprocedural control flow graph (ICFG) of each apk;

The ICFG includes the call graph (CG) of the functions and the Jimple code block within each function, and each Jimple code block includes multiple instructions.

The present invention uses the widely adopted open-source software FlowDroid. FlowDroid is a static taint analysis tool for Android applications. It uses IFDS-based flow functions to improve sensitivity, fully models the Android lifecycle and callbacks, and generates a special main method to ensure that no available information is lost; the tool offers high analysis precision and reliability. The call graph and the Jimple code of each function are obtained with FlowDroid to construct the control flow graph.

S030, construct the graph data structure corresponding to each apk from the ICFG of that apk;

The present invention takes the Jimple code of each function, converts it into a vector with the word2vec model, and then combines it with the CG of the functions to build the corresponding graph data structure.

S040, build a GAT neural network model from the graph data structure corresponding to each apk and a preset GAT neural network structure;

S050, use the GAT neural network model as the obfuscation detection model;

S060, train the obfuscation detection model to obtain the trained obfuscation detection model.

By obtaining the Jimple code blocks of the apk and using the word2vec model, the present invention captures the semantic information of the Jimple code as fully as possible; by additionally using a GAT to learn the structural information of the Jimple code, it achieves higher accuracy than other neural network methods. The present invention uses FlowDroid directly to extract the Jimple code of the AndroOBFS Dataset and thereby obtain the data set used to train the graph neural network. Compared with some obfuscation recognition techniques, the present invention skips the steps of obfuscating and labeling open-source code, so the data set is obtained more conveniently and quickly.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. In the description of the present invention, "multiple" means two or more, unless otherwise explicitly and specifically limited.

Although the present application has been described herein in connection with various embodiments, those skilled in the art, in practicing the claimed application, can understand and implement other variations of the disclosed embodiments by studying the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality.

The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the present invention is not to be regarded as limited to these descriptions. For those of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions can be made without departing from the concept of the present invention, and all of them shall be regarded as falling within the protection scope of the present invention.

Claims

1. An APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model, characterized by comprising:

S100, obtain the apk to be detected, and perform static analysis on it to obtain its initial interprocedural control flow graph (ICFG); the initial ICFG includes the call graph (CG) of the functions and the Jimple code block within each function, and each Jimple code block includes multiple instructions;

S200, normalize each instruction of the apk to be detected, and convert the processed instructions into vectors;

S300, sum the vectors corresponding to the instructions dimension by dimension and average them to obtain the detection vector of each Jimple code block, and attach the detection vectors to the initial ICFG to obtain the final code attribute graph of the apk to be detected;

S400, pass the basic-block feature matrix and adjacency matrix of the final code attribute graph through the trained obfuscation detection model, which performs information propagation, overall information extraction, and loss processing, to obtain the obfuscation category of the apk to be detected.

2. The APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model according to claim 1, characterized in that the training process of the trained obfuscation detection model includes:

S010, obtain a collection of obfuscated apks from a database;

S020, perform static analysis on each apk in the apk collection to obtain the interprocedural control flow graph (ICFG) of each apk;

S030, construct the graph data structure corresponding to each apk from the ICFG of that apk;

S040, build a GAT neural network model from the graph data structure corresponding to each apk and a preset GAT neural network structure;

S050, use the GAT neural network model as the obfuscation detection model;

S060, train the obfuscation detection model to obtain the trained obfuscation detection model.

3. The APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model according to claim 2, characterized in that the apk collection in S010 includes: a trivial obfuscation set, a renaming obfuscation set, an encryption obfuscation set, a reflection obfuscation set, a code obfuscation set, and a set obfuscated with a mixture of several methods.

4. The APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model according to claim 2, characterized in that the ICFG in S020 includes the call graph (CG) of the functions and the Jimple code block within each function, and each Jimple code block includes multiple instructions.

5. The APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model according to claim 1, characterized in that S200 includes:

S210, use a preset word2vec model to convert each instruction into a vector;

S220, normalize each vector to remove the custom variable names in the Jimple code block, and obtain the code text of the Jimple code block;

S230, feed the code text to the word2vec model for training to obtain the vector corresponding to each instruction; all vectors have the same dimension.

6. The APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model according to claim 1, characterized in that S300 includes:

S310, sum the vectors corresponding to the instructions dimension by dimension and average them to obtain an average vector;

S320, take the average vector as the vector of the Jimple code block.

7. The APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model according to claim 1, characterized in that S400 includes:

S410, determine the basic-block feature matrix and adjacency matrix of the final code attribute graph;

S420, propagate the basic-block feature matrix and adjacency matrix through each layer of the trained obfuscation detection model, performing dimensionality reduction, concatenation, overall information extraction, and loss processing along the way, to obtain the obfuscation category of the apk to be detected.

8. The APK obfuscation mechanism detection method based on a graph attention network and an NLP word embedding model according to claim 7, characterized in that S420 includes:

S421, according to the attention score of each layer, propagate the basic-block feature matrix and adjacency matrix through each layer of the trained obfuscation detection model, performing dimensionality reduction, concatenation, overall information extraction, and loss processing along the way, to obtain the probability distribution over the obfuscation categories of the apk to be detected;

S422, take each obfuscation category whose probability exceeds a preset probability threshold as an obfuscation category of the apk to be detected.