CN115391778A

CN115391778A - Android malware detection method and device based on heterogeneous graph attention network

Info

Publication number: CN115391778A
Application number: CN202210983464.3A
Authority: CN
Inventors: 凌捷; 殷丹丽; 罗玉
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2022-08-16
Filing date: 2022-08-16
Publication date: 2022-11-25

Abstract

The present invention provides a method for detecting Android malware based on a heterogeneous graph attention network, comprising the following steps: S1: download apps and label them; S2: decompile the APK and extract multiple kinds of key feature entities; S3: construct a heterogeneous graph attention network, convert it into multiple meta-structures, and compute the adjacency matrix of each meta-structure; S4: obtain low-dimensional vector embeddings; S5: train a logistic regression model and obtain the node embedding of the Android application to be detected; S6: obtain the detection result. The invention also provides an Android malware detection device based on a heterogeneous graph attention network, which implements the above method. The method and device address the problem that existing malware detection techniques cannot effectively classify and detect malicious Android applications.

Description

Android malware detection method and device based on heterogeneous graph attention network

Technical Field

The present invention relates to the technical field of information security, and more specifically to a method and device for detecting Android malware based on a heterogeneous graph attention network.

Background

Driven by the rapid development of Internet services, mobile applications have entered every aspect of daily life, such as communication, finance, travel and entertainment. Android is currently the largest operating system platform in the global smartphone market. The extensibility and openness of the Android platform expose users to threats and attacks from various malicious programs, including privacy violations, data leaks, spam advertisements, and transaction and payment operations that endanger users' personal property. Research on methods for identifying and detecting Android malware therefore has significant practical value.

Traditional Android malware detection includes static analysis of the APK file: the manifest, code files and resource files are characterized as features, and similarity comparison is then used to decide whether the application is malicious. However, simple obfuscation can defeat this approach, so it may fail to identify malicious applications that use code obfuscation techniques or exploit Android vulnerabilities. Dynamic analysis instead runs the program code to collect system information, including system calls, API calls and network information, builds a feature library from it, and identifies malware by similarity comparison; its drawback is a heavy dependence on the operating system version and on how long the program runs. To address these problems, current detection techniques based on machine learning extract key features and apply classification algorithms to distinguish malicious from benign applications. However, these methods do not take into account the rich semantic information between nodes and cannot detect malicious applications that disguise or hide their features.

Therefore, existing malware detection techniques cannot effectively classify and detect malicious Android applications.

Summary of the Invention

To overcome the technical defect that existing malware detection techniques cannot effectively classify and detect malicious Android applications, the present invention provides a method and device for detecting Android malware based on a heterogeneous graph attention network.

To solve the above technical problem, the technical solution of the present invention is as follows:

A method for detecting Android malware based on a heterogeneous graph attention network, comprising the following steps:

S1: Download Android applications (apps) and label them to obtain a set of Android applications, where the set includes benign Android applications and malicious Android applications;

S2: Decompile the installation package (APK) of each Android application and extract multiple kinds of key feature entities from the decompiled files;

S3: Construct a heterogeneous graph attention network from the relationships between the Android applications and the key feature entities, convert the heterogeneous graph attention network into multiple meta-structures, and compute the adjacency matrix of each meta-structure;

S4: Obtain low-dimensional vector embeddings of the existing nodes from the adjacency matrices of the meta-structures;

S5: Train a logistic regression model using the low-dimensional vector embeddings and labels of the existing nodes to obtain a trained logistic regression model, and obtain the node embedding of the Android application to be detected;

S6: Input the node embedding of the Android application to be detected into the trained logistic regression model to obtain a detection result indicating whether that application is malicious or benign.

In the above scheme, multiple kinds of key feature entities are first extracted by decompiling the APK. A heterogeneous graph attention network is constructed from the relationships between the Android applications and the key feature entities and converted into multiple meta-structures. The low-dimensional vector embeddings of the existing nodes are then obtained from the adjacency matrices of the meta-structures, and a logistic regression model is trained with these embeddings and the labels. Finally, the node embedding of the Android application to be detected is obtained and input into the trained logistic regression model, which yields a detection result indicating whether that application is malicious or benign.

Preferably, the key feature entities include APIs, permissions, permission types, classes, interfaces and .so files.

Preferably, in-graph relation matrices $Rl_{in}$, $l \in [1,6]$, are formed from the relationships between the Android applications and the key feature entities, where $R1_{in}$ represents the relationship between apps and APIs, $R2_{in}$ represents the relationship between apps and permissions, $R3_{in}$ represents the permission types to which an app belongs, $R4_{in}$ represents the relationship between apps and classes, $R5_{in}$ represents the relationship between apps and interfaces, and $R6_{in}$ represents the relationship between apps and .so files.

Preferably, the heterogeneous graph attention network is a graph G = (V, E, A, R). Its node types include app, API, permission, permission type, class, interface and .so file, and its edge types include $R1_{in}$, $R2_{in}$, $R3_{in}$, $R4_{in}$, $R5_{in}$ and $R6_{in}$, where V denotes the set of nodes, E denotes the set of edges, A denotes the set of node types, R denotes the set of edge types, and |A| + |R| > 2.

Preferably, a meta-structure is a meta-path or a meta-graph. A meta-path is a path defined on the heterogeneous graph attention network with the source object and the target object at its two ends; if there are multiple meta-paths between the source object and the target object, they form a meta-graph.

Preferably, the adjacency matrices of K meta-structures form the adjacency matrix set $\{\Psi^{M_1}, \ldots, \Psi^{M_k}, \ldots, \Psi^{M_K}\}$, where the adjacency matrix of a meta-structure is either the adjacency matrix of a meta-path or the adjacency matrix of a meta-graph.

The adjacency matrix of a meta-path is computed as:

$$\Psi^{MP} = R_{A_1 A_2} \cdot \ldots \cdot R_{A_i A_{i+1}} \cdot \ldots \cdot R_{A_{n-1} A_n}$$

The adjacency matrix of a meta-graph is computed as:

$$\Psi^{MG} = \Psi^{MP_1} \odot \ldots \odot \Psi^{MP_j} \odot \ldots \odot \Psi^{MP_m}$$

where $\Psi^{M_k}$ denotes the adjacency matrix of the k-th meta-structure, $R_{A_i A_{i+1}}$ denotes the relation matrix between the i-th node and the (i+1)-th node, i = 1, 2, ..., n, n denotes the number of nodes in the meta-path, $\Psi^{MP_j}$ denotes the j-th $\Psi^{MP}$, $\odot$ denotes the Hadamard product, and m denotes the number of $\Psi^{MP}$.
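The two formulas above can be illustrated with a minimal Python sketch. The matrices `r_app_api` and `r_app_perm` are made-up toy data, and using the transposed relation matrix for the return leg of an App-API-App or App-Permission-App meta-path is an assumption of this sketch, not something the text specifies.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("shapes do not match")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def metapath_adjacency(relations):
    """Multiply the relation matrices along a meta-path, left to right."""
    result = relations[0]
    for r in relations[1:]:
        result = matmul(result, r)
    return result


def metagraph_adjacency(metapath_matrices):
    """Combine meta-path adjacency matrices with the Hadamard product."""
    result = metapath_matrices[0]
    for p in metapath_matrices[1:]:
        result = hadamard(result, p)
    return result


# Toy data (hypothetical): 3 apps x 3 APIs and 3 apps x 2 permissions.
r_app_api = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
r_app_perm = [[1, 0], [1, 1], [0, 1]]

# Meta-path App-API-App and meta-path App-Permission-App.
psi_api = metapath_adjacency([r_app_api, transpose(r_app_api)])
psi_perm = metapath_adjacency([r_app_perm, transpose(r_app_perm)])

# Meta-graph built from both meta-paths.
psi_mg = metagraph_adjacency([psi_api, psi_perm])
```

In this toy case an entry of `psi_api` counts the APIs two apps share, and the Hadamard product keeps a nonzero entry only where both meta-paths connect the same pair of apps.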

Preferably, step S4 includes the following steps:

S41: Encode each node as a one-hot vector to obtain a matrix H, combine H with the adjacency matrix of a given meta-structure $M_k$, and obtain the adjacency matrix of the nodes inside the meta-structure through a normalization operation:

$$\Psi^{M_k'} = \mathrm{Normalize}(H \cdot H^{T} \odot \Psi^{M_k})$$

An edge-weight-aware GAT model is then introduced to update the embeddings of the nodes inside meta-structure $M_k$: $\Phi^{M_k} = \mathrm{GAT}(H; \Psi^{M_k'})$;

S42: Use a multi-layer perceptron to learn the weight $\beta^{M_k}$ of each meta-structure $M_k$ in the fusion:

$$(\beta^{M_1}, \ldots, \beta^{M_k}, \ldots, \beta^{M_K}) = \mathrm{softmax}(\mathrm{NN}(\Phi^{M_1}), \ldots, \mathrm{NN}(\Phi^{M_k}), \ldots, \mathrm{NN}(\Phi^{M_K}))$$

where NN is a plain neural network that maps a given matrix to a numerical value,

thereby obtaining the low-dimensional vector embedding of the existing nodes:

[formula not reproduced in this text]
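The fusion in step S42 can be sketched in Python as follows. The final embedding formula is not reproduced in this text, so the weighted sum of the per-meta-structure embeddings used in `fuse_embeddings` is an assumption (a common way to apply softmax weights), and the scalar `scores` stand in for the outputs of the NN, which is not implemented here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_embeddings(embeddings, scores):
    """Fuse K node-embedding matrices (each N x d) into one N x d matrix.

    scores[k] is a hypothetical stand-in for NN(Phi^{M_k}); the weighted
    sum is an assumed reading of the fusion, not a confirmed formula.
    """
    betas = softmax(scores)
    n, d = len(embeddings[0]), len(embeddings[0][0])
    fused = [[0.0] * d for _ in range(n)]
    for beta, phi in zip(betas, embeddings):
        for i in range(n):
            for c in range(d):
                fused[i][c] += beta * phi[i][c]
    return betas, fused


# Toy data (hypothetical): two meta-structures, two nodes, 2-D embeddings.
phi_m1 = [[1.0, 0.0], [0.0, 2.0]]
phi_m2 = [[3.0, 2.0], [2.0, 0.0]]
betas, fused = fuse_embeddings([phi_m1, phi_m2], scores=[0.0, 0.0])
```

With equal scores the two meta-structures receive equal weight, so the fused embedding is the plain average of the two inputs.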

Preferably, in step S5, the node embedding of the Android application to be detected is obtained through the following steps:

S51: Form out-of-graph relation matrices $Rl_{out}$, $l \in [1,6]$, from the relationships between the Android application to be detected and the key feature entities;

S52: Form the incremental segment of the node adjacency matrix [symbol not reproduced in this text], a column matrix with j rows, where j denotes the number of in-graph nodes and the value in the j-th row represents the number of meta-structures between the new node and the in-graph node $v_j$;

S53: Use the top-k algorithm to sort the entries of the incremental segment, select the t in-graph nodes with the largest values as in-graph neighbor nodes $v_s$, s = 1, 2, ..., t, and aggregate the vectors of the new node and its in-graph neighbor nodes to obtain the node embedding of the Android application to be detected:

[formula not reproduced in this text]

where the two symbols in the formula denote, respectively, the weight of $v_s$ on the meta-path $M_k$ and the number of meta-structures between the new node and the in-graph neighbor node $v_s$.
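The neighbor selection in S53 can be sketched in Python as below. The selection follows the text (pick the t in-graph nodes with the largest meta-structure counts). The aggregation formula itself is not reproduced in this text, so `aggregate_by_count` is purely a hypothetical stand-in that weights each neighbor's embedding by its normalized count; it omits the per-meta-path weight the text mentions. All names and data are illustrative.

```python
import heapq


def select_neighbors(increment, t):
    """Return the indices of the t in-graph nodes with the largest counts.

    increment[j] is the number of meta-structures between the new node and
    in-graph node j. Ties are broken in favor of the lower index.
    """
    return heapq.nlargest(t, range(len(increment)),
                          key=lambda j: (increment[j], -j))


def aggregate_by_count(neighbor_ids, increment, embeddings):
    """Hypothetical aggregation: count-weighted average of neighbor embeddings."""
    total = sum(increment[j] for j in neighbor_ids)
    if total == 0:
        raise ValueError("new node shares no meta-structure with its neighbors")
    dim = len(embeddings[0])
    out = [0.0] * dim
    for j in neighbor_ids:
        weight = increment[j] / total
        for c in range(dim):
            out[c] += weight * embeddings[j][c]
    return out


# Toy data (hypothetical): counts for 5 in-graph nodes and their 2-D embeddings.
increment = [0, 3, 1, 3, 2]
embeddings = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 1.0], [1.0, 1.0]]

neighbors = select_neighbors(increment, t=3)
new_embedding = aggregate_by_count(neighbors, increment, embeddings)
```

Using `heapq.nlargest` avoids sorting the whole column when t is much smaller than the number of in-graph nodes.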

Preferably, the predicted value output by the logistic regression model is:

[formula not reproduced in this text]

where b denotes the bias parameter, w denotes the weight, and the remaining symbol denotes the node embedding of the Android application to be detected;

when the predicted value a output by the logistic regression model is greater than 0.5, the detection result is malicious; otherwise, the detection result is benign.
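The decision rule can be sketched in Python as follows. The prediction formula is not reproduced in this text; the sketch assumes the standard logistic regression form, a sigmoid applied to the weighted sum of the embedding plus the bias. The weights, bias and embeddings below are made-up values.

```python
import math


def predict_probability(w, x, b):
    """Standard logistic regression output: sigmoid(w . x + b) (assumed form)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Branch on the sign of z so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(w, x, b, threshold=0.5):
    """Return 'malicious' if the predicted value exceeds the threshold, else 'benign'."""
    return "malicious" if predict_probability(w, x, b) > threshold else "benign"


# Hypothetical trained parameters and two node embeddings.
w = [1.0, -2.0]
b = 0.0
result_a = classify(w, [2.0, 0.5], b)   # w . x + b = 1.0
result_b = classify(w, [0.0, 1.0], b)   # w . x + b = -2.0
```

A predicted value of exactly 0.5 falls on the benign side, matching the "greater than 0.5" wording in the text.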

An Android malware detection device based on a heterogeneous graph attention network, used to implement the above Android malware detection method based on a heterogeneous graph attention network, comprising:

a feature engineering module, used to label the apps, decompile the APKs, and extract the key feature entities;

a graph construction module, used to construct the heterogeneous graph attention network in the form of nodes and edges from the relationships between the Android applications and the key feature entities, and also used to convert the heterogeneous graph attention network into multiple meta-structures and compute the adjacency matrix of each meta-structure;

a node aggregation module, used to obtain the node embeddings of Android applications and to obtain the low-dimensional vector embeddings of nodes from the adjacency matrices of the meta-structures;

a detection module, used to learn the classification from the low-dimensional vector embeddings and labels of the nodes, to perform detection based on the node embedding of the Android application to be detected, and to output a detection result indicating whether that application is malicious or benign.

Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows:

The present invention provides a method and device for detecting Android malware based on a heterogeneous graph attention network. Multiple kinds of key feature entities are first extracted by decompiling the APK. A heterogeneous graph attention network is constructed from the relationships between the Android applications and the key feature entities and converted into multiple meta-structures. The low-dimensional vector embeddings of the existing nodes are then obtained from the adjacency matrices of the meta-structures, and a logistic regression model is trained with these embeddings and the labels. Finally, the node embedding of the Android application to be detected is obtained and input into the trained logistic regression model, which yields a detection result indicating whether that application is malicious or benign.

Description of the Drawings

Fig. 1 is a flowchart of the implementation steps of the technical solution of the present invention;

Fig. 2 is a schematic structural diagram of the heterogeneous graph attention network in the present invention;

Fig. 3 is a schematic structural diagram of a meta-path in the present invention;

Fig. 4 is a schematic structural diagram of a meta-graph in the present invention.

Detailed Description

The accompanying drawings are for illustrative purposes only and shall not be construed as limiting this patent;

to better illustrate the embodiments, some parts in the drawings may be omitted, enlarged or reduced, and do not represent the dimensions of the actual product;

those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.

Example 1

As shown in Fig. 1, an Android malware detection method based on a heterogeneous graph attention network comprises steps S1 to S6 as set out in the summary above.

In a specific implementation, multiple kinds of key feature entities are first extracted by decompiling the APK. A heterogeneous graph attention network is constructed from the relationships between the Android applications and the key feature entities and converted into multiple meta-structures. The low-dimensional vector embeddings of the existing nodes are then obtained from the adjacency matrices of the meta-structures, and a logistic regression model is trained with these embeddings and the labels. Finally, the node embedding of the Android application to be detected is obtained and input into the trained logistic regression model, which yields a detection result indicating whether that application is malicious or benign.

Example 2

S1: Download Android applications (apps) and label them to obtain a set of Android applications, where the set includes benign Android applications and malicious Android applications; benign applications are labeled 0 and malicious applications are labeled 1; benign applications are downloaded from the Google Play store and malicious applications are downloaded from virusshare.com;

S2: Use the decompilation tool apktool to decompile the installation package (APK) of each Android application and extract multiple kinds of key feature entities from the decompiled files;

More specifically, the key feature entities include APIs, permissions, permission types, classes, interfaces and .so files.

As shown in Figs. 2-4, A denotes an API (application programming interface); P denotes a permission, which specifies the operations an application may perform; T denotes a permission type; C denotes a class, which abstracts common attributes and behaviors into a relatively complex data type; I denotes an interface, an abstract data structure used to define a specification; S denotes a .so file, which is an Android dynamic link library. A, C and I come from the decompiled smali files, P and T come from the decompiled AndroidManifest.xml file, and S comes from the decompiled lib directory.
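As a rough illustration of where two of these entity types come from, the following Python sketch reads permission names from a decoded AndroidManifest.xml and collects .so file names from paths under lib/. It assumes the APK has already been decompiled by apktool; the manifest text, package name and file paths are made-up examples, and the helper names are hypothetical rather than part of the described method. Extraction of APIs, classes and interfaces from smali files is not shown.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# XML namespace used for android:* attributes in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def extract_permissions(manifest_xml: str) -> list[str]:
    """Return the names declared in <uses-permission> elements, in order."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    permissions: list[str] = []
    for elem in root.iter("uses-permission"):
        name = elem.get(name_attr)
        if name and name not in permissions:
            permissions.append(name)
    return permissions


def extract_so_files(relative_paths: list[str]) -> list[str]:
    """Return the sorted, de-duplicated .so file names found under lib/."""
    names = set()
    for raw in relative_paths:
        path = PurePosixPath(raw)
        if path.parts and path.parts[0] == "lib" and path.suffix == ".so":
            names.add(path.name)
    return sorted(names)


# Made-up decoded manifest and file listing for one decompiled app.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
</manifest>"""

FILES = [
    "lib/armeabi-v7a/libnative.so",
    "lib/arm64-v8a/libnative.so",
    "smali/com/example/demo/Main.smali",
]

permissions = extract_permissions(MANIFEST)
so_files = extract_so_files(FILES)
```

The same library built for several ABIs appears once, since the entity of interest here is the library name rather than the per-architecture file.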

More specifically, in-graph relation matrices $Rl_{in}$, $l \in [1,6]$, are formed from the relationships between the Android applications and the key feature entities, where $R1_{in}$ represents the relationship between apps and APIs, $R2_{in}$ represents the relationship between apps and permissions, $R3_{in}$ represents the permission types to which an app belongs, $R4_{in}$ represents the relationship between apps and classes, $R5_{in}$ represents the relationship between apps and interfaces, and $R6_{in}$ represents the relationship between apps and .so files.

In a specific implementation, every entry of these matrices is 0 or 1. For $R1_{in}$, $a_{ij} \in \{0,1\}$ indicates whether $App_i$ contains $API_j$: $a_{ij} = 1$ if so, otherwise $a_{ij} = 0$. For $R2_{in}$, $P_{ij} \in \{0,1\}$ indicates whether $App_i$ contains permission j: $P_{ij} = 1$ if so, otherwise $P_{ij} = 0$. For $R3_{in}$, $T_{ij} \in \{0,1\}$ indicates whether permission i belongs to type j: $T_{ij} = 1$ if so, otherwise $T_{ij} = 0$. For $R4_{in}$, $C_{ij} \in \{0,1\}$ indicates whether $App_i$ contains class j: $C_{ij} = 1$ if so, otherwise $C_{ij} = 0$. For $R5_{in}$, $I_{ij} \in \{0,1\}$ indicates whether $App_i$ contains interface j: $I_{ij} = 1$ if so, otherwise $I_{ij} = 0$. For $R6_{in}$, $S_{ij} \in \{0,1\}$ indicates whether $App_i$ contains .so file j: $S_{ij} = 1$ if so, otherwise $S_{ij} = 0$.
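A binary relation matrix of this kind can be built with a few lines of Python. The sketch below constructs an app-by-permission matrix in the spirit of $R2_{in}$; the function name, the vocabulary and the per-app feature sets are hypothetical examples.

```python
def build_relation_matrix(app_features, entity_vocab):
    """Build a 0/1 matrix with one row per app and one column per entity.

    Entry [i][j] is 1 if app i contains entity j, otherwise 0. Entities
    that are not in the vocabulary are ignored.
    """
    column = {entity: j for j, entity in enumerate(entity_vocab)}
    matrix = [[0] * len(entity_vocab) for _ in app_features]
    for i, features in enumerate(app_features):
        for feature in features:
            j = column.get(feature)
            if j is not None:
                matrix[i][j] = 1
    return matrix


# Hypothetical permission vocabulary and per-app permission sets.
permission_vocab = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.CAMERA",
]
apps = [
    {"android.permission.INTERNET", "android.permission.SEND_SMS"},
    {"android.permission.CAMERA"},
    {"android.permission.BLUETOOTH"},  # not in the vocabulary, so the row stays zero
]

r2_in = build_relation_matrix(apps, permission_vocab)
```

The other relation matrices follow the same pattern with a different vocabulary (APIs, classes, interfaces or .so files).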

More specifically, the heterogeneous graph attention network is a graph G = (V, E, A, R). Its node types include app, API, permission, permission type, class, interface and .so file, and its edge types include $R1_{in}$, $R2_{in}$, $R3_{in}$, $R4_{in}$, $R5_{in}$ and $R6_{in}$, where V denotes the set of nodes, E denotes the set of edges, A denotes the set of node types, R denotes the set of edge types, and |A| + |R| > 2.

More specifically, a meta-structure is a meta-path or a meta-graph. A meta-path is a path defined on the heterogeneous graph attention network with the source object and the target object at its two ends; if there are multiple meta-paths between the source object and the target object, they form a meta-graph.

More specifically, the adjacency matrices of K meta-structures form the adjacency matrix set $\{\Psi^{M_1}, \ldots, \Psi^{M_k}, \ldots, \Psi^{M_K}\}$, where the adjacency matrix of a meta-structure is either the adjacency matrix of a meta-path or the adjacency matrix of a meta-graph. Both are computed with the formulas given in the summary above; the adjacency matrix of a meta-graph is:

$$\Psi^{MG} = \Psi^{MP_1} \odot \ldots \odot \Psi^{MP_j} \odot \ldots \odot \Psi^{MP_m}$$

More specifically, step S4 comprises steps S41 and S42 as described in the summary above, with the normalized adjacency matrix:

$$\Psi^{M_k'} = \mathrm{Normalize}(H \cdot H^{T} \odot \Psi^{M_k})$$

More specifically, in step S5 the node embedding of the Android application to be detected is obtained through steps S51 to S53 as described in the summary above: the incremental segment of the node adjacency matrix is formed as a column matrix with j rows, where j denotes the number of in-graph nodes, the top-k algorithm is used to select the in-graph neighbor nodes $v_s$, and the aggregation uses the weight of $v_s$ on the meta-path $M_k$ and the number of meta-structures between the new node and the in-graph neighbor node $v_s$.

More specifically, the predicted value output by the logistic regression model is as described in the summary above, where b denotes the bias parameter, w denotes the weight, and the input is the node embedding of the Android application to be detected.

Example 3

The above embodiments of the present invention are merely examples given to illustrate the invention clearly and do not limit its implementation. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. An Android malware detection method based on a heterogeneous graph attention network, characterized in that it comprises the following steps:

S1: Download Android applications (apps) and label them to obtain a set of Android applications, where the set includes benign Android applications and malicious Android applications;

S2: Decompile the installation package (APK) of each Android application and extract multiple kinds of key feature entities from the decompiled files;

S3: Construct a heterogeneous graph attention network from the relationships between the Android applications and the key feature entities, convert the heterogeneous graph attention network into multiple meta-structures, and compute the adjacency matrix of each meta-structure;

S4: Obtain low-dimensional vector embeddings of the existing nodes from the adjacency matrices of the meta-structures;

S5: Train a logistic regression model using the low-dimensional vector embeddings and labels of the existing nodes to obtain a trained logistic regression model, and obtain the node embedding of the Android application to be detected;

S6: Input the node embedding of the Android application to be detected into the trained logistic regression model to obtain a detection result indicating whether that application is malicious or benign.

2. The Android malware detection method based on a heterogeneous graph attention network according to claim 1, characterized in that the key feature entities include APIs, permissions, permission types, classes, interfaces and .so files.

3. The Android malware detection method based on a heterogeneous graph attention network according to claim 2, characterized in that in-graph relation matrices $Rl_{in}$, $l \in [1,6]$, are formed from the relationships between the Android applications and the key feature entities, where $R1_{in}$ represents the relationship between apps and APIs, $R2_{in}$ represents the relationship between apps and permissions, $R3_{in}$ represents the permission types to which an app belongs, $R4_{in}$ represents the relationship between apps and classes, $R5_{in}$ represents the relationship between apps and interfaces, and $R6_{in}$ represents the relationship between apps and .so files.

4. The Android malware detection method based on a heterogeneous graph attention network according to claim 3, characterized in that the heterogeneous graph attention network is a graph G = (V, E, A, R), whose node types include app, API, permission, permission type, class, interface and .so file, and whose edge types include $R1_{in}$, $R2_{in}$, $R3_{in}$, $R4_{in}$, $R5_{in}$ and $R6_{in}$, where V denotes the set of nodes, E denotes the set of edges, A denotes the set of node types, R denotes the set of edge types, and |A| + |R| > 2.

5. The Android malware detection method based on a heterogeneous graph attention network according to claim 1, characterized in that a meta-structure is a meta-path or a meta-graph; a meta-path is a path defined on the heterogeneous graph attention network with the source object and the target object at its two ends; if there are multiple meta-paths between the source object and the target object, they form a meta-graph.

6. The Android malware detection method based on a heterogeneous graph attention network according to claim 5, characterized in that

the adjacency matrices of K meta-structures form the adjacency matrix set $\{\Psi^{M_1}, \ldots, \Psi^{M_k}, \ldots, \Psi^{M_K}\}$,

the adjacency matrix of a meta-structure is either the adjacency matrix of a meta-path or the adjacency matrix of a meta-graph,

where the adjacency matrix of a meta-path is computed as:

$$\Psi^{MP} = R_{A_1 A_2} \cdot \ldots \cdot R_{A_i A_{i+1}} \cdot \ldots \cdot R_{A_{n-1} A_n}$$

and the adjacency matrix of a meta-graph is computed as:

$$\Psi^{MG} = \Psi^{MP_1} \odot \ldots \odot \Psi^{MP_j} \odot \ldots \odot \Psi^{MP_m}$$

where $\Psi^{M_k}$ denotes the adjacency matrix of the k-th meta-structure, $R_{A_i A_{i+1}}$ denotes the relation matrix between the i-th node and the (i+1)-th node, i = 1, 2, ..., n, n denotes the number of nodes in the meta-path, $\Psi^{MP_j}$ denotes the j-th $\Psi^{MP}$, $\odot$ denotes the Hadamard product, and m denotes the number of $\Psi^{MP}$.

7. The Android malware detection method based on a heterogeneous graph attention network according to claim 6, characterized in that step S4 comprises the following steps:

S41: Encode each node as a one-hot vector to obtain a matrix H, combine H with the adjacency matrix of a given meta-structure $M_k$, and obtain the adjacency matrix of the nodes inside the meta-structure through a normalization operation:

$$\Psi^{M_k'} = \mathrm{Normalize}(H \cdot H^{T} \odot \Psi^{M_k})$$

An edge-weight-aware GAT model is then introduced to update the embeddings of the nodes inside meta-structure $M_k$: $\Phi^{M_k} = \mathrm{GAT}(H; \Psi^{M_k'})$;

S42: Use a multi-layer perceptron to learn the weight $\beta^{M_k}$ of each meta-structure $M_k$ in the fusion:

$$(\beta^{M_1}, \ldots, \beta^{M_k}, \ldots, \beta^{M_K}) = \mathrm{softmax}(\mathrm{NN}(\Phi^{M_1}), \ldots, \mathrm{NN}(\Phi^{M_k}), \ldots, \mathrm{NN}(\Phi^{M_K}))$$

where NN is a plain neural network that maps a given matrix to a numerical value,

thereby obtaining the low-dimensional vector embedding of the existing nodes:

[formula not reproduced in this text]

8. The Android malware detection method based on a heterogeneous graph attention network according to claim 7, characterized in that, in step S5, the node embedding of the Android application to be detected is obtained through the following steps:

S51: Form out-of-graph relation matrices $Rl_{out}$, $l \in [1,6]$, from the relationships between the Android application to be detected and the key feature entities;

S52: Form the incremental segment of the node adjacency matrix [symbol not reproduced in this text], whose entries represent the number of meta-structures between the new node and the in-graph node $v_j$;

S53: Use the top-k algorithm to sort the entries of the incremental segment, select the t in-graph nodes with the largest values as in-graph neighbor nodes $v_s$, s = 1, 2, ..., t, and aggregate the vectors of the new node and its in-graph neighbor nodes to obtain the node embedding of the Android application to be detected:

[formula not reproduced in this text]

where the two symbols in the formula denote, respectively, the weight of $v_s$ on the meta-path $M_k$ and the number of meta-structures between the new node and the in-graph neighbor node $v_s$.

9. The Android malware detection method based on a heterogeneous graph attention network according to claim 1, characterized in that the predicted value output by the logistic regression model is:

[formula not reproduced in this text]

where b denotes the bias parameter, w denotes the weight, and the remaining symbol denotes the node embedding of the Android application to be detected;

when the predicted value a output by the logistic regression model is greater than 0.5, the detection result is malicious; otherwise, the detection result is benign.

10. An Android malware detection device based on a heterogeneous graph attention network, used to implement the Android malware detection method based on a heterogeneous graph attention network according to claim 1, characterized in that it comprises:

a feature engineering module, used to label the apps, decompile the APKs, and extract the key feature entities;

a graph construction module, used to construct the heterogeneous graph attention network in the form of nodes and edges from the relationships between the Android applications and the key feature entities, and also used to convert the heterogeneous graph attention network into multiple meta-structures and compute the adjacency matrix of each meta-structure;

a node aggregation module, used to obtain the node embeddings of Android applications and to obtain the low-dimensional vector embeddings of nodes from the adjacency matrices of the meta-structures;

a detection module, used to learn the classification from the low-dimensional vector embeddings and labels of the nodes, to perform detection based on the node embedding of the Android application to be detected, and to output a detection result indicating whether that application is malicious or benign.