CN112749391B - Detection method and device for malware countermeasure sample and electronic equipment - Google Patents
- Publication number
- CN112749391B (application CN202011630878.5A)
- Authority
- CN
- China
- Prior art keywords
- sample
- granularity
- detection
- encoder
- function call
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
- G06F21/562—Static detection
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses a method and device for detecting malware adversarial samples, and an electronic device, belonging to the field of the Android software ecosystem. The method comprises the following steps: S1: extracting multi-granularity function call graphs from each normal APK sample; S2: training a corresponding variational graph auto-encoder for the function call graph of each granularity based on the normal APK samples, wherein the variational graph auto-encoder comprises an encoder and a decoder; S3: constructing an adversarial-sample detection model for each granularity using the variational graph auto-encoder; the adversarial-sample detection model is used for learning the data distribution of normal APP samples at each granularity; S4: inputting the sample under test into the trained adversarial-sample detection model, and judging the detection result according to the hidden variables output by the encoder and the corresponding reconstruction result of the decoder. By training the adversarial-sample detection model on normal samples, the invention extends single-granularity malware detection to multi-granularity malware detection and can improve malware detection accuracy.
Description
Technical Field
The invention belongs to the field of the Android software ecosystem, and particularly relates to a method and device for detecting malware adversarial samples, and an electronic device.
Background
With the rapid development of the mobile internet, mobile devices such as mobile phones have gradually become the main tool people use to access the internet. Among these mobile devices, Android holds roughly an 87% share and is currently the most mainstream operating system. However, the openness and vulnerability of the Android system, together with imperfect application-market review mechanisms, have also led to the rapid growth and wide spread of malware.
To cope with the constantly emerging Android malware, machine-learning-based Android malware detection methods keep appearing and their detection performance keeps improving; the F1 score of some detection methods even approaches 0.99. However, these Android malware detectors face a serious threat: the adversarial example. An adversarial sample x* is a new sample, generated by applying an adversarial perturbation to the original sample x, that can deceive the classifier F. It can be expressed as:

x* = x + δ_x, δ_x = min ‖x* − x‖, s.t. F(x*) ≠ F(x)

where δ_x is a very small perturbation applied to the original sample x. To cause misclassification, the adversarial sample can be generated according to:

x_adv ∈ argmax_{x*} L(θ, x*, y)

where L is the loss function of the attacked model and y is the label of sample x.
An adversarial sample is formed by adding a subtle perturbation to a normal sample (also known as a natural sample) that can cause a machine learning model to give a wrong classification result with high confidence. By applying carefully crafted perturbations to Android malware samples (APP malicious samples for short), an attacker can deceive the detection system. Adversarial samples thus provide Android malware authors with a brand-new technical means of evading detection and pose a great threat to existing detection systems.
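As an illustration of how such a perturbation is crafted (not taken from the patent), the following sketch shows the classic one-step gradient-sign attack on a toy logistic "classifier", where the gradient of the cross-entropy loss has a closed form. All names and values here are hypothetical.

```python
# Illustrative sketch: one-step gradient-sign perturbation x* = x + eps * sign(grad_x L)
# against a logistic model p = sigmoid(w.x + b). Hypothetical toy setting only.
import numpy as np

def fgsm_perturb(x, w, b, y, eps=0.1):
    """Perturb x to increase the cross-entropy loss of a logistic classifier.

    For cross-entropy, grad_x L = (p - y) * w, so the attack reduces to
    x + eps * sign((p - y) * w), staying inside an eps-ball in the inf-norm.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # model's predicted probability
    grad = (p - y) * w                        # closed-form input gradient
    return x + eps * np.sign(grad)

x = np.array([0.2, -0.5, 1.0])
w = np.array([1.5, -2.0, 0.5])
x_adv = fgsm_perturb(x, w, b=0.0, y=1.0, eps=0.1)
# Each coordinate moves by at most eps, yet the loss strictly increases.
```

The same principle, adapted to discrete graph features, underlies the perturbed APP samples the patent aims to detect.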
Disclosure of Invention
Aiming at the above defects or improvement demands of the prior art, the invention provides a method, a device and an electronic device for detecting malware adversarial samples. Its aim is to provide an adversarial-sample detection method for the Android APP ecosystem that trains an adversarial-sample detection model using normal samples and extends single-granularity malware detection to multi-granularity malware detection, thereby improving malware detection accuracy and solving the technical problem that existing detection methods have a low recognition rate for adversarial samples.
To achieve the above object, according to one aspect of the present invention, there is provided a method for detecting malware adversarial samples, comprising:

S1: extracting multi-granularity function call graphs from each normal APK sample, wherein the granularities of the multi-granularity function call graphs at least comprise: Family, Class, Package and Function; the function call graph of each granularity comprises node information and edge information;

S2: training a corresponding variational graph auto-encoder for the function call graph of each granularity based on the normal APK samples, wherein the variational graph auto-encoder comprises an encoder and a decoder;

S3: constructing an adversarial-sample detection model for each granularity using the variational graph auto-encoder; the adversarial-sample detection model is used for learning the data distribution of normal APP samples at each granularity;

S4: inputting the sample under test into the trained adversarial-sample detection model, and judging the detection result according to the hidden variables output by the encoder and the reconstruction result output by the decoder.
In one embodiment, the step S1 includes:
S101: generating smali files from the original file corresponding to each normal APK sample with a decompilation tool;

S102: extracting function call relations from the smali files to form the function call graph of each granularity.
In one embodiment, the call graph of each granularity comprises node information and edge information, and the step S102 comprises:

extracting function call relations from the smali files along the four dimensions Family, Class, Package and Function, and characterizing the multi-granularity function call graphs, denoted G_function, G_class, G_package and G_family respectively;

representing each node in the function call graph of each granularity with a one-hot code; the node information reflects the semantic information of each node; the edge information, corresponding to the edges between nodes, reflects the topological relation of the function call graph and corresponds to the call relations between functions in the APK.
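The coarser graphs can be derived from the function-level edge set by collapsing node names. The sketch below is an assumed implementation (not the patent's code) that projects full smali-style names of the form "package.Class.method" down to a chosen number of name components; the example graph is hypothetical.

```python
# Sketch: derive coarser-granularity call graphs (class/package/family) from
# the function-level edge set by truncating dot-separated node names.
def coarsen(edges, level):
    """Project function-level call edges onto a coarser granularity.

    level: number of dot-separated name components to keep
           (e.g. family=1, package=2, class=3; the full name is function level).
    """
    def project(name):
        return ".".join(name.split(".")[:level])

    coarse = set()
    for caller, callee in edges:
        u, v = project(caller), project(callee)
        if u != v:                 # drop self-loops created by merging nodes
            coarse.add((u, v))
    return coarse

g_function = {
    ("com.app.Main.onCreate", "com.app.Net.send"),
    ("com.app.Net.send", "com.app.Net.encode"),
}
g_class = coarsen(g_function, 3)   # class-level graph
g_family = coarsen(g_function, 1)  # family-level graph (all nodes merge here)
```

In this toy example the class-level graph keeps a single cross-class edge, while at family level every node collapses into one, leaving no edges.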
In one embodiment, the step S2 comprises:

inputting the function call graph of each granularity corresponding to the normal APK samples into the encoder of the corresponding variational graph auto-encoder, so that the encoder outputs a feature matrix parameterizing a Gaussian distribution; the encoder is built on a graph convolutional network (GCN);

when the feature matrix is passed to the decoder, hidden variables are first obtained by sampling from the generated Gaussian distribution, and the sample is then reconstructed by decoding the hidden variables with an inner-product operation.
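The encode-sample-decode pipeline above follows the standard VGAE formulation; a minimal numpy sketch of one forward pass is shown below. The layer sizes and random weights are toy assumptions, not the patent's configuration.

```python
# Numpy sketch of a VGAE forward pass: two-layer GCN encoder producing mu and
# log sigma, reparameterized sampling, and an inner-product (sigmoid) decoder.
import numpy as np

rng = np.random.default_rng(0)

def normalize(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def vgae_forward(X, A, W0, W_mu, W_sigma):
    A_norm = normalize(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)      # shared first GCN layer, ReLU
    mu = A_norm @ H @ W_mu                    # mean of q(Z|X,A)
    log_sigma = A_norm @ H @ W_sigma          # log std of q(Z|X,A)
    Z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)  # sample latents
    A_rec = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))  # inner-product decoder sigmoid(Z Z^T)
    return mu, log_sigma, A_rec

n, f, h, d = 4, 3, 5, 2                       # nodes, features, hidden, latent
X = rng.standard_normal((n, f))               # toy one-hot-like node features
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
mu, log_sigma, A_rec = vgae_forward(
    X, A,
    rng.standard_normal((f, h)),
    rng.standard_normal((h, d)),
    rng.standard_normal((h, d)),
)
```

The reconstructed adjacency is symmetric with entries in (0, 1), ready to compare against the input graph.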
In one embodiment, the mean μ and variance σ of the Gaussian distribution are:

μ = GCN_μ(X, A) = Ã · ReLU(Ã X W_0) · W_μ
log σ = GCN_σ(X, A) = Ã · ReLU(Ã X W_0) · W_σ, with Ã = D^(−1/2) A D^(−1/2),

wherein X is the feature matrix of the semantic feature graph of one granularity, A is the adjacency matrix, D is the degree matrix, ReLU is the activation function, and the generation of the mean and variance shares the parameter W_0;

the reconstructed adjacency matrix obtained by decoding with the inner-product operation to reconstruct the sample is:

Â = sigmoid(Z Zᵀ),

wherein Z is the matrix of hidden variables sampled from the Gaussian distribution.
in one embodiment, the loss function of the challenge sample detection model is defined as:
L=-E q(Z|X,A) [logp(A|Z)]+KL[q(Z|X,A)|p(Z)];
wherein q (Z|X, A) is the distribution calculated by GCN, p (Z) is the standard Gaussian distribution, E q(Z|X,A) KL [ q (z|x, a) |p (Z) for decoder to generate a expectation of a]KL divergence of the distribution generated for the decoder and the standard gaussian distribution.
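Both terms of this loss can be computed in closed form for a diagonal Gaussian encoder. The sketch below (illustrative values, not the patent's code) estimates the reconstruction term as binary cross-entropy over adjacency entries and uses the analytic KL against N(0, I).

```python
# Sketch of the VGAE loss: BCE reconstruction term (a Monte-Carlo estimate of
# -E_q[log p(A|Z)]) plus closed-form KL[N(mu, sigma^2) || N(0, I)]. Toy values.
import numpy as np

def vgae_loss(A, A_rec, mu, log_sigma, eps=1e-9):
    recon = -np.mean(A * np.log(A_rec + eps)
                     + (1 - A) * np.log(1 - A_rec + eps))
    # KL divergence, summed over latent dims, averaged over nodes
    kl = 0.5 * np.mean(np.sum(
        np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma, axis=1))
    return recon + kl

A = np.array([[0., 1.], [1., 0.]])
A_rec = np.array([[0.1, 0.9], [0.9, 0.1]])   # near-perfect reconstruction
mu = np.zeros((2, 2))
log_sigma = np.zeros((2, 2))
loss = vgae_loss(A, A_rec, mu, log_sigma)    # KL term vanishes here
```

With mu = 0 and log sigma = 0 the posterior already matches the prior, so the loss reduces to the reconstruction term alone; shifting mu away from zero strictly increases it.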
In one embodiment, the step S4 comprises:

S401: inputting the sample under test into the trained adversarial-sample detection model so that the encoder outputs hidden variables, and comparing the mean and variance of the hidden variables with the standard normal distribution to obtain a detection result;

S402: inputting the sample under test into the trained adversarial-sample detection model, and comparing the reconstruction result output by the model with the sample under test; if the difference between the reconstruction result and the sample is larger than a preset value, the sample reconstruction has failed, i.e. the sample is an adversarial sample; if the difference between the reconstruction result and the sample is smaller than or equal to the preset value, the sample has been successfully reconstructed, i.e. the sample is a normal sample.
In one embodiment, the step S401 comprises:

for each granularity, comparing the reconstruction result of that granularity with the sample under test to obtain a first test result:

r_i^(1) = sign(‖G_i − G_i′‖ − thr1);

obtaining a second test result from the difference between the encoder's output distribution and the standard normal distribution:

r_i^(2) = sign(KL[q_i(Z|X,A) ‖ N(0, I)] − thr2);

performing an OR operation on the first test result and the second test result to obtain the detection result r_i for that granularity,

wherein sign is the sign function, G_i denotes the graph data of one granularity, G_i′ denotes the reconstructed data of that granularity, and thr1 and thr2 are preset thresholds;

performing an OR operation on the detection results r_function, r_class, r_package and r_family of the individual granularities to obtain the overall detection result r.
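The final OR-fusion step is straightforward; a minimal sketch with hypothetical flag names follows.

```python
# Sketch: fuse the per-granularity 0/1 decisions with OR — a sample is flagged
# as adversarial as soon as any one granularity-level detector flags it.
def fuse(results):
    """results: dict mapping granularity name -> 0/1 flag; returns overall r."""
    return int(any(results.values()))

r = fuse({"function": 0, "class": 1, "package": 0, "family": 0})
```

Here only the class-level detector fires, which is enough for the overall result r to be 1.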
According to another aspect of the present invention, there is provided an apparatus for detecting Android-side malicious-application adversarial samples, comprising:

an extraction module for extracting multi-granularity function call graphs from each normal APK sample, the granularities at least comprising Family, Class, Package and Function; the function call graph of each granularity comprises node information and edge information;

a training module for training a corresponding variational graph auto-encoder for the function call graph of each granularity based on the normal APK samples, the variational graph auto-encoder comprising an encoder and a decoder;

a modeling module for constructing an adversarial-sample detection model for each granularity using the variational graph auto-encoder; the adversarial-sample detection model is used for learning the data distribution of normal APP samples at each granularity;

a test module for inputting the sample under test into the trained adversarial-sample detection model and judging the detection result according to the hidden variables output by the encoder and the reconstruction result output by the decoder.

According to another aspect of the present invention, there is provided an electronic device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the detection method described above.
In general, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:

1. The invention provides an adversarial-sample detection method for the Android APP field that trains an adversarial-sample detection model using normal samples and extends single-granularity malware detection to multi-granularity malware detection, which can improve malware detection accuracy.

2. The invention trains the adversarial-sample detection model with one-class classification, which reduces the dependence on negative samples during training and improves the generalization ability of the model; in addition, the invention tests the adversarial-sample detection model with discretely perturbed adversarial samples, which makes it possible to measure detection accuracy and further improve the detection performance of the model.

3. The invention inputs the sample under test into the trained adversarial-sample detection model to obtain the output distribution and the reconstruction result, compares, at each granularity, the output distribution with the standard normal distribution and the reconstruction result with the sample under test to obtain per-granularity difference information, and finally combines the per-granularity difference information with an OR operation; as soon as the difference at any granularity exceeds the difference threshold, the sample is judged to be an adversarial sample, which improves the detection accuracy for adversarial samples.
Drawings
FIG. 1 is a flowchart of a method for detecting malware adversarial samples according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a process of extracting a function call graph from an APK according to an embodiment of the invention;
FIG. 3 is a diagram illustrating a function call graph with different granularity according to an embodiment of the present invention;
FIG. 4 is a flowchart of implementing multi-granularity APP adversarial-sample detection based on variational graph auto-encoders in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
According to one aspect of the present invention, as shown in FIG. 1, there is provided a method for detecting malware adversarial samples, comprising:

S1: extracting multi-granularity function call graphs from each normal APK sample, the granularities at least comprising Family, Class, Package and Function; the function call graph of each granularity includes node information and edge information.

S2: training a corresponding variational graph auto-encoder for the function call graph of each granularity based on the normal APK samples, the variational graph auto-encoder comprising an encoder and a decoder.

S3: constructing an adversarial-sample detection model for each granularity using the variational graph auto-encoder; the adversarial-sample detection model is used for learning the data distribution of normal APP samples at each granularity.

S4: inputting the sample under test into the trained adversarial-sample detection model, and judging the detection result according to the hidden variables output by the encoder and the reconstruction result output by the decoder.
In one embodiment, step S1 comprises: S101: generating smali files from the original file corresponding to each normal APK sample with a decompilation tool. S102: extracting function call relations of different granularities from the smali files to form the function call graph of each granularity.
Specifically, as shown in FIG. 2, the process of generating the smali files and the function call graph of each granularity comprises: 1. generating the various files from the original APK with a decompilation tool; 2. extracting function call relations from the generated smali files to form the most basic function call graph; 3. since functions can be represented at different granularities, the entire function call graph is characterized along four dimensions, Family, Class, Package and Function, as shown in FIG. 2. As shown in FIG. 3, the function call graph can thus be mapped to four new graphs, denoted G_function, G_class, G_package and G_family.
It should be noted that the call graph of each granularity is divided into two important components: node information and edge information. The edge information reflects the topological relation of the graph and corresponds to the call relations between functions in the APK; the node information represents the semantic information of each node. Generating an adversarial sample often breaks the semantic information of individual nodes in a normal APK, so each node in the function call graph can be represented by a one-hot code, which provides effective support for the subsequent detection of adversarial samples.
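A one-hot node feature matrix of the kind mentioned above can be built with a few lines of code. The vocabulary and node names below are illustrative assumptions, not taken from the patent.

```python
# Sketch: one-hot node features for a call graph — each node gets a unit
# vector indexed by its position in a fixed node-name vocabulary.
import numpy as np

def one_hot_nodes(nodes, vocab):
    """Return an |nodes| x |vocab| matrix with one 1 per row."""
    index = {name: i for i, name in enumerate(vocab)}
    X = np.zeros((len(nodes), len(vocab)))
    for row, name in enumerate(nodes):
        X[row, index[name]] = 1.0
    return X

vocab = ["android.app", "android.net", "com.example"]   # hypothetical names
X = one_hot_nodes(["com.example", "android.net"], vocab)
```

The resulting matrix X is exactly the node-feature input the GCN encoder consumes alongside the adjacency matrix.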
After the vector representation of an APP sample is acquired, a detection method needs to be designed to distinguish adversarial samples from normal samples. However, the generation of APP adversarial samples is diverse and evolving, and it is difficult to propose a detection scheme for each method separately. On the other hand, it is difficult to obtain a large number of adversarial samples in practice, so the conventional adversarial training approach cannot be used. To overcome these difficulties, the present invention proposes to detect APP adversarial samples from multiple granularities based on a one-class classification model. This has the advantage that only normal APP samples (including both benign and malicious samples, collectively called normal samples) are needed to train the classification model; the method remains effective against various, even unknown, types of adversarial samples, and at the same time it cleverly exploits the multi-granularity nature of Android malware features to resist adversarial attacks at the feature level.
In one embodiment, the call graph of each granularity contains node information and edge information, and step S102 comprises: extracting function call relations from the smali files along the four dimensions Family, Class, Package and Function and characterizing the multi-granularity function call graphs, denoted G_function, G_class, G_package and G_family respectively. Each node in the function call graph of each granularity is encoded with a one-hot code. The node information reflects the semantic information of each node. The edge information, corresponding to the edges formed between nodes, reflects the topological relation of the function call graph and corresponds to the call relations between functions in the APK.
In one embodiment, as shown in FIG. 4, step S2 comprises: inputting the function call graph of each granularity corresponding to the normal APK samples into the encoder of the corresponding variational graph auto-encoder, so that the encoder outputs a feature matrix parameterizing a Gaussian distribution. The encoder is built on a graph convolutional network (GCN). The feature matrix is passed to the decoder: hidden variables are obtained by sampling from the generated Gaussian distribution, and the sample is then reconstructed by decoding with an inner-product operation.
The present invention does not directly distinguish APP adversarial samples from normal samples; instead, samples are mapped into a hidden (latent) space before their vector representations are compared and distinguished. This procedure typically uses methods from graph representation learning or graph embedding. The graph convolutional network (Graph Convolutional Network, GCN) is one such graph representation learning method. It is a natural generalization of convolutional neural networks to graph data and can learn node attribute information and topological structure information end to end at the same time. GCN is significantly superior to other methods in many tasks such as node classification and edge prediction, so GCN is used to represent complex graph data in a vector form that a machine learning model can process.
Specifically, an adversarial-sample detection model is constructed for each granularity using a variational graph auto-encoder (Variational Graph Auto-Encoder, VGAE). This choice rests mainly on three considerations: 1) in the low-dimensional hidden space, normal samples are clearly separated from adversarial samples; 2) compared with normal samples, adversarial samples are harder to reconstruct; 3) even if one granularity of the model is evaded by an adversarial-sample attack, the model can still defend well against the adversarial sample at the other granularities. The basic idea is as follows: first, the variational graph auto-encoders are trained on the function call graphs of each granularity using normal samples, so that normal samples can be reconstructed well; once model training is completed, an input APP adversarial sample will fail to be reconstructed and will hardly fall into the distribution fixed during the training phase in the low-dimensional hidden space.
In one embodiment, the mean μ and variance σ of the Gaussian distribution are:

μ = GCN_μ(X, A) = Ã · ReLU(Ã X W_0) · W_μ
log σ = GCN_σ(X, A) = Ã · ReLU(Ã X W_0) · W_σ, with Ã = D^(−1/2) A D^(−1/2),

wherein X is the feature matrix of the input semantic feature graph of one granularity, A is the adjacency matrix, D is the degree matrix, ReLU is the activation function, and the generation of the mean and variance shares the parameter W_0;

decoding with the inner-product operation to reconstruct the sample yields the reconstructed adjacency matrix:

Â = sigmoid(Z Zᵀ),

wherein Z is the matrix of hidden variables sampled from the Gaussian distribution.
in one embodiment, the loss function of the challenge sample detection model is defined as:
L=-E q(Z|X,A) [logp(A|Z)]+KL[q(Z|X,A)|p(Z)];
wherein q (Z|X, A) is the distribution calculated by GCN, p (Z) is the standard Gaussian distribution, E q(Z|X,A) KL [ q (z|x, a) |p (Z) for decoder to generate a expectation of a]KL divergence of the distribution generated for the decoder and the standard gaussian distribution.
Specifically, the adversarial-sample detection model learns the data distribution of normal APP samples at each granularity and so preserves the difference between adversarial and normal samples in the hidden space. The invention proposes to measure the difference between adversarial and normal samples on the data distribution in the hidden space (called the hidden-layer distribution). Accordingly, an adversarial sample need only be identified from the model's reconstruction quality and the hidden-layer distribution: inputs that lead to a poor reconstruction or an excessive hidden-layer distribution difference are identified as adversarial samples. The method needs no prior knowledge about adversarial samples, has strong generalization ability, and can effectively detect adversarial samples of different or even unknown types.
In one embodiment, step S4 comprises: S401: inputting the sample under test into the trained adversarial-sample detection model so that the encoder outputs hidden variables, and comparing the mean and variance of the hidden variables with the standard normal distribution to obtain a detection result;

S402: inputting the sample under test into the trained adversarial-sample detection model, and comparing the reconstruction result output by the model with the sample under test; if the difference between the reconstruction result and the sample is larger than a preset value, the sample reconstruction has failed, i.e. the sample is an adversarial sample; if the difference is smaller than or equal to the preset value, the sample has been successfully reconstructed, i.e. the sample is a normal sample.

Specifically, two discrimination indexes are used during detection: 1) the smaller the difference between the output mean μ and variance σ of the encoder and the standard normal distribution, the lower the probability that the app is an adversarial sample; 2) the smaller the difference between the input graph G and the output graph G′ relative to what was observed when training the detector, the better the graph reconstruction and the lower the probability that the app is an adversarial sample; otherwise the app is judged to be an adversarial sample. Therefore, whether a sample is adversarial is determined from the difference between the encoder's output distribution and the standard normal distribution together with the difference between the input graph under test and the output reconstructed graph.
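The two discrimination indexes can be combined into a single decision rule. The sketch below assumes concrete distance measures (mean absolute reconstruction error; closed-form KL of a diagonal Gaussian from N(0, I)) and placeholder thresholds, all of which would in practice be calibrated on normal samples.

```python
# Sketch: two-indicator decision — flag a sample if either its reconstruction
# error or its latent-distribution divergence exceeds its threshold.
import numpy as np

def is_adversarial(G, G_rec, mu, log_sigma, thr1=0.5, thr2=0.5):
    recon_err = np.mean(np.abs(G - G_rec))          # index 2: reconstruction
    kl = 0.5 * np.mean(np.sum(                       # index 1: KL from N(0, I)
        np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma, axis=1))
    return int(recon_err > thr1 or kl > thr2)

G = np.array([[0., 1.], [1., 0.]])
zeros = np.zeros((2, 2))
normal = is_adversarial(G, G, zeros, zeros)          # perfect reconstruction
flagged = is_adversarial(G, 1.0 - G, zeros, zeros)   # maximally wrong graph
```

A perfectly reconstructed graph with a prior-matching posterior passes both checks, while an inverted reconstruction trips the error threshold.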
In one embodiment, step S402 comprises:

for each granularity, comparing the reconstruction result of that granularity with the sample under test to obtain a first test result:

r_i^(1) = sign(‖G_i − G_i′‖ − thr1);

obtaining a second test result from the difference between the encoder's output distribution and the standard normal distribution:

r_i^(2) = sign(KL[q_i(Z|X,A) ‖ N(0, I)] − thr2);

performing an OR operation on the first test result and the second test result to obtain the detection result r_i for that granularity,

wherein sign is the sign function, G_i denotes the graph data of one granularity, G_i′ denotes the reconstructed data of that granularity, thr1 and thr2 are preset thresholds, and i denotes the granularity type. The thresholds thr1 and thr2 may be determined empirically; alternatively, the differences of the normal samples may be computed and the maximum sample difference used as the threshold. r_i = 1 means that an APK is judged to be an adversarial sample at this granularity.

The detection results r_function, r_class, r_package and r_family of the individual granularities are combined with an OR operation to obtain the overall detection result r. Specifically, all discrimination results are integrated by OR: as long as any one granularity identifies an adversarial sample, the sample is judged to be an adversarial sample.
According to another aspect of the present invention, there is provided an apparatus for detecting Android-side malicious-application adversarial samples, comprising:

an extraction module for extracting multi-granularity function call graphs from each normal APK sample, the granularities at least comprising Family, Class, Package and Function; the function call graph of each granularity comprises node information and edge information;

a training module for training a corresponding variational graph auto-encoder for the function call graph of each granularity based on the normal APK samples, the variational graph auto-encoder comprising an encoder and a decoder;

a modeling module for constructing an adversarial-sample detection model for each granularity using the variational graph auto-encoder; the adversarial-sample detection model is used for learning the data distribution of normal APP samples at each granularity;

a test module for inputting the sample under test into the trained adversarial-sample detection model and judging the detection result according to the hidden variables output by the encoder and the reconstruction result output by the decoder.

According to another aspect of the present invention, there is provided an electronic device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the detection method described above.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (3)
1. A method for detecting a malware adversarial sample, comprising:
S1: extracting a multi-granularity function call graph from each normal APK sample, wherein the granularities of the multi-granularity function call graph comprise: Family, Class, Package and Function; the function call graph at each granularity comprises node information and edge information;
S2: training a corresponding variational graph auto-encoder for the function call graph at each granularity based on the normal APK samples, wherein the variational graph auto-encoder comprises an encoder and a decoder;
S3: constructing an adversarial sample detection model for each granularity using the variational graph auto-encoder; the adversarial sample detection model is used for learning the data distribution of normal APP samples at each granularity;
S4: inputting a detection sample into the trained adversarial sample detection model, and determining the detection result according to the hidden variables output by the encoder and the reconstruction result output by the decoder;
S1 comprises: S101: generating smali files from the original file corresponding to each normal APK sample with a decompilation tool; S102: extracting function call relations from the smali files to form a function call graph at each granularity; S102 comprises: extracting function call relations from the smali files along the four dimensions Family, Class, Package and Function to characterize the multi-granularity function call graphs, denoted G_function, G_class, G_package and G_family respectively; representing each node in the function call graph at each granularity with a one-hot encoding; the node information reflects the semantic information of each node; the edge information, corresponding to the edges between nodes, reflects the topology of the function call graph and corresponds to the call relations between functions in the APK;
S2 comprises: inputting the function call graph at each granularity corresponding to the normal APK samples into the encoder of the corresponding variational graph auto-encoder, so that the encoder outputs the feature matrices of a Gaussian distribution; the encoder is built on a graph convolutional network (GCN); the feature matrices are fed to the decoder, hidden variables are obtained by sampling from the generated Gaussian distribution, and samples are reconstructed by decoding with an inner-product operation; the mean μ and variance σ of the Gaussian distribution are:
μ = Ã·ReLU(Ã X W_0)·W_μ, log σ = Ã·ReLU(Ã X W_0)·W_σ, with Ã = D^(-1/2) A D^(-1/2);
where X is the feature matrix of the semantic feature graph at one granularity, A is the adjacency matrix, D is the degree matrix, ReLU is the activation function, and the generation of the mean and the variance shares the parameters W_0;
the reconstructed adjacent matrix obtained by decoding and sample reconstruction by using the inner product operation is marked as:ε~N(0,1);
the loss function of the challenge sample detection model is defined as:
L=-E q(Z|X,A) [logp(A|Z)]+KL[q(Z|X,A)|p(Z)];
q (Z|X, A) is the distribution calculated by GCN, p (Z) is the standard Gaussian distribution, E q(Z|X,A) KL [ q (z|x, a) |p (Z) for decoder to generate a expectation of a]KL divergence of the distribution generated for the decoder and the standard gaussian distribution;
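The loss above can be sketched as follows, assuming (as is standard for VGAE, though not spelled out in the text) a Bernoulli reconstruction term over adjacency entries and a diagonal-Gaussian posterior:

```python
import numpy as np

# Sketch of L = -E_q[log p(A|Z)] + KL[q(Z|X,A) || N(0, I)] under the
# assumptions stated in the lead-in; eps guards the logarithms.
def vgae_loss(A, A_rec, mu, log_sigma, eps=1e-9):
    recon = -np.mean(A * np.log(A_rec + eps)
                     + (1 - A) * np.log(1 - A_rec + eps))   # -E_q[log p(A|Z)]
    kl = 0.5 * np.mean(np.exp(2 * log_sigma) + mu ** 2
                       - 1.0 - 2 * log_sigma)               # KL[q || N(0, I)] per dimension
    return recon + kl
```

With mu = 0 and log_sigma = 0 the KL term vanishes, so the loss reduces to the reconstruction term alone.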
S4 comprises: S401: inputting the detection sample into the trained adversarial sample detection model so that the encoder outputs hidden variables, and comparing the mean and variance of the hidden variables with the standard normal distribution to obtain a detection result; S402: inputting the detection sample into the trained adversarial sample detection model, and comparing the reconstruction result output by the model with the detection sample; if the difference between the reconstruction result and the sample is greater than a preset value, the reconstruction fails, i.e. the sample is an adversarial sample; if the difference between the reconstruction result and the sample is less than or equal to the preset value, the reconstruction succeeds, i.e. the sample is a normal sample;
the S401 includes: comparing the reconstruction result corresponding to one granularity with the detection sample to obtain a first test result aiming at each granularity, wherein the first test result is as follows:
i={function,class,package,family};
obtaining a second test result by utilizing the difference between the output distribution of the encoder and the standard normal distribution, wherein the second test result is as follows:
i= { function, class, package, family }; performing OR operation on the first test result and the second test result to obtain a detection result corresponding to the granularity, wherein the detection result is->i={function,class,package,family};
where sign is the sign function, G_i denotes the graph data at a granularity, G_i' denotes the reconstructed data at that granularity, and thr1 and thr2 denote preset thresholds; the detection results r_function, r_class, r_package and r_family corresponding to the granularities are OR-ed to obtain the overall detection result r.
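The per-granularity decision of claim 1 can be sketched as below. The L1 difference and the closed-form Gaussian KL are our assumptions for the unspecified distance measures; thresholds would in practice be set empirically, e.g. to the maximum difference observed on normal samples.

```python
import numpy as np

# Hedged sketch: r_i^1 thresholds the reconstruction error at thr1,
# r_i^2 thresholds the KL distance to N(0, I) at thr2, and the two
# flags are OR-ed; 1 means adversarial at this granularity.
def detect_granularity(G, G_rec, mu, log_sigma, thr1, thr2):
    r1 = int(np.abs(G - G_rec).sum() > thr1)   # equivalent to sign(||G - G'|| - thr1) > 0
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1.0 - 2 * log_sigma)
    r2 = int(kl > thr2)                        # encoder distribution far from N(0, I)
    return r1 | r2
```

A perfectly reconstructed sample whose posterior matches the standard normal yields 0 at every granularity, so the overall OR also returns 0.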
2. A detection apparatus for Android-end malicious applications, configured to perform the method for detecting a malware adversarial sample of claim 1, comprising:
the extracting module is used for extracting a multi-granularity function call graph from each normal APK sample, and the granularities of the multi-granularity function call graph at least comprise: Family, Class, Package and Function; the function call graph at each granularity comprises node information and edge information;
the training module is used for training a corresponding variational graph auto-encoder for the function call graph at each granularity based on the normal APK samples, the variational graph auto-encoder comprising an encoder and a decoder;
a modeling module for constructing an adversarial sample detection model for each granularity using the variational graph auto-encoder; the adversarial sample detection model is used for learning the data distribution of normal APP samples at each granularity;
and the test module is used for inputting a detection sample into the trained adversarial sample detection model, and determining the detection result according to the hidden variables output by the encoder and the reconstruction result output by the decoder.
3. An electronic device comprising a memory and a processor, wherein the memory stores a computer program that, when executed by the processor, causes the processor to perform the steps of the detection method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011630878.5A CN112749391B (en) | 2020-12-31 | 2020-12-31 | Detection method and device for malware countermeasure sample and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112749391A CN112749391A (en) | 2021-05-04 |
CN112749391B true CN112749391B (en) | 2024-04-09 |
Family
ID=75650752
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011630878.5A Active CN112749391B (en) | 2020-12-31 | 2020-12-31 | Detection method and device for malware countermeasure sample and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112749391B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826059A (en) * | 2019-09-19 | 2020-02-21 | 浙江工业大学 | Method and device for defending black box attack facing malicious software image format detection model |
CN111062036A (en) * | 2019-11-29 | 2020-04-24 | 暨南大学 | Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503558B (en) * | 2016-11-18 | 2019-02-19 | 四川大学 | A kind of Android malicious code detecting method based on community structure analysis |
Non-Patent Citations (1)
Title |
---|
Research on Android Malicious Code Detection Method Based on Function Call Graph; Li Ziqing; Computer Measurement & Control; 2017-10-25 (No. 10); full text *
Also Published As
Publication number | Publication date |
---|---|
CN112749391A (en) | 2021-05-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Pei et al. | AMalNet: A deep learning framework based on graph convolutional networks for malware detection | |
Vinayakumar et al. | Evaluating deep learning approaches to characterize and classify the DGAs at scale | |
CN106713324B (en) | Flow detection method and device | |
EP3422262A1 (en) | Method of monitoring the performance of a machine learning algorithm | |
Li et al. | Opcode sequence analysis of Android malware by a convolutional neural network | |
CN113283476A (en) | Internet of things network intrusion detection method | |
CN111931179B (en) | Cloud malicious program detection system and method based on deep learning | |
US20200159925A1 (en) | Automated malware analysis that automatically clusters sandbox reports of similar malware samples | |
CN111159697B (en) | Key detection method and device and electronic equipment | |
CN113408558B (en) | Method, apparatus, device and medium for model verification | |
CN111753290A (en) | Software type detection method and related equipment | |
CN110602120A (en) | Network-oriented intrusion data detection method | |
Assefa et al. | Intelligent phishing website detection using deep learning | |
CN114024761B (en) | Network threat data detection method and device, storage medium and electronic equipment | |
Zhao et al. | Natural backdoor attacks on deep neural networks via raindrops | |
Wang et al. | An evolutionary computation-based machine learning for network attack detection in big data traffic | |
CN113542252A (en) | Detection method, detection model and detection device for Web attack | |
CN112749391B (en) | Detection method and device for malware countermeasure sample and electronic equipment | |
CN112580044A (en) | System and method for detecting malicious files | |
CN116962009A (en) | Network attack detection method and device | |
Wang et al. | Malware detection using cnn via word embedding in cloud computing infrastructure | |
CN111488574A (en) | Malicious software classification method, system, computer equipment and storage medium | |
CN114510720A (en) | Android malicious software classification method based on feature fusion and NLP technology | |
CN112860573A (en) | Smartphone malicious software detection method | |
CN117688565B (en) | Malicious application detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |