CN115510445A

CN115510445A - Android malicious program detection method based on deep learning

Info

Publication number: CN115510445A
Application number: CN202211308160.3A
Authority: CN
Inventors: 孙钦东; 王艳; 王伟; 杨志海; 许航
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2022-12-23

Abstract

The invention discloses a deep learning-based Android malicious program detection method comprising the following steps: collecting malicious Android APK samples and normal APK samples, and outputting decompiled files through a decompiling tool; extracting the permission features, API features and 3-Gram features of the decompiled files and converting them into feature vectors that serve as training samples; for the established CNN-BiLSTM-Attention model, obtaining the local features of the samples through the CNN and the global features through the BiLSTM, with an attention mechanism adopted in the BiLSTM to reduce noise interference and highlight important features; and finally, fusing the two types of features for training, and then detecting malicious programs with the trained CNN-BiLSTM-Attention model. The method solves the problem of insufficient coverage caused by using a single feature and improves the prediction accuracy of the model.

Description

Android malicious program detection method based on deep learning

Technical Field

The invention belongs to the technical field of deep learning and malicious program detection, and relates to an Android malicious program detection method based on deep learning.

Background

With the continuous development of internet technology and the upgrading of smartphones, the smartphone has gradually changed from a luxury item into a mass consumer product that nearly everyone owns, and has become a portable intelligent mobile device integrating functions such as web browsing, video chat, photography, music playback and travel health codes. Smartphones running the Android operating system are popular because of their higher cost performance. As the Android operating system has grown rapidly, malicious applications have multiplied as well: attackers exploit the openness of the Android system, the lack of regulation in application markets and similar weaknesses to develop malware capable of stealing private data, incurring phone charges, taking remote control and so on, which seriously harms users' interests. Efficient and accurate detection of Android malicious programs is therefore of great significance.

To improve the accuracy of Android malware detection, accurate feature extraction and a well-matched deep learning algorithm are the keys to training the model. APK features include static features and dynamic features. Static features can be extracted directly from the decompiled APK file and detection is fast, but accuracy is low for Android software that has been obfuscated or that uses a malformed digital signature. Dynamic features require extracting behavioral features while the software is running; they yield a wide range of data and are not affected by techniques such as code obfuscation, but detection is slow. Appropriate representative features therefore need to be selected. Machine learning is a common classification approach and is frequently used in Android malicious program detection, for example Random Forest and Support Vector Machine (SVM). However, machine learning analyzes relatively few features when training and testing a model and captures only shallow features of a sample; the features are extracted manually and then optimized, and feature analysis attends only to the features themselves while ignoring context. Although machine learning is computationally fast, these problems lead to low detection accuracy.

A single CNN model mines only the local features of a sample; it does not consider the relationship between words and their context, and it does not assign different attention to features according to their importance for classification, which reduces both the efficiency and the classification accuracy of the model.

Disclosure of Invention

The invention aims to provide a deep learning-based method for detecting Android malicious programs that solves the problem of insufficient coverage caused by using a single feature. The method adopts a hybrid multi-model deep learning technique that extracts global context features while also extracting the local features of a sample, and introduces an attention mechanism that assigns different degrees of attention to different feature information, thereby improving the prediction accuracy of the model.

The technical scheme adopted by the invention is as follows:

a deep learning-based Android malicious program detection method comprises the following specific steps:

step 1, collecting malicious Android APK samples and normal APK samples, and outputting decompiled files through a decompilation tool;

step 2, extracting permission features, API features and 3-Gram features from the decompiled files, and converting the permission features, the API features and the 3-Gram features into feature vectors;

step 3, establishing a CNN-BiLSTM-Attention model, and training the CNN-BiLSTM-Attention model using the feature vectors obtained in step 2 as a sample set;

the CNN-BiLSTM-Attention model comprises an input layer, a word embedding layer, a bidirectional long short-term memory (BiLSTM) module fused with an attention mechanism, a convolution layer, a fully connected layer and a Softmax layer; during model training, Word2vec is first used in the word embedding layer to vectorize the features, the CNN performs a convolution operation on the feature vector matrix, and max-pooling extracts the maximum values and recombines them into a feature vector to obtain the local features; then the BiLSTM captures bidirectional dependencies and extracts global context information from the feature vector matrix, and the attention mechanism highlights important features; the final output is calculated by computing the context hidden state, obtaining the attention probability from the context hidden state and the hidden state at the previous moment, computing the weighted sum of the context hidden states with that probability, and transforming the result; the features from the two models are fused, an activation function is introduced so that the network can express a nonlinear mapping, and Softmax is used to classify the application samples;

step 4, for the program to be detected, obtaining its feature vector by the method of step 2, and inputting the feature vector into the trained CNN-BiLSTM-Attention model for detection to obtain a detection result.

The invention is also characterized in that:

specifically, in step 1 the decompiling command apktool.bat d -f [apk file path] [output folder path] generates decompiled files comprising the AndroidManifest.xml file and smali files.

The method for extracting the permission features comprises: extracting the statements containing permission from the <uses-permission> tags of the AndroidManifest.xml file, these statements being the permission features. The method for extracting the API features comprises: traversing the lines beginning with "invoke-" in the smali files and extracting them. The method for extracting the 3-Gram features comprises: applying the Dalvik instruction classification rules to the smali files, removing instructions that do not affect the classification result, simplifying the remainder into instructions that express semantics, classifying them by function so that instructions with the same or similar functions fall into one class, and representing the simplified instruction classes with seven letter labels; then acquiring data with a sliding window of length 3 in the 3-Gram model, so that one statement becomes several segments of length N = 3, namely the 3-Gram features.

The training process of the CNN-BiLSTM-Attention model is as follows:

firstly, the sample set is taken as input, and the sample features are converted into feature vectors in the embedding layer using Word2vec; secondly, the convolutional neural network selects 128 convolution kernels with sizes of 3, 4 and 5 respectively, performs convolution on the feature vector matrix in the convolution layer, extracts the maximum values by max-pooling and recombines them into a feature vector to obtain the local feature $C_{cnn}$; the feature vectors are also input to a BiLSTM network with an attention mechanism to obtain the global context information feature $C_{bilstm}$; finally, the fully connected layer fuses the two features and introduces an activation function, training is carried out with the Adam gradient descent algorithm, and the application samples are classified using Softmax.

In the process of convolutional neural network training, the size of a convolution kernel is assumed to be h × v, where h is the height of the convolution kernel and v is the dimension of a word vector; w is the weight matrix, $X_{i:i+h-1}$ is the vector matrix covered by the kernel (rows i to i+h-1), b is the bias term and f is the activation function. The convolution calculation formula is:

$$C_i = f(w \cdot X_{i:i+h-1} + b)$$

The maximum value of the feature vector produced by the CNN convolution is selected by max-pooling, and the max-pooling calculation formula is:

$$C_{cnn} = \max\{C_i\}$$
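The convolution and max-pooling formulas above can be illustrated with a small worked sketch in plain Python. It is only an illustration under stated assumptions: the source does not name the activation function f, so ReLU is assumed here, and the function name `conv_max_pool` and the toy kernel values are hypothetical.

```python
def relu(x):
    # Assumed activation f; the source does not specify which one is used.
    return max(0.0, x)

def conv_max_pool(X, w, b, f=relu):
    """Slide an h x v kernel w over the rows of X and max-pool the result.

    X: list of rows, each row a word vector of dimension v.
    w: kernel as a list of h rows of dimension v.
    Returns (C_cnn, [C_i, ...]) following C_i = f(w . X[i:i+h-1] + b).
    """
    h, v = len(w), len(w[0])
    features = []
    for i in range(len(X) - h + 1):
        window = X[i:i + h]  # rows i .. i+h-1
        s = sum(w[r][c] * window[r][c] for r in range(h) for c in range(v))
        features.append(f(s + b))
    return max(features), features

# Toy input: 4 word vectors of dimension 2, one kernel of height 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
c_cnn, c_all = conv_max_pool(X, w, b=0.0)
```

In the described model this operation would be repeated for every kernel of sizes 3, 4 and 5, and the pooled maxima concatenated into the local feature vector.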

The process of extracting global context information using the bidirectional long short-term memory network (BiLSTM) model is as follows. Let $\overrightarrow{h_t}$ be the hidden state of the forward LSTM and $\overleftarrow{h_t}$ the hidden state of the reverse LSTM; the output state of the BiLSTM is the combination of the forward and reverse states, and the calculation formula of the BiLSTM is:

$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$$

Different features influence classification to different degrees. To highlight important features and weaken the influence of irrelevant ones, an attention mechanism is introduced at the same time, allocating different degrees of attention to different feature information. Let T be the length of the input vector, let $h_j$ represent the hidden-layer state of the input vector after decoding, and let the attention probability $\alpha_{ij}$ represent the attention probability of the input word $x_j$ for the current word $y_i$. The global context information feature $C_{bilstm}$ after attention is assigned is then calculated as:

$$C_{bilstm} = \sum_{j=1}^{T} \alpha_{ij} h_j$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

where the score $e_{ij}$ is calculated from the state at the previous moment $S_{i-1}$ and $h_j$.
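The attention weighting described above (a score computed from the previous state and each hidden state, normalized into attention probabilities, then used for a weighted sum) can be sketched in plain Python. This is a sketch under assumptions: the source does not give the form of the score function, so a dot product is assumed purely for illustration, and the name `attention_context` is hypothetical.

```python
import math

def attention_context(s_prev, hidden_states):
    """Return (alphas, context) for one attention step.

    s_prev: previous state S_{i-1} as a list of floats.
    hidden_states: list of hidden states h_j, each the same length as s_prev.
    """
    # Assumed score: dot product of S_{i-1} and h_j (not specified in the source).
    scores = [sum(a * b for a, b in zip(s_prev, h)) for h in hidden_states]
    # Softmax over the scores gives the attention probabilities alpha_ij.
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the hidden states gives the attended feature.
    dim = len(hidden_states[0])
    context = [sum(a * h[k] for a, h in zip(alphas, hidden_states)) for k in range(dim)]
    return alphas, context

# With a zero previous state every score is equal, so attention is uniform.
alphas, context = attention_context([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
```

Subtracting the maximum score before exponentiating does not change the probabilities; it only avoids overflow for large scores.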

The invention has the beneficial effects that:

1) Several representative features are selected, so that the features cover the APK file more widely and the influence of code obfuscation on the features is reduced.

2) A multi-model fusion method in deep learning is selected: the CNN obtains the local features of a sample and the BiLSTM obtains the global features. Compared with an LSTM, which can only learn forward information, the BiLSTM better captures bidirectional dependencies, so that the trained model is more closely tied to the context, and an attention mechanism is adopted to reduce noise interference and highlight important features. Compared with traditional machine learning, deep learning can mine the deep features of a sample without manual optimization; a classification model with a high detection rate can be trained through the self-learning of the neural network, and the classification accuracy is markedly higher than that of a single classification model in deep learning.

Drawings

FIG. 1 is a diagram of the structure of the CNN-BiLSTM-Attention model of an example of the method of the present invention.

FIG. 2 is a flow chart of the detection of APK samples by the method of the present invention.

FIG. 3 is a loss rate (loss) graph of example 1 of the present invention.

FIG. 4 is a graph of accuracy (acc) of example 1 of the present invention.

FIG. 5 is a graph comparing experimental results of examples of the method of the present invention with different deep learning models such as LSTM, BiLSTM and CNN-BiLSTM.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention discloses a deep learning-based Android malicious program detection method, which comprises the following specific steps of:

Step 1, collecting malicious Android APK samples and normal APK samples, and outputting decompiled files through a decompilation tool; specifically, the Apktool decompilation command generates decompiled files including the AndroidManifest.xml file, smali files and the like.

Step 2, extracting permission features, API features and 3-Gram features from the decompiled files, and converting the permission features, the API features and the 3-Gram features into feature vectors; the method specifically comprises the following steps:

The decompiled AndroidManifest.xml file contains all permissions of the application; each permission written into the configuration file is wrapped by a <uses-permission> tag and contains an android.
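A minimal sketch of the permission extraction described above, using only the Python standard library. The function name `extract_permissions` and the sample manifest are hypothetical; the sketch assumes the decompiled manifest is plain XML that declares the usual `android` namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by the android: attribute prefix in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

def extract_permissions(manifest_xml):
    """Return the android:name value of every <uses-permission> tag."""
    root = ET.fromstring(manifest_xml)
    name_attr = "{" + ANDROID_NS + "}name"
    return [el.attrib[name_attr]
            for el in root.iter("uses-permission")
            if name_attr in el.attrib]

# Hypothetical decompiled manifest used only for illustration.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <application/>
</manifest>"""

permissions = extract_permissions(SAMPLE_MANIFEST)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring` with the rest unchanged.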

The decompiled smali files contain all the API calls of the application. In a smali file, the content between the keywords ".method" and ".end method" is the body of a method, and each API call begins with "invoke-". Therefore, by traversing the lines in the smali files that begin with "invoke-", the API features can be extracted.
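The line traversal described above can be sketched as follows. The function name `extract_api_calls` and the smali fragment are hypothetical; the sketch assumes each call line has the usual smali shape of an opcode, a register list in braces, and then the call target.

```python
def extract_api_calls(smali_text):
    """Collect the call target of every line that starts with 'invoke-'."""
    calls = []
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("invoke-"):
            # The target follows the register list, e.g. "{v0, v1}, Lcls;->name(args)ret".
            calls.append(stripped.rsplit("}, ", 1)[-1])
    return calls

# Hypothetical smali fragment used only for illustration.
SAMPLE_SMALI = """
.method public send()V
    invoke-static {}, Landroid/telephony/SmsManager;->getDefault()Landroid/telephony/SmsManager;
    move-result-object v0
    invoke-virtual {v0, v1}, Ljava/lang/StringBuilder;->append(Ljava/lang/String;)Ljava/lang/StringBuilder;
    return-void
.end method
"""

api_calls = extract_api_calls(SAMPLE_SMALI)
```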

The decompiled smali files also contain all the Dalvik bytecode of the application. Using the Dalvik instruction classification rules, instructions that do not affect the classification result are removed and the remainder are simplified into instructions that express semantics; the instructions are classified by function so that instructions with the same or similar functions fall into one class, and seven letter labels represent the simplified instruction classes. A sliding window of length 3 in the 3-Gram model is then used to acquire data, turning one statement into several segments of length N = 3, from which the 3-Gram features are extracted.
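The simplification and sliding-window steps above can be sketched as below. The source states that seven letter labels are used but does not list them, so the mapping in `OPCODE_CLASSES` (M, R, G, I, T, P, V) is an assumption made for illustration only, as are the function names.

```python
# Assumed mapping from Dalvik opcode prefixes to seven letter labels.
# The source does not list the labels; this table is hypothetical.
OPCODE_CLASSES = [
    ("move", "M"), ("return", "R"), ("goto", "G"), ("if-", "I"),
    ("aget", "T"), ("iget", "T"), ("sget", "T"),
    ("aput", "P"), ("iput", "P"), ("sput", "P"),
    ("invoke", "V"),
]

def to_labels(opcodes):
    """Map opcodes to letter labels, dropping opcodes outside the seven classes."""
    labels = []
    for op in opcodes:
        for prefix, label in OPCODE_CLASSES:
            if op.startswith(prefix):
                labels.append(label)
                break
    return "".join(labels)

def n_grams(labels, n=3):
    """Slide a window of length n over the label string."""
    return [labels[i:i + n] for i in range(len(labels) - n + 1)]

opcodes = ["const-string", "invoke-virtual", "move-result-object",
           "if-eqz", "iget", "return-void"]
labels = to_labels(opcodes)
grams = n_grams(labels)
```

A label string shorter than the window yields an empty list, which matches the sliding-window definition.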

The obtained permission features, API features and 3-Gram features are concatenated to form the sample set, and the feature dimension is cut to a uniform length of 200: sequences longer than 200 are truncated, and sequences shorter than 200 are padded with 0. The features of each sample are thus converted into $X_n = (x_1, x_2, \ldots, x_d)$, which is the input of the CNN-BiLSTM-Attention model.
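The length normalization above can be sketched in a few lines. The sketch assumes the features have already been encoded as integer indices starting at 1, so that 0 is free to serve as the padding value; the function name `pad_or_truncate` is hypothetical.

```python
def pad_or_truncate(ids, length=200, pad_value=0):
    """Cut a feature sequence to `length`, or right-pad it with `pad_value`."""
    return ids[:length] + [pad_value] * max(0, length - len(ids))

short = pad_or_truncate([5, 6], length=4)      # padded with zeros
long_ = pad_or_truncate(list(range(1, 301)))   # truncated to the first 200 entries
```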

Step 3, establishing a CNN-BiLSTM-Attention model, wherein the model comprises an input layer, a word embedding layer, a bidirectional long short-term memory (BiLSTM) module fused with an attention mechanism, a convolution layer, a fully connected layer and a Softmax layer. Specifically, the CNN and the BiLSTM obtain local features and global context features respectively, and the attention mechanism makes the model focus more on important features. When the model is trained, Word2vec is first used in the word embedding layer to vectorize the features; the CNN performs a convolution operation on the feature vector matrix, and max-pooling extracts the maximum values and recombines them into a feature vector to obtain the local features. The BiLSTM then captures bidirectional dependencies and extracts global context information from the feature vector matrix, and the attention mechanism highlights important features. The final output is calculated by computing the context hidden state, obtaining the attention probability from the context hidden state and the hidden state at the previous moment, computing the weighted sum of the context hidden states with that probability, and transforming the result. The features from the two models are fused; so that the network can express a nonlinear mapping, an activation function is introduced, and Softmax is used to classify the application samples.

To ensure the accuracy of the prediction results of the CNN-BiLSTM-Attention model, the sample set obtained in step 2 is used to train the CNN-BiLSTM-Attention model. The specific training process is as follows:

the sample set is taken as input, and the sample features are converted into feature vectors in the embedding layer using Word2vec; the feature vectors are input separately into a convolutional neural network (CNN) and into a bidirectional long short-term memory (BiLSTM) network with an attention mechanism, which extract the local features and the global context information features respectively; the two resulting features are then fused, and the application samples are classified using Softmax.

Then, the convolutional neural network (CNN) selects 128 convolution kernels with sizes of 3, 4 and 5 respectively, performs convolution on the feature vector matrix in the convolution layer, extracts the maximum values by max-pooling and recombines them into a feature vector to obtain the local features:

the obtained feature vector matrix is taken as the input of the CNN and convolved in the convolution layer. The size of a convolution kernel is assumed to be h × v, where h is the height of the convolution kernel and v is the dimension of a word vector; w is the weight matrix, $X_{i:i+h-1}$ is the vector matrix covered by the kernel (rows i to i+h-1), b is the bias term and f is the activation function. The convolution and max-pooling calculation formulas are:

$$C_i = f(w \cdot X_{i:i+h-1} + b)$$

$$C_{cnn} = \max\{C_i\}$$

Meanwhile, a bidirectional long short-term memory network (BiLSTM) with an attention mechanism is used to extract global context information, so that the trained model relates not only to the words themselves but also more closely to the context. Let $\overrightarrow{h_t}$ be the hidden state of the forward LSTM and $\overleftarrow{h_t}$ the hidden state of the reverse LSTM; the output state of the BiLSTM is the combination of the two:

$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$$

Different features influence classification to different degrees. To highlight important features and weaken the influence of irrelevant ones, an attention mechanism is introduced at the same time, allocating different degrees of attention to different feature information. Let T be the length of the input vector, let $h_j$ represent the hidden-layer state of the input vector after decoding, and let the attention probability $\alpha_{ij}$ represent the attention probability of the input word $x_j$ for the current word $y_i$. The global context information feature $C_{bilstm}$ after attention is assigned is then calculated as:

$$C_{bilstm} = \sum_{j=1}^{T} \alpha_{ij} h_j$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

where the score $e_{ij}$ is calculated from the state at the previous moment $S_{i-1}$ and $h_j$.

When the features from the two models are fused, the sample feature may be represented as $C = \{C_{cnn}, C_{bilstm}\}$. So that the network can express a nonlinear mapping, an activation function is introduced, and Softmax is used to classify the application samples. The loss is optimized with the Adam gradient descent algorithm, the training parameters are adjusted repeatedly according to the evaluation metrics of the model, and the optimal parameters are selected to generate the CNN-BiLSTM-Attention model with the best classification effect.
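The fusion and classification step above can be sketched as a tiny forward pass in plain Python. This is a sketch under assumptions: the source names neither the activation function nor the layer sizes, so ReLU, a single hidden fully connected layer, and the toy weights below are all hypothetical; training with Adam is not shown.

```python
import math

def softmax(z):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, W, b):
    """Fully connected layer: one output per row of W."""
    return [sum(w_k * x_k for w_k, x_k in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def classify(c_cnn, c_bilstm, W1, b1, W2, b2):
    fused = list(c_cnn) + list(c_bilstm)                   # C = {C_cnn, C_bilstm}
    hidden = [max(0.0, v) for v in dense(fused, W1, b1)]   # assumed ReLU activation
    return softmax(dense(hidden, W2, b2))                  # class probabilities

# Toy one-dimensional features and identity weights, for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = classify([1.0], [2.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
```

With the label convention used in this document, index 0 of the output would correspond to the malicious class and index 1 to the normal class.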

In the Android malicious program detection process, a classification method combining multiple models with an attention mechanism is used. Three representative features are selected so that the features cover a sample more widely; a convolutional neural network (CNN) obtains the local features of the sample; a bidirectional long short-term memory network (BiLSTM) captures bidirectional dependencies, tying words more closely to their context; and an attention mechanism highlights important features and weakens the influence of irrelevant ones. The fused features finally comprise both local features and global context features. The deep learning-based Android malicious program detection method therefore meets the requirement on prediction accuracy.

Example 1

In this embodiment, malicious application samples were downloaded from the VirusShare website, 2306 malicious APKs in total; normal application samples were downloaded in batches with a Python web crawler by traversing the popularity rankings under the different category columns of the Xiaomi app store, giving 2280 benign APKs in total. The method of the invention is used to detect malicious programs:

Step 1, carrying out batch decompilation of the APKs using Python and the decompilation tool Apktool: traversing the APK files under the malicious sample directory and the normal sample directory, and generating decompiled files such as the AndroidManifest.xml file and smali files with the Apktool decompilation command.
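The batch decompilation in step 1 could be scripted roughly as below. This is a sketch under assumptions: it uses the Apktool 2.x form `apktool d -f <apk> -o <dir>` (with `-f` to overwrite and `-o` for the output directory), which differs from the positional output path shown earlier in this document; the executable name, the directory layout and both function names are hypothetical.

```python
from pathlib import Path
import subprocess

def build_decompile_command(apk_path, out_dir, apktool="apktool"):
    """Build an Apktool 2.x decode command (assumed CLI form, see lead-in)."""
    return [apktool, "d", "-f", str(apk_path), "-o", str(out_dir)]

def decompile_all(sample_dir, out_root, run=False):
    """Build one command per .apk in sample_dir; execute them only if run=True."""
    commands = []
    for apk in sorted(Path(sample_dir).glob("*.apk")):
        cmd = build_decompile_command(apk, Path(out_root) / apk.stem)
        commands.append(cmd)
        if run:
            # Requires Apktool to be installed and on PATH.
            subprocess.run(cmd, check=True)
    return commands

example = build_decompile_command("samples/app.apk", "out/app")
```

Calling `decompile_all` once for the malicious sample directory and once for the normal sample directory keeps the two classes in separate output trees, which makes labeling the extracted features straightforward.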

Step 2, respectively extracting the permission features, the API features and the 3-Gram features from the decompiled files. Let the permission features extracted from the malicious and normal application samples be $P_n = (p_{n1}, p_{n2}, \ldots, p_{nw})$, the API features be $A_n = (a_{n1}, a_{n2}, \ldots, a_{nm})$, and the Dalvik bytecode be $D_n = (d_{n1}, d_{n2}, \ldots, d_{nu})$. Using the Dalvik instruction classification rules, instructions that do not affect the classification result are removed and the remainder are simplified into instructions that express semantics; instructions with the same or similar functions are grouped into one class, and seven letter labels represent the simplified instruction classes. A sliding window of length 3 in the 3-Gram model then turns each statement into several segments of length N = 3, giving the 3-Gram features $G_n = (g_{n1}, g_{n2}, \ldots, g_{nv})$. The features of a sample are then $X_n = \{P_n, A_n, G_n\}$. The feature dimension is cut to a uniform length of 200: sequences longer than 200 are truncated and sequences shorter than 200 are padded with 0, so that the features of each sample become $X_n = (x_1, x_2, \ldots, x_d)$, which is the input of the CNN-BiLSTM-Attention model.

Step 3, firstly, the embedding layer converts each input sample with features $X_n = (x_1, x_2, \ldots, x_d)$ into feature vectors, which together form the feature vector matrix. Here n represents the number of samples, d represents the number of dimensions (200 in this method), $x_{nd}$ represents the d-th feature of the n-th sample, and R represents the target value, which takes two values, 0 and 1: 1 indicates a normal sample and 0 indicates a malicious sample.

The obtained feature vector matrix is taken as the input of the CNN and convolved in the convolution layer. The size of a convolution kernel is assumed to be h × v, where h is the height of the convolution kernel and v is the dimension of a word vector; w is the weight matrix, $X_{i:i+h-1}$ is the vector matrix covered by the kernel (rows i to i+h-1), b is the bias term and f is the activation function. The convolution and max-pooling formulas are:

$$C_i = f(w \cdot X_{i:i+h-1} + b)$$

$$C_{cnn} = \max\{C_i\}$$

Meanwhile, a bidirectional long short-term memory network (BiLSTM) model with an attention mechanism is used to extract global context information, so that the trained model relates not only to the words themselves but also more closely to the context. Let $\overrightarrow{h_t}$ be the hidden state of the forward LSTM and $\overleftarrow{h_t}$ the hidden state of the reverse LSTM; the output state of the BiLSTM is the combination of the two:

$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$$

With T the length of the input vector, $h_j$ the hidden-layer state and $\alpha_{ij}$ the attention probability of the input word $x_j$ for the current word $y_i$, the global context information feature $C_{bilstm}$ is calculated as:

$$C_{bilstm} = \sum_{j=1}^{T} \alpha_{ij} h_j$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

where the score $e_{ij}$ is calculated from the previous state $S_{i-1}$ and $h_j$. The attention mechanism distributes different degrees of attention according to the importance of different feature information: information closely related to the current item receives more attention, and information with little relation to it receives less. This increases the weight of the main factors in the model and weakens the influence of secondary factors, improving the efficiency and accuracy of the model.

The features from the two models are fused, and the sample feature may be represented as $C = \{C_{cnn}, C_{bilstm}\}$. So that the network can express a nonlinear mapping, an activation function is introduced, and Softmax is used to classify the application samples. The loss is optimized with the Adam gradient descent algorithm, the training parameters are adjusted repeatedly according to the evaluation metrics of the model, and the optimal parameters are selected to generate the CNN-BiLSTM-Attention model with the best classification effect.

Step 4, extracting the features of the sample to be detected by the method of step 2 and inputting them into the trained CNN-BiLSTM-Attention model to obtain a classification result.

Fig. 3 and 4 are a loss plot and an acc plot. It can be seen that when the model is initially trained, both the loss curve and the acc curve slightly oscillate, and as the number of iterations increases, the loss curve gradually decreases, while the acc curve gradually increases and becomes stable.

FIG. 5 compares the experimental results (acc and F1) of the method of the present application with those of other deep learning models such as LSTM, BiLSTM and CNN-BiLSTM. As can be seen, the acc and F1 of BiLSTM are both greater than those of LSTM, which shows that extracting context information in both directions benefits model performance more than learning forward information only; the acc and F1 of CNN-BiLSTM are both greater than those of BiLSTM, which shows that multi-model fusion gives better performance than a single model; and the acc and F1 of CBA (the CNN-BiLSTM-Attention model of the invention) are both greater than those of CNN-BiLSTM, which shows that introducing the attention mechanism makes the model attend more to useful information and greatly improves performance. In conclusion, the CBA method proposed by the invention, which combines multiple models with an attention mechanism, can effectively detect Android malicious programs, and its detection performance is superior to that of the compared models.
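The two metrics compared above, acc and F1, follow their standard definitions and can be computed as in the sketch below. The source does not state which class is treated as positive for F1, so the malicious class (label 0 in this document's convention) is assumed here; the function name and the toy labels are hypothetical and are not experimental data.

```python
def accuracy_and_f1(y_true, y_pred, positive=0):
    """Return (accuracy, F1) with `positive` as the positive class.

    positive=0 assumes the malicious class is the one of interest.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1

# Toy labels for illustration only: 0 = malicious, 1 = normal.
acc, f1 = accuracy_and_f1([0, 0, 1, 1], [0, 1, 1, 1])
```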

Claims

1. A method for detecting Android malicious programs based on deep learning is characterized by comprising the following specific steps:

the CNN-BiLSTM-Attention model comprises an input layer, a word embedding layer, a bidirectional long short-term memory (BiLSTM) module fused with an attention mechanism, a convolution layer, a fully connected layer and a Softmax layer; during model training, Word2vec is first used in the word embedding layer to vectorize the features, the CNN performs a convolution operation on the feature vector matrix, and max-pooling extracts the maximum values and recombines them into a feature vector to obtain the local features; then the BiLSTM captures bidirectional dependencies and extracts global context information from the feature vector matrix, and the attention mechanism highlights important features; the final output is calculated by computing the context hidden state, obtaining the attention probability from the context hidden state and the hidden state at the previous moment, computing the weighted sum of the context hidden states with that probability, and transforming the result; the features from the two models are fused, an activation function is introduced so that the network can express a nonlinear mapping, and Softmax is used to classify the application samples;

2. The deep learning-based Android malware detection method of claim 1, wherein step 1 is specifically to generate a decompiled file including Android manifest.

3. The Android malicious program detection method based on deep learning of claim 2, wherein the method for extracting the permission features comprises: extracting a statement containing permission from a <uses-permission> tag of an android manifest.

The extraction method of the API characteristics comprises the following steps: traversing rows starting with 'invoke' in the smali file for extraction;

the extraction method of the 3-Gram features comprises the following steps: using the Dalvik instruction classification rules in the smali files, removing instructions that do not affect the classification result, simplifying the remainder into instructions that express semantics, classifying the instructions by function so that instructions with the same or similar functions fall into one class, and representing the simplified instruction classes with seven letter labels; and acquiring data with a sliding window of length 3 in the 3-Gram model, so that one statement becomes several segments of length N = 3, namely the 3-Gram features.

4. The deep learning-based Android malware detection method of claim 1, wherein the training process of the CNN-BiLSTM-Attention model is as follows:

firstly, the sample set is taken as input, and the sample features are converted into feature vectors in the embedding layer using Word2vec; secondly, the convolutional neural network selects 128 convolution kernels with sizes of 3, 4 and 5 respectively, performs convolution on the feature vector matrix in the convolution layer, extracts the maximum values by max-pooling and recombines them into a feature vector to obtain the local feature $C_{cnn}$; the feature vectors are also input to a BiLSTM network with an attention mechanism to obtain the global context information feature $C_{bilstm}$; finally, the fully connected layer fuses the two features and introduces an activation function, training is carried out with the Adam gradient descent algorithm, and the application samples are classified using Softmax.

5. The Android malicious program detection method based on deep learning of claim 4, wherein in the convolutional neural network training process, the convolution kernel size is assumed to be h × v, h is the convolution kernel height, v is the word vector dimension, w is the weight matrix, $X_{i:i+h-1}$ is the vector matrix covered by the kernel, b is the bias term, f is the activation function, and the convolution calculation formula is:

$$C_i = f(w \cdot X_{i:i+h-1} + b)$$

the maximum value of the feature vector produced by the CNN convolution is selected by max-pooling, and the max-pooling calculation formula is:

$$C_{cnn} = \max\{C_i\}$$

6. The Android malicious program detection method based on deep learning of claim 4, wherein a bidirectional long short-term memory network model is used to extract global context information, so that the trained model is no longer related only to the words themselves but relates more closely to the context; let $\overrightarrow{h_t}$ be the hidden state of the forward LSTM and $\overleftarrow{h_t}$ the hidden state of the reverse LSTM, the output state of the BiLSTM being the combination of the two:

$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$$

different features influence classification to different degrees, and an attention mechanism is introduced at the same time to highlight important features and weaken the influence of irrelevant features, so that different degrees of attention are allocated to different feature information; let T be the length of the input vector, $h_j$ represent the hidden-layer state of the input vector after decoding, and the attention probability $\alpha_{ij}$ represent the attention probability of the input word $x_j$ for the current word $y_i$; the global context information feature $C_{bilstm}$ after attention is assigned is then calculated as:

$$C_{bilstm} = \sum_{j=1}^{T} \alpha_{ij} h_j$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

wherein the score $e_{ij}$ is calculated from the previous state $S_{i-1}$ and $h_j$.