CN115510445A - Android malicious program detection method based on deep learning - Google Patents

Publication number
CN115510445A
Authority
CN
China
Prior art keywords
bilstm
features
cnn
attention
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211308160.3A
Other languages
Chinese (zh)
Inventor
孙钦东
王艳
王伟
杨志海
许航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202211308160.3A
Publication of CN115510445A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Virology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a deep learning-based method for detecting malicious Android programs, comprising the following steps: collecting malicious and normal Android APK samples and decompiling them with a decompilation tool; extracting permission features, API features and 3-Gram features from the decompiled files and converting them into feature vectors that serve as training samples; building a CNN-BiLSTM-Attention model in which the CNN obtains the local features of the samples, the BiLSTM obtains the global features, and an Attention mechanism in the BiLSTM reduces noise interference and highlights important features; finally, fusing the two types of features for training, and then detecting malicious programs with the trained CNN-BiLSTM-Attention model. The method addresses the insufficient coverage caused by relying on a single feature type and improves the prediction accuracy of the model.

Description

Android malicious program detection method based on deep learning
Technical Field
The invention belongs to the technical field of deep learning and malicious program detection, and relates to an Android malicious program detection method based on deep learning.
Background
With the continuous development of internet technology and the upgrading of smartphones, the smartphone has gradually changed from a luxury into a mass consumer product that nearly everyone owns, and has become a portable smart device integrating functions such as web browsing, video chat, photography, music playback and travel health codes. Smartphones running the Android operating system are especially popular because of their good value for money. As Android has grown rapidly, malicious applications have also proliferated: attackers exploit the openness of the Android system and the lack of regulation in application markets to develop malware capable of stealing private data, incurring fraudulent phone charges, taking remote control of the device and so on, which seriously harms users' interests. Efficient and accurate detection of Android malicious programs is therefore of great significance.
To improve the accuracy of Android malware detection, accurate feature extraction and a well-matched deep learning algorithm are the keys to training the model. APK features comprise static features and dynamic features. Static features can be extracted directly from a decompiled APK file and allow fast detection, but detection accuracy is low for Android software that has been obfuscated or that uses a malformed digital signature. Dynamic features require extracting behavioral characteristics while the software is running; they can capture many kinds of data and are not affected by techniques such as code obfuscation, but detection is slow. Appropriate representative features therefore need to be selected. Machine learning is a common classification approach and is widely used in Android malicious program detection, for example Random Forest and Support Vector Machine (SVM). However, such methods analyze relatively few features when training and testing a model, obtain only shallow features of a sample, require features to be extracted manually and then optimized, and consider only the features themselves while ignoring context. Although machine learning is computationally fast, these problems result in low detection accuracy.
A single CNN model mines the local features of a sample but does not consider the relationship between a word and its context, and does not assign different attention according to how important each feature is for classification, which reduces both the efficiency and the classification accuracy of the model.
Disclosure of Invention
The invention aims to provide a deep learning-based method for detecting Android malicious programs that solves the problem of insufficient coverage caused by relying on a single feature type. It adopts a hybrid multi-model deep learning approach that extracts global context features while also extracting the local features of a sample, and it introduces an attention mechanism that assigns different degrees of attention to different feature information, thereby improving the prediction accuracy of the model.
The technical scheme adopted by the invention is as follows:
A deep learning-based Android malicious program detection method comprises the following specific steps:
Step 1, collecting malicious Android APK samples and normal APK samples, and outputting decompiled files through a decompilation tool;
Step 2, extracting permission features, API features and 3-Gram features from the decompiled files, and converting them into feature vectors;
Step 3, establishing a CNN-BiLSTM-Attention model, and training it using the feature vectors obtained in step 2 as the sample set;
The CNN-BiLSTM-Attention model comprises an input layer, a word embedding layer, a bidirectional long short-term memory (BiLSTM) module fused with an Attention mechanism, a convolutional layer, a fully connected layer and a Softmax layer. During model training, Word2vec is first used in the word embedding layer to vectorize the features; the CNN performs convolution on the feature vector matrix, and max-pooling extracts the maximum values, which are recombined into a feature vector to obtain the local features. The BiLSTM then captures bidirectional dependencies and extracts global context information from the feature vector matrix, and the Attention mechanism highlights important features. To compute the final output, the context hidden states are calculated first; attention probabilities are obtained from the context hidden states and the hidden state at the previous moment; the context hidden states are then weighted by these probabilities and summed, and the final output is obtained after a transformation. The features from the two models are fused, an activation function is introduced so that the network can express nonlinear mappings, and Softmax is used to classify the application samples;
Step 4, for a program to be detected, obtaining its feature vector by the method of step 2 and inputting it into the trained CNN-BiLSTM-Attention model to obtain the detection result.
The invention is also characterized in that:
Specifically, in step 1 the decompilation command apktool.bat d -f [apk file path] [output folder path] generates decompiled files including the AndroidManifest.xml file and smali files.
The permission features are extracted as follows: the statements containing "permission" are extracted from the <uses-permission> tags of the AndroidManifest.xml file, and these statements are the permission features. The API features are extracted by traversing the lines beginning with "invoke" in the smali files. The 3-Gram features are extracted as follows: Dalvik instruction classification rules are applied to the smali files, instructions that do not affect the classification result are removed, and the remaining ones are reduced to instructions that express semantics; the instructions are classified by function, with instructions of the same or similar function placed in one class, and seven letter labels represent the simplified instruction classes. The 3-Gram model then slides a window of length 3 over the label sequence, turning one sequence into multiple segments of length N (here N = 3), which are the 3-Gram features.
The training process of the CNN-BiLSTM-Attention model is as follows:
First, the sample set is taken as input, and Word2vec converts the sample features into feature vectors in the embedding layer. Second, the convolutional neural network selects 128 convolution kernels with sizes of 3, 4 and 5 respectively, performs convolution on the feature vector matrix in the convolutional layer, extracts the maximum value by max-pooling and recombines the results into a feature vector to obtain the local feature $C_{cnn}$. The feature vectors are also input to a BiLSTM network with an attention mechanism to obtain the global context information feature $C_{bilstm}$. Finally, the fully connected layer fuses the two features and introduces an activation function; the model is trained with the Adam gradient descent algorithm, and Softmax is used to classify the application samples.
In the convolutional neural network, assume the size of the convolution kernel is h × v, where h is the height of the convolution kernel and v is the dimension of the word vector; w is the weight matrix, $X_{i:i+h-1}$ is the vector matrix, b is the bias term and f is the activation function. The convolution is computed as:
$$C_i = f(w \cdot X_{i:i+h-1} + b)$$
Max-pooling then selects the maximum value of the convolved feature vector:
$$C_{cnn} = \max\{C_i\}$$
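The two formulas above can be illustrated with a minimal pure-Python sketch for a single convolution kernel. The ReLU activation, the toy input and the kernel values are assumptions made for illustration; the patent text only states that f is an activation function and that 128 kernels of sizes 3, 4 and 5 are used.

```python
def relu(z):
    # Assumed activation f; the source does not name a specific function.
    return max(0.0, z)


def conv_feature_map(X, w, b, f=relu):
    """Compute C_i = f(w . X[i:i+h-1] + b) for every window of h rows.

    X is a list of token vectors (each of dimension v), w is an h x v kernel.
    """
    h, v = len(w), len(w[0])
    return [
        f(sum(w[r][c] * X[i + r][c] for r in range(h) for c in range(v)) + b)
        for i in range(len(X) - h + 1)
    ]


def max_pool(feature_map):
    """C_cnn = max{C_i}: keep the strongest response of this kernel."""
    return max(feature_map)


# Toy input: 4 tokens with v = 2, one kernel with h = 2.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
w = [[1, -1], [0.5, 0.5]]
feature_map = conv_feature_map(X, w, b=0.0)
c_cnn = max_pool(feature_map)
print(feature_map, c_cnn)
```

With many kernels, one pooled value per kernel would be collected and concatenated to form the local feature vector.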
The process of extracting global context information with the bidirectional long short-term memory (BiLSTM) network is as follows. Let $\overrightarrow{h_t}$ be the forward LSTM and $\overleftarrow{h_t}$ the backward LSTM; the output state of the BiLSTM is the combination of the forward and backward directions:
$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$$
Different features influence classification to different degrees. To highlight important features and weaken the influence of irrelevant ones, an Attention mechanism is introduced to assign different degrees of attention to different feature information. Let T be the length of the input vector, $h_j$ the hidden-layer state of the input vector after decoding, and $\alpha_{ij}$ the attention probability of the input word $x_j$ for the current word $y_i$. The global context information feature after attention is assigned, $C_{bilstm}$, is computed as:
$$C_{bilstm} = \sum_{j=1}^{T} \alpha_{ij} h_j$$
where
$$\alpha_{ij} = \frac{\exp(\mathrm{score}(S_{i-1}, h_j))}{\sum_{k=1}^{T} \exp(\mathrm{score}(S_{i-1}, h_k))}$$
and the score is calculated from the state at the previous moment $S_{i-1}$ and $h_j$.
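The attention weighting described above can be sketched in a few lines of Python. This is only an illustration: the dot product used as the score between the previous state and each hidden state is an assumption, since the text says only that the score is computed from $S_{i-1}$ and $h_j$.

```python
import math


def softmax(scores):
    """Normalize scores into attention probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    # Assumed score function; the source does not specify its form.
    return sum(x * y for x, y in zip(a, b))


def attention_context(prev_state, hidden_states, score=dot):
    """Return (alpha, context), where context = sum_j alpha_j * h_j."""
    alpha = softmax([score(prev_state, h) for h in hidden_states])
    dim = len(hidden_states[0])
    context = [
        sum(a * h[k] for a, h in zip(alpha, hidden_states)) for k in range(dim)
    ]
    return alpha, context


hidden = [[1.0, 0.0], [0.0, 1.0]]  # toy BiLSTM hidden states h_1, h_2
alpha, context = attention_context([0.0, 0.0], hidden)
print(alpha, context)
```

Hidden states that score higher against the previous state receive a larger share of the weighted sum, which is how important features are emphasized.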
The invention has the beneficial effects that:
1) Several representative features are selected, so that the features cover the apk file more widely and the influence of code obfuscation on the features is reduced.
2) A multi-model fusion approach from deep learning is adopted: the CNN obtains the local features of a sample and the BiLSTM obtains the global features. Compared with an LSTM, which can only learn forward information, the BiLSTM captures bidirectional dependencies better, so the trained model is more closely tied to the context, and an attention mechanism reduces noise interference and highlights important features. Compared with traditional machine learning, deep learning can mine the deep features of a sample without manual optimization, a classification model with a high detection rate can be trained through the neural network's own learning, and the classification accuracy is significantly higher than that of a single deep learning classification model.
Drawings
FIG. 1 is a structural diagram of the CNN-BiLSTM-Attention model in an example of the method of the present invention.
FIG. 2 is a flow chart of the detection of apk samples by the method of the present invention.
FIG. 3 is a loss curve of Example 1 of the present invention.
FIG. 4 is an accuracy (acc) curve of Example 1 of the present invention.
FIG. 5 is a graph comparing the experimental results of an example of the method of the present invention with other deep learning models such as LSTM, BiLSTM and CNN-BiLSTM.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a deep learning-based Android malicious program detection method, which comprises the following specific steps:
Step 1, collecting malicious Android APK samples and normal APK samples, and outputting decompiled files through a decompilation tool. Specifically, the Apktool decompilation command generates decompiled files including the AndroidManifest.xml file and smali files.
Step 2, extracting permission features, API features and 3-Gram features from the decompiled files and converting them into feature vectors, specifically as follows:
The decompiled AndroidManifest.xml file contains all permissions of the application. Each permission written into this configuration file is wrapped in a <uses-permission> tag and contains an android.
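As an illustration of the permission extraction, the following sketch reads the `android:name` attribute of each <uses-permission> element with Python's standard XML parser. The function name `extract_permissions` and the sample manifest are hypothetical; the patent does not specify how the tags are parsed.

```python
import xml.etree.ElementTree as ET

# Namespace that Android manifests use for android:* attributes.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def extract_permissions(manifest_xml):
    """Return the android:name of every <uses-permission> element, in order."""
    root = ET.fromstring(manifest_xml)
    return [
        elem.get(ANDROID_NS + "name")
        for elem in root.iter("uses-permission")
        if elem.get(ANDROID_NS + "name")
    ]


# Hypothetical manifest fragment used only for demonstration.
sample_manifest = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <application android:label="demo"/>
</manifest>"""

permissions = extract_permissions(sample_manifest)
print(permissions)
# ['android.permission.INTERNET', 'android.permission.SEND_SMS']
```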
The decompiled smali files contain all the API calls of the application. In a smali file, the content between the keywords ".method" and ".end method" is a method definition, and each API call begins with "invoke-". API features can therefore be extracted by traversing the lines in the smali files that begin with "invoke-".
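A minimal sketch of this traversal is shown below. It assumes the usual smali layout in which the callee follows the register list (`invoke-kind {registers}, Lpkg/Class;->method(args)ret`); the function name and the sample smali text are hypothetical.

```python
def extract_api_calls(smali_text):
    """Collect the called method from every line that starts with 'invoke-'."""
    calls = []
    for line in smali_text.splitlines():
        line = line.strip()
        if not line.startswith("invoke-"):
            continue
        # The callee follows the closing brace of the register list.
        _, sep, target = line.rpartition("}, ")
        if sep:
            calls.append(target)
    return calls


# Hypothetical smali fragment used only for demonstration.
sample_smali = """
.method public send()V
    invoke-static {}, Landroid/telephony/SmsManager;->getDefault()Landroid/telephony/SmsManager;
    move-result-object v0
    invoke-virtual {v0, v1}, Landroid/telephony/SmsManager;->sendTextMessage(Ljava/lang/String;)V
    return-void
.end method
"""

api_calls = extract_api_calls(sample_smali)
print(api_calls)
```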
The decompiled smali files also contain all the Dalvik bytecode of the application. Using Dalvik instruction classification rules, instructions that do not affect the classification result are removed and the remaining ones are reduced to instructions that express semantics. The instructions are then classified by function, with instructions of the same or similar function placed in one class, and seven letter labels represent the simplified instruction classes. The 3-Gram model then slides a window of length 3 over the label sequence, turning one sequence into multiple segments of length N (here N = 3), from which the 3-Gram features are extracted.
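The simplification and sliding-window steps can be sketched as follows. The seven labels used here (M, R, G, I, T, P, V for move, return, goto, if, get, put and invoke) are an assumption for illustration; the patent states that seven letter labels are used but does not list them.

```python
# Hypothetical mapping from Dalvik opcode prefixes to seven one-letter classes.
OPCODE_CLASSES = [
    ("move", "M"),
    ("return", "R"),
    ("goto", "G"),
    ("if-", "I"),
    ("aget", "T"), ("iget", "T"), ("sget", "T"),
    ("aput", "P"), ("iput", "P"), ("sput", "P"),
    ("invoke", "V"),
]


def simplify_opcodes(smali_text):
    """Map each instruction to its class label; drop instructions with no class."""
    labels = []
    for line in smali_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        for prefix, label in OPCODE_CLASSES:
            if tokens[0].startswith(prefix):
                labels.append(label)
                break
    return "".join(labels)


def ngrams(sequence, n=3):
    """Slide a window of length n over the sequence, one step at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Hypothetical method body used only for demonstration.
method_body = """
    const/4 v1, 0x0
    invoke-static {}, Lcom/example/Util;->load()Lcom/example/Data;
    move-result-object v0
    iget v2, v0, Lcom/example/Data;->flag:I
    if-eqz v2, :cond_0
    return-void
"""

labels = simplify_opcodes(method_body)
grams = ngrams(labels)
print(labels, grams)
# VMTIR ['VMT', 'MTI', 'TIR']
```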
The permission features, API features and 3-Gram features obtained above are concatenated to form the sample set, and the feature dimension is normalized to a uniform length of 200: sequences longer than 200 are truncated, and sequences shorter than 200 are padded with 0. The features of each sample are thus converted into $X_n = (x_1, x_2, \ldots, x_d)$, which is the input of the CNN-BiLSTM-Attention model.
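The concatenation, truncation and zero-padding can be sketched as below. Mapping each feature token to an integer index through a growing vocabulary is an assumption made for illustration (index 0 is reserved for padding); the source only specifies the fixed length of 200.

```python
def build_sample(permissions, api_calls, grams, vocab, max_len=200):
    """Concatenate the three feature lists and return a fixed-length id list."""
    tokens = list(permissions) + list(api_calls) + list(grams)
    # Assign each unseen token the next free index; 0 is kept for padding.
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]
    ids = ids[:max_len]                         # truncate long samples
    return ids + [0] * (max_len - len(ids))     # pad short samples with 0


vocab = {}
x = build_sample(
    ["android.permission.INTERNET"],
    ["Lcom/example/Util;->load()V"],
    ["VMT", "MTI"],
    vocab,
)
print(len(x), x[:6])
# 200 [1, 2, 3, 4, 0, 0]
```

The same `vocab` dictionary would be reused across samples so that identical tokens always receive the same index.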
Step 3, establishing a CNN-BiLSTM-Attention model, which comprises an input layer, a word embedding layer, a bidirectional long short-term memory (BiLSTM) module fused with an Attention mechanism, a convolutional layer, a fully connected layer and a Softmax layer. Specifically, the CNN and the BiLSTM obtain the local features and the global context features respectively, and Attention makes the model focus more on important features. During model training, Word2vec is first used in the word embedding layer to vectorize the features; the CNN performs convolution on the feature vector matrix, and max-pooling extracts the maximum values, which are recombined into a feature vector to obtain the local features. The BiLSTM then captures bidirectional dependencies and extracts global context information from the feature vector matrix, and the Attention mechanism highlights important features. To compute the final output, the context hidden states are calculated first; attention probabilities are obtained from the context hidden states and the hidden state at the previous moment; the context hidden states are then weighted by these probabilities and summed, and the final output is obtained after a transformation. The features from the two models are fused, an activation function is introduced so that the network can express nonlinear mappings, and Softmax is used to classify the application samples.
To ensure the accuracy of the predictions of the CNN-BiLSTM-Attention model, the model is trained with the sample set obtained in step 2. The specific training process is as follows:
The sample set is taken as input, and Word2vec converts the sample features into feature vectors in the embedding layer. The feature vectors are input both to a convolutional neural network (CNN) and to a bidirectional long short-term memory (BiLSTM) network with an attention mechanism, which extract the local features and the global context information features respectively. The two sets of features are then fused, and Softmax is used to classify the application samples.
The convolutional neural network (CNN) selects 128 convolution kernels with sizes of 3, 4 and 5 respectively, performs convolution on the feature vector matrix in the convolutional layer, extracts the maximum value by max-pooling and recombines the results into a feature vector to obtain the local features, as follows.
The feature vector matrix is taken as the input of the CNN and convolved in the convolutional layer. Assume the size of the convolution kernel is h × v, where h is the height of the convolution kernel and v is the dimension of the word vector; w is the weight matrix, $X_{i:i+h-1}$ is the vector matrix, b is the bias term and f is the activation function. The convolution is computed as:
$$C_i = f(w \cdot X_{i:i+h-1} + b)$$
Max-pooling then selects the maximum value of the convolved feature vector:
$$C_{cnn} = \max\{C_i\}$$
Meanwhile, a bidirectional long short-term memory (BiLSTM) network with an attention mechanism is used to extract global context information, so that the trained model depends not only on the words themselves but also more closely on their context. Let $\overrightarrow{h_t}$ be the forward LSTM and $\overleftarrow{h_t}$ the backward LSTM; the output state of the BiLSTM is the combination of the forward and backward directions:
$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$$
Different features influence classification to different degrees. To highlight important features and weaken the influence of irrelevant ones, an Attention mechanism is introduced to assign different degrees of attention to different feature information. Let T be the length of the input vector, $h_j$ the hidden-layer state of the input vector after decoding, and $\alpha_{ij}$ the attention probability of the input word $x_j$ for the current word $y_i$. The global context information feature after attention is assigned, $C_{bilstm}$, is computed as:
$$C_{bilstm} = \sum_{j=1}^{T} \alpha_{ij} h_j$$
When the features from the two models are fused, the sample feature can be represented as $C = \{C_{cnn}, C_{bilstm}\}$. To enable the network to express nonlinear mappings, an activation function is introduced, and Softmax is used to classify the application samples. The loss is optimized with the Adam gradient descent algorithm; the training parameters are adjusted repeatedly according to the model's evaluation metrics, and the best parameters are selected to produce the CNN-BiLSTM-Attention model with the best classification performance.
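The fusion and classification step can be sketched in pure Python as a forward pass only. The ReLU activation, the single hidden layer and all weight values are assumptions for illustration; training with Adam is not shown.

```python
import math


def dense(x, W, b):
    """Fully connected layer: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def classify(c_cnn, c_bilstm, W1, b1, W2, b2):
    """Fuse C = {C_cnn, C_bilstm}, apply an activation, then Softmax."""
    fused = list(c_cnn) + list(c_bilstm)                   # feature fusion
    hidden = [max(0.0, z) for z in dense(fused, W1, b1)]   # ReLU (assumed)
    logits = dense(hidden, W2, b2)
    m = max(logits)                                        # numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy weights; index 0 = malicious, index 1 = normal, following the label
# convention stated in Example 1 of the source.
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = classify([1.0], [2.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
print(probs)
```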
Step 4, for a program to be detected, obtaining its feature vector by the method of step 2 and inputting it into the trained CNN-BiLSTM-Attention model to obtain the detection result.
The Android malicious program detection process uses a classification method that combines multiple models with an attention mechanism. Three representative features are selected so that the features cover the sample more widely. A convolutional neural network (CNN) obtains the local features of the sample; a bidirectional long short-term memory network (BiLSTM) captures bidirectional dependencies, tying each word more closely to its context; an Attention mechanism highlights important features and weakens the influence of irrelevant ones; finally, the fused features contain both local features and global context features. The deep learning-based Android malicious program detection method therefore meets the requirement on prediction accuracy.
Example 1
In this embodiment, malicious application samples were downloaded from the VirusShare website, giving 2306 malicious APKs in total. The popularity rankings under the different category columns of the Xiaomi app store were traversed, and normal application samples were downloaded in batches with a Python web crawler, giving 2280 benign APKs in total. The method of the invention was then used to detect malicious programs:
Step 1, the APKs are decompiled in batches using Python and the decompilation tool Apktool: the APK files under the malicious and normal sample directories are traversed, and the Apktool decompilation command generates decompiled files such as the AndroidManifest.xml file and smali files.
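A batch decompilation loop of this kind might look like the sketch below. It assumes Apktool 2.x syntax, in which the output folder is passed with `-o` (the command quoted elsewhere in this document passes the output folder positionally), and it assumes the executable is reachable as `apktool` (on Windows this would typically be `apktool.bat`). The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_decompile_command(apk_path, out_dir, apktool="apktool"):
    """Return the Apktool argument list for decompiling one APK.

    Assumption: Apktool 2.x, where -f overwrites an existing output folder
    and -o names the output folder.
    """
    return [apktool, "d", "-f", str(apk_path), "-o", str(out_dir)]


def decompile_all(sample_dir, out_root, run=subprocess.run):
    """Decompile every .apk under sample_dir into out_root/<apk name>."""
    commands = []
    for apk in sorted(Path(sample_dir).rglob("*.apk")):
        cmd = build_decompile_command(apk, Path(out_root) / apk.stem)
        run(cmd, check=True)  # with subprocess.run, raises if Apktool fails
        commands.append(cmd)
    return commands
```

Calling `decompile_all("samples/malicious", "decompiled/malicious")` and then the same for the normal sample directory would mirror the traversal described above; the directory names here are placeholders.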
Step 2, the permission features, API features and 3-Gram features are extracted from the decompiled files. Let the permission features extracted from the malicious and normal application samples be $P_n = (p_{n1}, p_{n2}, \ldots, p_{nw})$, the API features be $A_n = (a_{n1}, a_{n2}, \ldots, a_{nm})$, and the Dalvik bytecode be $D_n = (d_{n1}, d_{n2}, \ldots, d_{nu})$. Using Dalvik instruction classification rules, instructions that do not affect the classification result are removed and the remaining ones are reduced to instructions that express semantics; instructions of the same or similar function are placed in one class, and seven letter labels represent the simplified instruction classes. The 3-Gram model slides a window of length 3 over the label sequence, turning one sequence into multiple segments of length N (here N = 3), which gives the 3-Gram features $G_n = (g_{n1}, g_{n2}, \ldots, g_{nv})$. The features of a sample are then $X_n = \{P_n, A_n, G_n\}$. The feature dimension is normalized to a uniform length of 200: sequences longer than 200 are truncated, and sequences shorter than 200 are padded with 0. The features of each sample are thus converted into $X_n = (x_1, x_2, \ldots, x_d)$, which is the input of the CNN-BiLSTM-Attention model.
Step 3, the embedding layer first converts each input sample with features $X_n = (x_1, x_2, \ldots, x_d)$ into feature vectors. The feature vector matrix can be represented as
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$
where n is the number of samples, d is the number of dimensions (200 in this method), $x_{nd}$ is the d-th feature of the n-th sample, and R is the target value, which takes one of two values: 1 indicates a normal sample and 0 indicates a malicious sample.
The feature vector matrix is taken as the input of the CNN and convolved in the convolutional layer. Assume the size of the convolution kernel is h × v, where h is the height of the convolution kernel and v is the dimension of the word vector; w is the weight matrix, $X_{i:i+h-1}$ is the vector matrix, b is the bias term and f is the activation function. The convolution is computed as:
$$C_i = f(w \cdot X_{i:i+h-1} + b)$$
Max-pooling then selects the maximum value of the convolved feature vector:
$$C_{cnn} = \max\{C_i\}$$
Meanwhile, a bidirectional long short-term memory (BiLSTM) model with an attention mechanism is used to extract global context information, so that the trained model depends not only on the words themselves but also more closely on their context. Let $\overrightarrow{h_t}$ be the forward LSTM and $\overleftarrow{h_t}$ the backward LSTM; the output state of the BiLSTM is the combination of the forward and backward directions:
$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$$
Different features influence classification to different degrees. To highlight important features and weaken the influence of irrelevant ones, an Attention mechanism is introduced to assign different degrees of attention to different feature information. Let T be the length of the input vector, $h_j$ the hidden-layer state of the input vector after decoding, and $\alpha_{ij}$ the attention probability of the input word $x_j$ for the current word $y_i$. The global context information feature after attention is assigned, $C_{bilstm}$, is computed as:
$$C_{bilstm} = \sum_{j=1}^{T} \alpha_{ij} h_j$$
where
$$\alpha_{ij} = \frac{\exp(\mathrm{score}(S_{i-1}, h_j))}{\sum_{k=1}^{T} \exp(\mathrm{score}(S_{i-1}, h_k))}$$
and the score is calculated from the state at the previous moment $S_{i-1}$ and $h_j$. The attention mechanism assigns different degrees of attention according to the importance of different feature information: information closely related to the current word receives more attention, and information with little relation to it receives less. This increases the weight of the main factors in the model and weakens the influence of secondary factors, improving both the efficiency and the accuracy of the model.
The features from the two models are fused, and the sample feature can be represented as $C = \{C_{cnn}, C_{bilstm}\}$. To enable the network to express nonlinear mappings, an activation function is introduced, and Softmax is used to classify the application samples. The loss is optimized with the Adam gradient descent algorithm; the training parameters are adjusted repeatedly according to the model's evaluation metrics, and the best parameters are selected to produce the CNN-BiLSTM-Attention model with the best classification performance.
Step 4, the features of the sample to be detected are extracted by the method of step 2 and input into the trained CNN-BiLSTM-Attention model to obtain the classification result.
FIG. 3 and FIG. 4 show the loss curve and the acc curve. When training begins, both curves oscillate slightly; as the number of iterations increases, the loss curve gradually decreases while the acc curve gradually rises and stabilizes.
FIG. 5 compares the experimental results of the method of the present application (CBA, i.e. CNN-BiLSTM-Attention) with other deep learning models such as LSTM, BiLSTM and CNN-BiLSTM. As can be seen, the acc and $F_1$ of BiLSTM are both higher than those of LSTM, indicating that extracting context information in both directions benefits model performance more than learning only forward information. The acc and $F_1$ of CNN-BiLSTM are both higher than those of BiLSTM, so multi-model fusion yields better performance than a single model. The acc and $F_1$ of CBA are higher than those of CNN-BiLSTM: introducing the attention mechanism makes the model focus more on useful information and greatly improves its performance. In conclusion, the CBA method proposed by the invention, which combines multiple models with an attention mechanism, can effectively detect Android malicious programs, and its detection performance is superior to that of the other compared models.
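For reference, the two metrics compared in FIG. 5 can be computed as in the sketch below. Treating the malicious class (label 0) as the positive class for $F_1$ is an assumption; the source does not say which class the $F_1$ score refers to, and the labels shown are toy values, not results from the patent.

```python
def accuracy_and_f1(y_true, y_pred, positive=0):
    """Return (acc, F1) for binary labels; `positive` is the class scored by F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1


# Toy labels: 0 = malicious, 1 = normal.
acc, f1 = accuracy_and_f1([0, 0, 1, 1], [0, 1, 1, 1])
print(acc, f1)
```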

Claims (6)

1. A method for detecting Android malicious programs based on deep learning is characterized by comprising the following specific steps:
step 1, collecting malicious Android APK samples and normal APK samples, and outputting decompiled files through a decompilation tool;
step 2, extracting permission features, API features and 3-Gram features from the decompiled files, and converting the permission features, the API features and the 3-Gram features into feature vectors;
step 3, establishing a CNN-BiLSTM-Attention model, and training the CNN-BiLSTM-Attention model using the feature vectors obtained in step 2 as a sample set;
the CNN-BiLSTM-Attention model comprises an input layer, a word embedding layer, a bidirectional long short-term memory (BiLSTM) module fused with an attention mechanism, a convolutional layer, a fully connected layer and a Softmax layer; during model training, Word2vec is first used at the word embedding layer to vectorize the features, the CNN performs a convolution operation on the feature vector matrix, and max-pooling is used to extract the maximum values and recombine the feature vectors to obtain local features; then the BiLSTM is used to capture bidirectional dependencies and extract global context information from the feature vector matrix, and the attention mechanism (Attention) is used to highlight important features; the final output is calculated by computing the context hidden state, obtaining the attention probability from the context hidden state and the hidden state at the previous moment, computing the probability-weighted sum of the context hidden states, and transforming the result; the features from the two models are fused, an activation function is introduced to enable the network to express nonlinear mappings, and the application samples are classified using Softmax;
step 4, for the program to be detected, obtaining its feature vectors using the method of step 2, and inputting them into the trained CNN-BiLSTM-Attention model for detection to obtain a detection result.
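The attention calculation described in claim 1 (score each hidden state against the previous state, turn the scores into probabilities, then take the weighted sum) can be sketched as follows. The claim does not specify the score function, so a dot product is assumed here; the function name and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, prev_state):
    """Weight T hidden states h_j by their attention probability and sum them.

    hidden_states: list of T vectors (BiLSTM outputs h_j)
    prev_state:    vector standing in for the previous state S_{i-1}
    The dot-product score is an assumption, not taken from the source.
    """
    scores = [sum(s * h for s, h in zip(prev_state, h_j)) for h_j in hidden_states]
    alphas = softmax(scores)                      # attention probabilities
    dim = len(hidden_states[0])
    context = [sum(a * h_j[d] for a, h_j in zip(alphas, hidden_states))
               for d in range(dim)]               # weighted sum of hidden states
    return alphas, context

# Toy hidden states (T = 3, dimension 2), made up for illustration.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_state = [1.0, 1.0]
alphas, context = attention_pool(hidden_states, prev_state)
print(alphas, context)
```

The third hidden state has the largest score against `prev_state`, so it receives the largest attention probability and dominates the pooled feature.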
2. The deep learning-based Android malware detection method of claim 1, wherein step 1 specifically generates decompiled files including the Android manifest.
3. The Android malicious program detection method based on deep learning of claim 2, wherein the permission features are extracted as follows: statements containing permissions are extracted from the <uses-permission> tags of the Android manifest;
the API features are extracted as follows: lines starting with 'invoke' in the smali files are traversed and extracted;
the 3-Gram features are extracted as follows: using the Dalvik instruction classification rules on the smali files, instructions that do not affect the classification result are removed, the remaining instructions are simplified into instructions that express semantics and are classified by function, with instructions of the same or similar function grouped into one class, and the simplified instruction classes are represented by seven letter labels; data are then acquired with a sliding window of length 3 in the 3-Gram model, turning a statement into multiple segments of length N (here N = 3), namely the 3-Gram features.
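The three extraction steps of claim 3 can be illustrated with a small Python sketch. The opcode-to-letter table below is hypothetical: the claim says seven letter labels are used but does not list them here, so the grouping and letters are assumptions, as are the function names and the sample manifest and smali text.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in manifest files.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical opcode-prefix -> letter table (seven letters, assumed grouping).
OPCODE_LETTERS = {
    "move": "M", "return": "R", "goto": "G", "if-": "I",
    "iget": "T", "sget": "T", "aget": "T",
    "iput": "P", "sput": "P", "aput": "P",
    "invoke": "V",
}

def extract_permissions(manifest_xml):
    """Collect android:name from every <uses-permission> tag."""
    root = ET.fromstring(manifest_xml)
    return [el.get(ANDROID_NS + "name") for el in root.iter("uses-permission")]

def extract_api_calls(smali_text):
    """Keep the invoked method from each line that starts with 'invoke'."""
    calls = []
    for line in smali_text.splitlines():
        line = line.strip()
        if line.startswith("invoke"):
            calls.append(line.split("}, ", 1)[-1])  # text after the register list
    return calls

def simplify(smali_text):
    """Map each recognised Dalvik opcode to its class letter; drop everything else."""
    letters = []
    for line in smali_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        for prefix, letter in OPCODE_LETTERS.items():
            if parts[0].startswith(prefix):
                letters.append(letter)
                break
    return "".join(letters)

def three_grams(sequence, n=3):
    """Slide a window of length n over the letter sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

manifest = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>"""

smali = """
.method public run()V
    const/4 v0, 0x0
    invoke-static {}, Landroid/telephony/SmsManager;->getDefault()Landroid/telephony/SmsManager;
    move-result-object v1
    if-eqz v1, :cond_0
    iget-object v2, p0, Lcom/example/A;->msg:Ljava/lang/String;
    goto :goto_0
    :cond_0
    return-void
.end method
"""

permissions = extract_permissions(manifest)
api_calls = extract_api_calls(smali)
grams = three_grams(simplify(smali))
print(permissions, api_calls, grams)
```

Directives, labels and the `const/4` instruction match no prefix in the table, so they are dropped before the sliding window is applied.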
4. The deep learning-based Android malware detection method of claim 1, wherein the training process of the CNN-BiLSTM-Attention model is as follows:
firstly, the sample set is taken as input, and the sample features are converted into feature vectors at the embedding layer using Word2vec; secondly, the convolutional neural network selects 128 convolution kernels with sizes of 3, 4 and 5 respectively, performs convolution on the feature vector matrix in the convolutional layer, extracts the maximum values by max pooling, and recombines the feature vectors to obtain the local feature $C_{cnn}$; the feature vectors are input into a BiLSTM network with an attention mechanism to obtain the global context information feature $C_{bilstm}$; finally, the fully connected layer fuses the two features and introduces an activation function, training is performed with the Adam gradient descent algorithm, and the application samples are classified using Softmax.
5. The Android malicious program detection method based on deep learning of claim 4, wherein in the convolutional neural network training process, the convolution kernel size is assumed to be h × v, where h is the convolution kernel height, v is the word vector dimension, w is the weight matrix, $X_{i:i+h-1}$ is the vector matrix, b is the bias term, and f is the activation function; the convolution calculation formula is as follows:
$C_i = f(w \cdot X_{i:i+h-1} + b)$
after the CNN convolution, the maximum value of the feature vector is selected by the max pooling method, and the max pooling calculation formula is as follows:
$C_{cnn} = \max\{C_i\}$.
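The two formulas of claim 5 can be checked on a small example. The sketch below assumes ReLU for the activation f, which the claim leaves unspecified, and uses a hypothetical 4 × 2 input matrix with a single 2 × 2 kernel; the model in claim 4 uses 128 kernels per size.

```python
def relu(x):
    """Assumed activation f; the claim only says 'activation function'."""
    return max(0.0, x)

def conv_feature_map(X, w, b, f=relu):
    """Compute C_i = f(w . X_{i:i+h-1} + b) for every window of h rows.

    X: n x v matrix of word vectors (list of rows)
    w: h x v convolution kernel
    """
    h, v = len(w), len(w[0])
    feature_map = []
    for i in range(len(X) - h + 1):
        window = X[i:i + h]
        s = sum(w[r][c] * window[r][c] for r in range(h) for c in range(v))
        feature_map.append(f(s + b))
    return feature_map

def max_pool(feature_map):
    """C_cnn = max{C_i}."""
    return max(feature_map)

# Toy input: n = 4 word vectors of dimension v = 2, one kernel with h = 2.
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
b = -1.0

feature_map = conv_feature_map(X, w, b)
c_cnn = max_pool(feature_map)
print(feature_map, c_cnn)  # [2.0, 0.0, 3.0] 3.0
```

With n = 4 rows and h = 2 there are n - h + 1 = 3 windows, so the feature map has three entries and max pooling keeps the largest.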
6. The Android malicious program detection method based on deep learning of claim 4, wherein a bidirectional long short-term memory network model is used to extract global context information, so that the trained model no longer depends on individual words alone but is closely tied to their context; if
Figure FDA0003906800260000031
is the forward LSTM and
Figure FDA0003906800260000032
is the reverse LSTM, then the output state of the BiLSTM is the combination of the forward and reverse states, and the calculation formula of the BiLSTM is as follows:
Figure FDA0003906800260000033
different features influence classification to different degrees; to highlight important features and weaken the influence of irrelevant ones, an attention mechanism (Attention) is introduced so that different degrees of attention are allocated to different feature information; let T be the length of the input vector, $h_j$ the hidden-layer state of the input vector after decoding, and the attention probability $\alpha_{ij}$ the attention probability of input word $x_j$ for the current word $y_i$; then the global context information feature after attention is assigned,
Figure FDA0003906800260000041
is calculated as follows:
Figure FDA0003906800260000042
wherein
Figure FDA0003906800260000043
and score is calculated from the previous state $S_{i-1}$ and $h_j$.
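The formula images referenced in claim 6 are not available in this text. Assuming they follow the standard BiLSTM concatenation and softmax attention that the surrounding wording describes (forward and reverse states combined, weights $\alpha_{ij}$ derived from a score of $S_{i-1}$ and $h_j$, and a weighted sum over the T hidden states), one plausible reading is the following; the exact form in the original images may differ.

```latex
\begin{align}
h_t &= \left[\overrightarrow{h_t};\ \overleftarrow{h_t}\right] \\
C_{bilstm} &= \sum_{j=1}^{T} \alpha_{ij}\, h_j \\
\alpha_{ij} &= \frac{\exp\bigl(\mathrm{score}(S_{i-1}, h_j)\bigr)}
                   {\sum_{k=1}^{T} \exp\bigl(\mathrm{score}(S_{i-1}, h_k)\bigr)}
\end{align}
```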

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211308160.3A CN115510445A (en) 2022-10-25 2022-10-25 Android malicious program detection method based on deep learning

Publications (1)

Publication Number Publication Date
CN115510445A true CN115510445A (en) 2022-12-23

Family

ID=84512064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211308160.3A Pending CN115510445A (en) 2022-10-25 2022-10-25 Android malicious program detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN115510445A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312861A (en) * 2023-05-09 2023-06-23 济南作为科技有限公司 Denitration system gas concentration prediction method, device, equipment and storage medium
CN117354067A (en) * 2023-12-06 2024-01-05 南京先维信息技术有限公司 Malicious code detection method and system
CN117354067B (en) * 2023-12-06 2024-02-23 南京先维信息技术有限公司 Malicious code detection method and system
CN117972701A (en) * 2024-04-01 2024-05-03 山东省计算中心(国家超级计算济南中心) Anti-confusion malicious code classification method and system based on multi-feature fusion
CN117972701B (en) * 2024-04-01 2024-06-07 山东省计算中心(国家超级计算济南中心) Anti-confusion malicious code classification method and system based on multi-feature fusion
CN118036008A (en) * 2024-04-15 2024-05-14 北京大学 Malicious file disguising detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination