Static detection method for Android malware based on AndroGRU
Technical Field
The invention relates to a static detection method for Android malware, and in particular to a static detection method for Android malware based on AndroGRU.
Background
At present, when static detection is performed on unknown Android malware, existing deep learning techniques are generally applied directly. Because the characteristics of Android malware are not taken into account, the detection effect on Android malware is not as pronounced as the effect these techniques achieve in fields such as image processing.
Disclosure of Invention
To address the above problems, the invention provides a static detection method for Android malware based on AndroGRU. First, the Android APK file is reverse engineered, and the sensitive function call sequence and the Intents features are extracted from it and used as training data for a deep learning model. The sensitive function call sequences of different malware samples exhibit a certain similarity. The invention therefore improves the GRU structure by applying the text similarity principle and proposes a GRU-based Android malware detection model, AndroGRU.
The technical scheme adopted by the invention is as follows:
A static detection method for Android malware based on AndroGRU, characterized by comprising the following steps:
1) decompiling the Android APK file with the Androguard reverse tool, parsing AndroidManifest.xml and extracting the Intents features used by the Android application;
2) using a Python (.py) script to generate a function call graph and extracting sensitive function call sequences from the function call graph;
3) training, by a model training module, the AndroGRU model on the extracted static features, and detecting, by a detection module, unknown Android APK samples with the trained AndroGRU model.
The specific method of step 2) is as follows:
2.1) preprocessing the function call graph: filtering the function call graph with the Androguard reverse tool so that it is reduced to a graph containing only sensitive function calls;
2.2) traversing the sensitive function call graph, extracting a sensitive function call sequence from the graph, and using the extracted sensitive function call sequence as training data.
In step 3), the AndroGRU model is a GRU-based Android malware detection model obtained by combining the text similarity principle with the GRU structure and improving the internal structure of the GRU through an analysis of its gating mechanism.
Step 3) comprises:
3.1) similarity calculation based on input data:
the input data $x_t$ of the GRU is a vectorized representation of the original data. A similarity calculation is performed on the input data of two adjacent GRU units to obtain a similarity $s = \mathrm{sim}(x_{t-1}, x_t)$; from this similarity, the difference information between $x_{t-1}$ and $x_t$ is obtained as $\Delta x = (1 - s)\,x_{t-1}$. The GRU structure that performs similarity calculation on the input data is named InputGRU;
the difference information between the input data of two adjacent GRU units is input, together with the current input data, into the reset gate and the update gate, which control the information transfer process and learn more abstract information from it:
$$z_t = \sigma\left(W_z x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_z h_{t-1} + b_z\right) \tag{3}$$
$$r_t = \sigma\left(W_r x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_r h_{t-1} + b_r\right) \tag{4}$$
for the candidate state $\tilde{h}_t$, the input data $x_t$ is still selected as the input, so that the input information of the current time step is retained;
the hidden state $h_t$ is the information learned from the input data and the hidden state; at time step $t$, the hidden state $h_t$ of the InputGRU is calculated from the update gate output $z_t$, the previous hidden state $h_{t-1}$ and the candidate state $\tilde{h}_t$.
Step 3) further comprises:
3.2) similarity calculation based on hidden states:
at time step $t$, the hidden state $h_{t-1}$ input to the GRU contains all the information of the input data of the first $t-1$ time steps; at time step $t-1$, the hidden state $h_{t-2}$ input to the GRU contains all the information of the input data of the first $t-2$ time steps. Analogously to the similarity calculation based on input data, a similarity calculation is performed on $h_{t-1}$ and $h_{t-2}$: $s = \mathrm{sim}(h_{t-2}, h_{t-1})$, and the difference information between the two is $\Delta h = (1 - s)\,h_{t-2}$. The GRU structure that performs similarity calculation on the hidden state is named HiddenGRU;
the similarity of the hidden states $h_{t-2}$ and $h_{t-1}$ output by the two GRU units preceding the current time step $t$ is calculated to obtain the difference information between them, which is input, together with the currently input hidden state $h_{t-1}$, into the reset gate and the update gate:
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_r\right) \tag{7}$$
$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_z\right) \tag{8}$$
the candidate state $\tilde{h}_t$ and the hidden state $h_t$ are calculated with the same formulas as in the InputGRU structure.
Step 3) further comprises:
3.3) similarity calculation based on input data and hidden states:
performing similarity calculation on the input data and the hidden state simultaneously within the GRU structure allows more abstract information to be learned from the data; the GRU structure that does so is named InputHiddenGRU:
$$r_t = \sigma\left(W_r x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_r h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_r\right) \tag{9}$$
$$z_t = \sigma\left(W_z x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_z h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_z\right) \tag{10}$$
The three GRU structures of 3.1) to 3.3) can be represented abstractly by the following formula:
$$\mathrm{NewGRU} = i \cdot \mathrm{InputGRU} + j \cdot \mathrm{HiddenGRU}, \quad i, j \in \{0, 1\} \tag{11}$$
where NewGRU represents an abstract definition of the GRU structure;
since i and j can only take the values 0 or 1, the above formula represents 4 different GRU structures:
when i is 0 and j is 0, the structure degenerates into the native GRU structure;
when i is 1 and j is 0, the structure is the InputGRU structure, and similarity calculation is performed on the input data only;
when i is 0 and j is 1, the structure is the HiddenGRU structure, and similarity calculation is performed on the hidden state only;
when i is 1 and j is 1, the structure is the InputHiddenGRU structure, and similarity calculation is performed simultaneously on the input data and the hidden state within the GRU structure.
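One possible reading of formula (11), offered here as an interpretation rather than as a formula stated in the source, is that i and j act as switches on the two difference terms of the gate equations. Under that assumption, equations (3), (4), (7), (8), (9) and (10) collapse into a single pair:

```latex
\begin{aligned}
z_t &= \sigma\bigl(W_z x_t + i\,U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1}
      + U_z h_{t-1} + j\,U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_z\bigr) \\
r_t &= \sigma\bigl(W_r x_t + i\,U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1}
      + U_r h_{t-1} + j\,U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_r\bigr)
\end{aligned}
```

Setting i = j = 0 removes both difference terms and leaves the native GRU gates; i = 1, j = 0 gives (3) and (4); i = 0, j = 1 gives (7) and (8); i = j = 1 gives (9) and (10).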
With the above scheme, the invention provides a static detection method for Android malware based on AndroGRU. The method combines the characteristics of Android malware with a deep learning model and can significantly improve the detection effect.
Description of the drawings:
Fig. 1 is an overall architecture diagram of the present invention.
Fig. 2 is a schematic diagram of the internal structure of the InputGRU.
Fig. 3 is a schematic diagram of the internal structure of the HiddenGRU.
Fig. 4 is a schematic diagram of the internal structure of the InputHiddenGRU.
Fig. 5 is a schematic diagram of the AndroGRU model.
Fig. 6 is a schematic diagram of the internal structure of a GRU.
Detailed Description
A static detection method for Android malware based on AndroGRU, characterized by comprising the following steps:
1) decompiling the Android APK file with the Androguard reverse tool, parsing AndroidManifest.xml and extracting the Intents features used by the Android application;
2) using a Python (.py) script to generate a function call graph and extracting sensitive function call sequences from the function call graph;
2.1) preprocessing the function call graph: filtering the function call graph with the Androguard reverse tool so that it is reduced to a graph containing only sensitive function calls;
2.2) traversing the sensitive function call graph, extracting a sensitive function call sequence from the graph, and using the extracted sensitive function call sequence as training data.
3) The model training module trains the AndroGRU model on the extracted static features, and the detection module detects unknown Android APK samples with the trained AndroGRU model.
The AndroGRU model is a GRU-based Android malware detection model obtained by combining the text similarity principle with the GRU structure and improving the internal structure of the GRU through an analysis of its gating mechanism.
3.1) similarity calculation based on input data:
the input data $x_t$ of the GRU is a vectorized representation of the original data. A similarity calculation is performed on the input data of two adjacent GRU units to obtain a similarity $s = \mathrm{sim}(x_{t-1}, x_t)$; from this similarity, the difference information between $x_{t-1}$ and $x_t$ is obtained as $\Delta x = (1 - s)\,x_{t-1}$. The GRU structure that performs similarity calculation on the input data is named InputGRU;
the difference information between the input data of two adjacent GRU units is input, together with the current input data, into the reset gate and the update gate, which control the information transfer process and learn more abstract information from it:
$$z_t = \sigma\left(W_z x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_z h_{t-1} + b_z\right) \tag{3}$$
$$r_t = \sigma\left(W_r x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_r h_{t-1} + b_r\right) \tag{4}$$
All of these models are based on the GRU, which is described as follows:
A recurrent neural network (RNN) is suited to processing time-series data and is widely used in natural language processing. The GRU is a special RNN model that alleviates the vanishing-gradient problem of the native RNN and is widely used in tasks such as text classification. The GRU uses a gating mechanism to control the state of the incoming data without using separate memory cells. There are two gates in the GRU, a reset gate and an update gate, which together control how the GRU learns from the input data and from the hidden state output by the previous GRU unit. The internal structure of the GRU is shown in Fig. 6.
At time step $t$, the hidden state $h_t$ is calculated using the following update gate and reset gate:
$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)$$
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)$$
Here $x_t$ is the input vector at time step $t$ and $\sigma$ is the sigmoid activation function. $W_z$, $W_h$ and $W_r$ are mapping matrices, $U_z$, $U_h$ and $U_r$ are weight matrices, and $b$ is the bias. $x_t$ and $h_{t-1}$ are the inputs of the GRU structure at time step $t$, $r_t$ is the output of the reset gate, and $z_t$ is the output of the update gate. The hidden state $h_t$ is the information learned from the input data and the hidden state, and it is controlled jointly by the reset gate and the update gate: the reset gate determines how much state information from the previous $t-1$ time steps is discarded (the smaller its value, the more information is discarded), and the update gate determines how much state information from the previous $t-1$ time steps is retained. That is, the reset gate and the update gate store and filter information from the input data and the hidden state. The candidate hidden state $\tilde{h}_t$ stores the information contained in the input data at time step $t$ and also the hidden-state information controlled by the reset gate.
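For the candidate state $\tilde{h}_t$ of the InputGRU, the input data $x_t$ is still selected as the input, so that the input information of the current time step is retained.
The hidden state $h_t$ is the information learned from the input data and the hidden state; at time step $t$, the hidden state $h_t$ of the InputGRU is calculated from the update gate output $z_t$, the previous hidden state $h_{t-1}$ and the candidate state $\tilde{h}_t$.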
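For orientation, the candidate state and hidden state of a GRU are commonly written in the following textbook form. This is a sketch of the standard formulation, given as an assumption about what is meant here and not taken from the source; the bias symbol $b_h$ and the element-wise product $\odot$ are notation introduced for this illustration.

```latex
\begin{aligned}
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h\,(r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Some presentations swap the roles of $z_t$ and $1 - z_t$ in the second line; either convention is consistent with the description that the update gate controls how much of the previous state is kept.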
3.2) similarity calculation based on hidden states:
at time step $t$, the hidden state $h_{t-1}$ input to the GRU contains all the information of the input data of the first $t-1$ time steps; at time step $t-1$, the hidden state $h_{t-2}$ input to the GRU contains all the information of the input data of the first $t-2$ time steps. Analogously to the similarity calculation based on input data, a similarity calculation is performed on $h_{t-1}$ and $h_{t-2}$: $s = \mathrm{sim}(h_{t-2}, h_{t-1})$, and the difference information between the two is $\Delta h = (1 - s)\,h_{t-2}$. The GRU structure that performs similarity calculation on the hidden state is named HiddenGRU;
the similarity of the hidden states $h_{t-2}$ and $h_{t-1}$ output by the two GRU units preceding the current time step $t$ is calculated to obtain the difference information between them, which is input, together with the currently input hidden state $h_{t-1}$, into the reset gate and the update gate:
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_r\right) \tag{7}$$
$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_z\right) \tag{8}$$
the candidate state $\tilde{h}_t$ and the hidden state $h_t$ are calculated with the same formulas as in the InputGRU structure.
3.3) similarity calculation based on input data and hidden states:
performing similarity calculation on the input data and the hidden state simultaneously within the GRU structure allows more abstract information to be learned from the data; the GRU structure that does so is named InputHiddenGRU:
$$r_t = \sigma\left(W_r x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_r h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_r\right) \tag{9}$$
$$z_t = \sigma\left(W_z x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_z h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_z\right) \tag{10}$$
The three GRU structures of 3.1) to 3.3) can be represented abstractly by the following formula:
$$\mathrm{NewGRU} = i \cdot \mathrm{InputGRU} + j \cdot \mathrm{HiddenGRU}, \quad i, j \in \{0, 1\} \tag{11}$$
where NewGRU represents an abstract definition of the GRU structure;
since i and j can only take the values 0 or 1, the above formula represents 4 different GRU structures:
when i is 0 and j is 0, the structure degenerates into the native GRU structure;
when i is 1 and j is 0, the structure is the InputGRU structure, and similarity calculation is performed on the input data only;
when i is 0 and j is 1, the structure is the HiddenGRU structure, and similarity calculation is performed on the hidden state only;
when i is 1 and j is 1, the structure is the InputHiddenGRU structure, and similarity calculation is performed simultaneously on the input data and the hidden state within the GRU structure.
Example 1:
1 Overall architecture
The overall architecture of the GRU-based Android malware detection method is shown in Fig. 1 and comprises 4 parts: a data collection module, a static feature extraction module, a model training module and a detection module.
The data collection module collects available Android software data sets, comprising both benign and malicious Android samples, which are usually crawled from platforms such as Android application markets and malware forums. The static feature extraction module extracts the sensitive function call sequences and the Intents static features from the Android APK file: first, the Android APK file is decompiled with the Androguard reverse tool, AndroidManifest.xml is parsed, and the Intents features used by the Android application are extracted; then a Python (.py) script is used to generate a function call graph, from which sensitive function call sequences are extracted. The model training module trains the AndroGRU model on the extracted static features. The detection module detects unknown Android APK samples with the trained AndroGRU model.
2 Static feature extraction
When an Android application is published, the developer packages the application's source files. An Android APK file is a compressed package whose size typically ranges from a few KB to tens of MB; training on it directly would consume considerable computing resources and would not extract the critical information in the APK file well. Therefore, this section takes the internal files of the Android APK as the object of study. An Android APK file generally includes META-INF/, res/, libs/, AndroidManifest.xml, classes.dex, resources.arsc and other files, some of which are not human-readable, so a reverse tool (Androguard) is required to reverse engineer the APK file and extract static information such as function call graphs, control flow graphs, permissions and Intents.
For an Android APK file, AndroidManifest.xml provides the information required to install and execute the application, and classes.dex contains features that describe its behavior. The bytecode file contains information ranging from a coarse level of granularity, such as packages, to a fine level of granularity, such as instructions. Because complicated program analysis is computationally expensive, function-call-level information, which can capture the behavior of the Android application, is extracted instead. This subsection focuses on the sensitive function call sequences and Intents features extracted from malware.
2.1 Extraction of Intents features
Intents serve as a complex message communication system in the Android operating system. Communication within an Android application and between applications is mainly carried out through Intents, which provide an abstract definition of the operation to be executed by the application. An Intent consists of three components: action, category and data. The action component describes the type of operation to be performed, such as MAIN, CALL, BATTERY_LOW, SCREEN_ON and EDIT. Intents need to specify the categories to which they belong, such as LAUNCHER, BROWSABLE and GADGET. The data component provides the data required by the action component; for example, a CALL operation requires a telephone number, and an EDIT operation requires a document or HTTP URL to complete the action. The Intents of an Android application carry rich semantic information, and compared with static features such as permissions, Intents can identify malware more accurately [19].
The method takes all Intents contained in AndroidManifest.xml in the Android APK file as a feature set. Android malware often listens for certain specific Intents in order to trigger malicious behavior directly. A typical example of an Intent used by Android malware is BOOT_COMPLETED, which is used to trigger malicious activity directly after a smart device reboots. Because AndroidManifest.xml is not human-readable in its packaged form, it must be parsed with the Androguard reverse tool before the Intents features can be extracted from it.
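The extraction of Intents from a manifest can be sketched as follows. This is a minimal illustration that assumes the manifest has already been decoded to plain XML text (for example by a reverse tool); the sample manifest, the package name and the helper `extract_intents` are hypothetical and are not part of the described method.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by Android manifest attributes such as android:name.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical, already-decoded manifest used only for illustration.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
  <application>
    <activity android:name=".MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
    <receiver android:name=".BootReceiver">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>
"""


def extract_intents(manifest_xml):
    """Return action and category names declared in intent filters, in document order."""
    root = ET.fromstring(manifest_xml.strip())
    intents = []
    for intent_filter in root.iter("intent-filter"):
        for child in intent_filter:
            if child.tag in ("action", "category"):
                name = child.get(ANDROID_NS + "name")
                if name is not None:
                    intents.append(name)
    return intents


intents = extract_intents(SAMPLE_MANIFEST)
print(intents)
```

The resulting list of Intent names can then be tokenized and vectorized like any other text sequence before being fed to the model.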
However, Intents are only one of the effective features for identifying malicious Android applications. It has been shown experimentally that using the Intents features alone to identify malware is not the best solution, and that the Intents features should be combined with other features. Thus, the present invention additionally uses sensitive function calls as another class of features.
2.2 Extraction of sensitive function call sequences from function call graphs
The present invention focuses on features extracted from a function call graph (FCG). The main reason is that the FCG retains the structural information in the binary file better than, for example, n-gram features. An FCG contains information about the malware code in the form of functions, and it also contains information about the interactions between functions. Static features based on function call graphs provide a powerful representation of malware and have been successfully used to detect malware on Windows systems.
Android applications are typically written in the Java language, and the Java source code is compiled into a classes.dex bytecode file. The bytecode file is therefore also reverse engineered with Androguard, and a function call graph is generated with an Androguard script.
Android malware performs malicious activities on private data by triggering sensitive functions. The sensitive function calls in the function call graph are crucial for malware, because they determine which malicious operations the malware performs. Malware of the same family and its variants generally behave similarly, that is, they call similar sensitive functions. For example, the getVoiceMailNumber() method is called by malware of the Geinimi family, a bot-like malware that is mainly used to steal personal private information and send it to a remote server. Malware calls sensitive functions very frequently. For example, setWifiEnabled() is used to turn on WiFi, which may cause applications to be updated without the user's permission and lead to excessive data usage; Runtime.exec() is used to execute external commands, which may leak the user's information or install other malicious software on the user's device; sendTextMessage(), sendBroadcast() and sendDataMessage() are used to send and receive SMS/MMS messages; getDeviceId() and getSimSerialNumber() are used to access sensitive information on the handset.
Since an FCG typically contains thousands of nodes, analyzing it directly is time consuming and computationally expensive, so the present invention preprocesses the FCG. The FCG is filtered with the Androguard reverse tool and reduced to a graph containing only sensitive function calls. This approach preserves the malicious behavior of the Android application while reducing the complexity of the FCG. In order to keep the relations between function calls in the sensitive function call graph, the invention uses a graph traversal algorithm to traverse the sensitive function call graph, extracts a sensitive function call sequence from the graph, and uses the extracted sensitive function call sequence as training data.
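The filtering and traversal steps can be illustrated with a small sketch. The text does not name the traversal algorithm, so a depth-first traversal is assumed here; the toy call graph, the short sensitive-function list and the helper `sensitive_call_sequence` are hypothetical.

```python
# Hypothetical list of sensitive API names; a real list would be much longer.
SENSITIVE = {"sendTextMessage", "getDeviceId", "Runtime.exec", "setWifiEnabled"}

# Toy function call graph: caller -> list of callees (illustrative only).
CALL_GRAPH = {
    "onCreate": ["initUi", "collectInfo"],
    "initUi": [],
    "collectInfo": ["getDeviceId", "upload"],
    "upload": ["sendTextMessage", "Runtime.exec"],
    "getDeviceId": [],
    "sendTextMessage": [],
    "Runtime.exec": [],
}


def sensitive_call_sequence(graph, entry, sensitive):
    """Depth-first traversal that records only sensitive functions, in visit order."""
    sequence = []
    visited = set()
    stack = [entry]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if node in sensitive:
            sequence.append(node)
        # Push callees in reverse so they are visited in their listed order.
        for callee in reversed(graph.get(node, [])):
            stack.append(callee)
    return sequence


sequence = sensitive_call_sequence(CALL_GRAPH, "onCreate", SENSITIVE)
print(sequence)
```

Recording only the sensitive nodes during the traversal has the same effect as first reducing the graph and then walking it, while keeping the order in which the sensitive calls are reached.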
3 AndroGRU model
Based on the correlation among the sensitive function call sequence features, the invention combines the text similarity principle with the GRU structure, improves the internal structure of the GRU by analyzing its gating mechanism, and proposes a GRU-based Android malware detection model, the AndroGRU model.
Text similarity is the similarity between two texts calculated by a mathematical formula. Text similarity theory is widely applied in fields such as text classification and text clustering. The most common similarity measure is the Euclidean distance, which expresses the similarity of two objects through the distance between them. The Cosine similarity measure, which is based on the angle between two vectors, is also widely used in fields such as text classification. Both the Euclidean distance and the Cosine similarity measure are common methods in machine learning and pattern recognition. They are calculated as follows:
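$$\mathrm{sim}_{\mathrm{Euc}}(h_{t-1}, h_t) = \left[(h_{t-1} - h_t)\cdot(h_{t-1} - h_t)\right]^{1/2} \tag{1}$$
$$\mathrm{sim}_{\mathrm{Cos}}(h_{t-1}, h_t) = \frac{h_t \cdot h_{t-1}}{\|h_t\|\,\|h_{t-1}\|} \tag{2}$$
where $h_t$ and $h_{t-1}$ are vectors of the same dimension, $\|h\|$ is the length of $h$, and $h_t \cdot h_{t-1}$ is the dot product.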
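The two measures can be computed directly from their definitions. The sketch below uses plain Python lists as vectors; the function names are chosen for illustration, and returning 0.0 for a zero vector in the cosine measure is an assumption made here to avoid division by zero.

```python
import math


def euclidean_distance(a, b):
    """Square root of the dot product of (a - b) with itself."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


h_prev = [1.0, 2.0]
h_curr = [2.0, 4.0]
print(euclidean_distance(h_prev, h_curr), cosine_similarity(h_prev, h_curr))
```

Note that the Euclidean measure grows as vectors become less alike, whereas the cosine measure grows as they become more alike, so only a measure of the second kind can be substituted directly for sim in a term of the form (1 - sim).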
For Android malware, different samples in the same family generally call some common sensitive functions, and certain call relations exist among these sensitive functions. The sensitive function sequences obtained by the graph traversal algorithm preserve this calling relationship. Moreover, the sensitive functions called by malware of the same family have a certain similarity, so the text similarity principle is used to describe the correlation between two adjacent features.
Since the sensitive function call sequences and the Intents features are both text-type data, the invention chooses the recurrent neural network model GRU for modeling. The input data and the hidden state of the GRU structure contain different information, and the reset gate and the update gate together control the information transfer process in the GRU structure. Therefore, the present invention performs similarity calculations only on the inputs (input data and hidden state) of the reset gate and the update gate, so that as much information as possible is passed into the GRU structure.
3.1 Similarity calculation based on input data
The input data $x_t$ of the GRU is a vectorized representation of the original data. A similarity calculation is performed on the input data of two adjacent GRU units to obtain a similarity $s = \mathrm{sim}(x_{t-1}, x_t)$. Based on the relationship between information theory and text similarity introduced above, the difference information between $x_{t-1}$ and $x_t$ is obtained from the similarity as $\Delta x = (1 - s)\,x_{t-1}$. For the sake of distinction, the GRU structure that performs similarity calculation on the input data is named InputGRU, as shown in Fig. 2.
In Fig. 2, the position indicated by the wide arrow is the similarity calculation based on the input data $x_t$. Since the reset gate and the update gate both control the information transfer process in the GRU structure, the difference information between the input data of two adjacent GRU units is input to the reset gate and the update gate together with the current input data, so that the gates control the information transfer process and learn more abstract information from it:
$$z_t = \sigma\left(W_z x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_z h_{t-1} + b_z\right) \tag{3}$$
$$r_t = \sigma\left(W_r x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_r h_{t-1} + b_r\right) \tag{4}$$
For the candidate state $\tilde{h}_t$, however, the input data $x_t$ is still selected as the input, in order to keep the input information of the current time step.
The hidden state $h_t$ is the information learned from the input data and the hidden state. Therefore, at time step $t$, the hidden state $h_t$ of the InputGRU is calculated from the update gate output $z_t$, the previous hidden state $h_{t-1}$ and the candidate state $\tilde{h}_t$.
3.2 Similarity calculation based on hidden states
At time step $t$, the hidden state $h_{t-1}$ input to the GRU contains all the information of the input data of the first $t-1$ time steps. Similarly, at time step $t-1$, the hidden state $h_{t-2}$ input to the GRU contains all the information of the input data of the first $t-2$ time steps. Analogously to the similarity calculation based on input data, this subsection performs a similarity calculation on $h_{t-1}$ and $h_{t-2}$: $s = \mathrm{sim}(h_{t-2}, h_{t-1})$, and the difference information between the two is $\Delta h = (1 - s)\,h_{t-2}$. The GRU structure that performs similarity calculation on the hidden state is named HiddenGRU, as shown in Fig. 3.
In Fig. 3, the position indicated by the wide arrow is the similarity calculation based on the hidden state. The similarity of the hidden states $h_{t-2}$ and $h_{t-1}$ output by the two GRU units preceding the current time step $t$ is calculated to obtain the difference information between them, which is input, together with the currently input hidden state $h_{t-1}$, into the reset gate and the update gate:
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_r\right) \tag{7}$$
$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_z\right) \tag{8}$$
The candidate state $\tilde{h}_t$ and the hidden state $h_t$, however, are calculated with the same formulas as in the InputGRU structure.
3.3 Similarity calculation based on input data and hidden states
The first two subsections perform similarity calculation on the input data and on the hidden state separately. Because the information contained in the input data and in the hidden state is different, performing similarity calculation on both simultaneously within the GRU structure allows more abstract information to be learned from the data. The GRU structure that performs similarity calculation on the input data and the hidden state simultaneously is named InputHiddenGRU, as shown in Fig. 4.
In Fig. 4, the positions indicated by the wide black arrows are the similarity calculation based on the hidden state, and the positions indicated by the wide gray arrows are the similarity calculation based on the input data. The reset gate and the update gate are calculated as follows:
$$r_t = \sigma\left(W_r x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_r h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_r\right) \tag{9}$$
$$z_t = \sigma\left(W_z x_t + U_{\Delta x}\,(1 - \mathrm{sim}(x_{t-1}, x_t))\,x_{t-1} + U_z h_{t-1} + U_{\Delta h}\,(1 - \mathrm{sim}(h_{t-2}, h_{t-1}))\,h_{t-2} + b_z\right) \tag{10}$$
In summary, the three different GRU structures can be represented abstractly by the following formula:
$$\mathrm{NewGRU} = i \cdot \mathrm{InputGRU} + j \cdot \mathrm{HiddenGRU}, \quad i, j \in \{0, 1\} \tag{11}$$
where NewGRU represents an abstract definition of the GRU structure. Since i and j can only take the values 0 or 1, the above formula represents 4 different GRU structures:
when i is 0 and j is 0, the structure degenerates into the native GRU structure;
when i is 1 and j is 0, the structure is the InputGRU structure, and similarity calculation is performed on the input data only;
when i is 0 and j is 1, the structure is the HiddenGRU structure, and similarity calculation is performed on the hidden state only;
when i is 1 and j is 1, the structure is the InputHiddenGRU structure, and similarity calculation is performed simultaneously on the input data and the hidden state within the GRU structure.
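The four cases can be sketched as a single gate function in which i and j switch the two difference terms on or off. This is a minimal pure-Python illustration of the gate equations only, not a trained model: cosine similarity is assumed as the sim function, the helper names and the toy 2-dimensional parameters are hypothetical, and a zero vector is assigned similarity 0.0 by convention.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 for a zero vector (an assumption)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def difference(prev, curr):
    """Difference information (1 - sim(prev, curr)) * prev."""
    s = cosine(prev, curr)
    return [(1.0 - s) * p for p in prev]


def gate(W, U, b, U_dx, U_dh, x_t, x_prev, h_prev, h_prev2, i, j):
    """Reset or update gate with optional input (i) and hidden (j) difference terms."""
    pre = add(matvec(W, x_t), matvec(U, h_prev), b)
    if i:
        pre = add(pre, matvec(U_dx, difference(x_prev, x_t)))
    if j:
        pre = add(pre, matvec(U_dh, difference(h_prev2, h_prev)))
    return [1.0 / (1.0 + math.exp(-p)) for p in pre]


# Toy 2-dimensional parameters, chosen only to make the effect visible.
I2 = [[1.0, 0.0], [0.0, 1.0]]
ZERO = [0.0, 0.0]
x_prev, x_t = [0.0, 1.0], [1.0, 0.0]
h_prev2, h_prev = ZERO, ZERO

native = gate(I2, I2, ZERO, I2, I2, x_t, x_prev, h_prev, h_prev2, i=0, j=0)
input_gru = gate(I2, I2, ZERO, I2, I2, x_t, x_prev, h_prev, h_prev2, i=1, j=0)
print(native, input_gru)
```

In this toy case the two consecutive inputs are orthogonal, so the whole previous input is passed on as difference information and the second gate component rises from 0.5 to about 0.73. When consecutive inputs are identical the difference term vanishes and the gate equals the native GRU gate.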
3.4 AndroGRU model
Based on the two different static features, Intents and sensitive function call sequences, the extracted static features are combined with the improved GRU structure, and the invention proposes a GRU-based Android malware detection model, the AndroGRU model, as shown in Fig. 5.
The GRU of the recurrent layer in the model can be any one of the three GRU structures proposed in this section. The model uses a separate GRU to train on each kind of feature, merges the learned information through a fully connected layer, and finally makes a prediction through a SoftMax layer, thereby judging whether an unknown Android application is malicious.
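The fusion and prediction stage can be sketched as follows, assuming each GRU branch has already produced a final hidden state. The vectors, weights and the helper `fuse_and_classify` are hypothetical values chosen for illustration; in practice the weights would be learned during training.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(h_calls, h_intents, weights, bias):
    """Concatenate both branch outputs, apply one dense layer, then softmax."""
    features = h_calls + h_intents
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical final hidden states of the two GRU branches.
h_calls = [0.2, -0.1]   # branch fed with sensitive function call sequences
h_intents = [0.4]       # branch fed with Intents features

# Hypothetical dense-layer parameters: row 0 = benign, row 1 = malicious.
weights = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
bias = [0.0, 0.0]

probs = fuse_and_classify(h_calls, h_intents, weights, bias)
print(probs)
```

The class with the larger probability is taken as the prediction, so with these toy values the sample would be labeled malicious.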