CN113901465A

CN113901465A - Heterogeneous network-based Android malicious software detection method

Info

Publication number: CN113901465A
Application number: CN202111034077.7A
Authority: CN
Inventors: 崔艳鹏; 胡建伟; 于昆
Original assignee: Chengdu Xidian Network Security Research Institute; Xi'an Humen Network Technology Co ltd
Current assignee: Chengdu Xidian Network Security Research Institute; Xi'an Humen Network Technology Co ltd
Priority date: 2021-09-03
Filing date: 2021-09-03
Publication date: 2022-01-07
Also published as: CN116010947A

Abstract

The invention discloses a heterogeneous network-based Android malware detection method. A compiling module decompiles all test samples with Apktool; API call information of several kinds is extracted as matrix elements according to the node and edge relations of a heterogeneous network; the BM25 method reduces the dimension of the API call information, after which the matrices are constructed; a sampling and classification module performs random walks over the matrices, feeds the resulting node sequences to a Skip-gram model to obtain embedded representations, and classifies and detects malware with an SVM algorithm. The random walks follow predefined meta-paths, so the node sequences carry rich semantics, and the embeddings generated by the Skip-gram method from the walk sequences of different application nodes allow malware to be detected and classified.

Description

Heterogeneous network-based Android malicious software detection method

Technical Field

The invention belongs to the technical field of network security, and relates to an Android malicious software detection method based on a heterogeneous network.

Background

Android is the most widely used smart mobile platform in the world and is an open platform built on Linux. Since Google introduced the first-generation Android operating system in 2007, its share of the mobile device market has grown rapidly over roughly ten years. As smart devices have become more popular, ever more applications (APPs) have been released; the wide range of APPs with different functions meets many daily needs, but the security risks have also gradually increased.

Android malware activity in 2020 was higher than had been expected at the end of 2019, and detections of some known types of malware increased markedly, including Trojans, fake advertisements and fake banking software. Such software typically first requests permission to display over other applications and may also request permission to install applications from unknown sources. Once the user grants these permissions, the malware can display advertisements on top of other applications and install further malware from third-party application stores. Within minutes of obtaining the permissions, advertisements may appear in various forms: opening the default web browser on an advertising website; popping up an advertisement in the notifications; or even forging the notification bar so that the user opens the advertisement by clicking unknowingly.

Most work models information networks as homogeneous information networks (homogeneous networks for short), i.e. networks that contain only one type of object and one type of link, such as social networks and friend circles. Homogeneous modeling usually extracts only part of the information in the actual interacting system and does not distinguish the heterogeneity of objects and of the relations between them, which causes irreversible information loss. Traditional Android behavior modeling considers only the interconnections among API calls and ignores the rich semantics among them, such as the package an API belongs to and the code block in which it appears.

In recent years, more researchers have modeled interconnected data of various types as heterogeneous information networks (heterogeneous networks for short), which gives a more complete and natural abstraction of the real world. Combined with API call sequences, a program modeled as a heterogeneous graph carries richer semantics, so Android malware detection based on heterogeneous graphs has become a research hotspot. Ye et al. proposed HinDroid, which first constructs a heterogeneous graph to model the complex relations between APIs and Android applications. APKs and APIs are the nodes, and the relations among them are the edges. Using heterogeneous network modeling, the method organizes the structural information among API calls into three kinds of relations: the inclusion relation (A), the code block relation (B) and the package relation (P). In the example graph there are four APK nodes and four API nodes; the APIs used in an APK are indicated by ellipses, and code block and package relations exist between APIs. Defining APKs and APIs as nodes and the inclusion, code block and package relations as edges yields a heterogeneous graph of Android software.

On the constructed Android heterogeneous graph, a Skip-gram model is used to maximize the conditional probability of each application node: for each application node $a_i$, the context $N_t(a_i)$ formed by several walk sequences is used to maximize the network likelihood, which yields a learned representation of $a_i$ in the Android heterogeneous network. Finally, the embeddings generated by the Skip-gram model are classified with the SVM method, so that different types of malware are detected and classified in a supervised manner. However, because obfuscated calls are not distinguished from normal ones, a great deal of overhead is incurred.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a heterogeneous network-based Android malware detection method that reduces detection overhead and improves detection accuracy.

The invention is realized by the following technical scheme:

A heterogeneous network-based Android malware detection method comprises the following operations:

1) a compiling module decodes and decompiles the samples to be tested so as to extract the Smali code of all test samples;

2) a feature construction module identifies and organizes the features in the Smali code and extracts API call information of several kinds from the test samples, based on the node and edge relations of a heterogeneous network;

the extracted API call information is then reduced in dimension by the BM25 method, which filters out obfuscated semantics and selects the influential API call information, and the selected information is used as elements to generate the feature matrices;

3) a detection and classification module performs random walks over the feature matrices along defined meta-paths that express semantic features, feeds the resulting walk sequences to a Skip-gram model to obtain embedded representations, and then uses the SVM method to detect malware from those embeddings.

The compiling module decodes and decompiles all test samples with Apktool, keeps only files with the .smali suffix, and deletes the remaining files.

The feature construction module extracts, from the Smali code, the APIs and the inclusion relation, code block relation or package relation that exists between APKs and APIs, and characterizes the connection paths of APKs in the heterogeneous network by meta-paths;

the inclusion relation is: an API call that begins with invoke in the decompiled Smali code of a single APK;

the code block relation is: a pair of API calls that occur between the same pair of ".method" and ".end method" in the Smali code;

the package relation is: a pair of API calls in the Smali code that have the same package name;

when the inclusion relation, the code block relation and the package relation are abbreviated as the A relation, B relation and P relation respectively, a concrete instance of the meta-path (ABA) is the path $(APK_1, API_i, API_j, APK_2)$. It indicates that $APK_1$ contains a call to $API_i$, that $APK_2$ contains a call to $API_j$, and that $API_i$ and $API_j$ are linked by the middle relation of the meta-path, here the B relation.
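The three relation definitions above can be sketched in Python as follows. This is a minimal illustration, not the invention's implementation: the regular expression, the function names `package_of` and `extract_relations`, and the sample Smali text are assumptions introduced here.

```python
import re
from itertools import combinations

# Assumed pattern: an "invoke-*" instruction followed by "Lpkg/Class;->method".
INVOKE_RE = re.compile(r"^\s*invoke-[a-z/\-]+\s+\{[^}]*\},\s*(L[^;]+;->[^(\s]+)")


def package_of(api):
    """Return the package part of 'Lpkg/Class;->method', e.g. 'Landroid/telephony'."""
    class_name = api.split(";->", 1)[0]
    return class_name.rsplit("/", 1)[0]


def extract_relations(smali_text):
    """Return (A, B, P): the invoked APIs, same-code-block pairs, same-package pairs."""
    apis = set()
    same_block = set()
    current = None  # APIs seen inside the current .method ... .end method block
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            current = set()
        elif stripped.startswith(".end method"):
            if current is not None:
                same_block.update(combinations(sorted(current), 2))
            current = None
        else:
            match = INVOKE_RE.match(line)
            if match:
                apis.add(match.group(1))
                if current is not None:
                    current.add(match.group(1))
    same_package = {
        (a, b) for a, b in combinations(sorted(apis), 2)
        if package_of(a) == package_of(b)
    }
    return apis, same_block, same_package


# Hypothetical Smali fragment used only for illustration.
SAMPLE = """
.method public onCreate()V
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
    invoke-virtual {v1, v2}, Landroid/telephony/SmsManager;->getDefault()Landroid/telephony/SmsManager;
.end method
.method public other()V
    invoke-static {}, Ljava/lang/System;->currentTimeMillis()J
.end method
"""

APIS, SAME_BLOCK, SAME_PACKAGE = extract_relations(SAMPLE)
```

In this fragment the two `android/telephony` calls share both a code block and a package, while the `java/lang` call is related to neither.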

The BM25 method reduces the dimension of the extracted API call information and filters out encrypted or obfuscated calls in the features, i.e. it removes API call information that contains special characters or incomplete names;

the screening of influential API call information comprises the following operations:

101) the aggregated APIs extracted from all decompiled APKs form the document set D, the APIs extracted from each APK form an individual document d in D, and a single API in d is regarded as a morpheme q;

102) the morpheme is segmented into words t, and the inverse document frequency is computed from d and t to obtain the weight of the morpheme q;

103) the relevance score between the morpheme q and the document d is computed, until every word of the current morpheme has been processed;

104) 102) and 103) are repeated until the relevance score of every morpheme in every document d has been computed, with the scores of all words combined by weighted summation; the several APIs with the highest sums are taken as the influential API call information of the individual document d.
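Steps 101) to 104) can be illustrated with a small BM25 scorer. The sketch below is one possible reading of the steps: the tokenization rule, the parameters `k1` and `b`, the non-negative IDF variant and the names `tokenize` and `bm25_top_apis` are all assumptions, since the text does not fix them.

```python
import math
import re
from collections import Counter


def tokenize(api):
    """Split a morpheme (one API string) into words t; the split rule is assumed."""
    body = api[1:] if api.startswith("L") else api
    return [w for w in re.split(r"[^A-Za-z0-9_]+", body) if w]


def bm25_top_apis(documents, top_k=2, k1=1.5, b=0.75):
    """documents: one list of API strings per APK. Returns the top_k APIs per document."""
    doc_words = [Counter(w for api in doc for w in tokenize(api)) for doc in documents]
    n_docs = len(documents)
    avg_len = sum(sum(c.values()) for c in doc_words) / n_docs
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for counts in doc_words for w in counts)

    def idf(word):
        n = df[word]
        return math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)

    selected = []
    for doc, counts in zip(documents, doc_words):
        length = sum(counts.values())
        scores = {}
        for api in set(doc):
            total = 0.0
            for word in tokenize(api):
                f = counts[word]
                total += idf(word) * f * (k1 + 1) / (f + k1 * (1 - b + b * length / avg_len))
            scores[api] = total  # sum of the word scores of this morpheme
        ranked = sorted(scores, key=lambda a: (-scores[a], a))
        selected.append(ranked[:top_k])
    return selected


# Hypothetical documents: one API list per decompiled APK.
DOCS = [
    ["Landroid/telephony/SmsManager;->sendTextMessage",
     "Ljava/lang/StringBuilder;->append",
     "Ljava/lang/StringBuilder;->append"],
    ["Ljava/lang/StringBuilder;->append", "Ljava/lang/StringBuilder;->toString"],
    ["Ljava/lang/StringBuilder;->append"],
]
TOP = bm25_top_apis(DOCS, top_k=1)
```

Words that occur in every document, such as those of `StringBuilder;->append`, receive a low inverse document frequency, so the rarer APIs rank first within each document.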

The generation of the feature matrix is as follows:

a Contain matrix, denoted $M_A$, is generated from the Contain information in the features, where each element

$M_{A_{ij}} = a_{ij} \in \{0, 1\}$

takes the value 1 if $APK_i$ contains a call to $API_j$, and 0 otherwise;

a CodeBlock matrix, denoted $M_B$, is generated from the CodeBlock information in the features, where each element

$M_{B_{ij}} = b_{ij} \in \{0, 1\}$

takes the value 1 if the pair of API calls $API_i$ and $API_j$ belongs to the same code block;

a Package matrix, denoted $M_P$, is generated from the Package information in the features, where each element

$M_{P_{ij}} = p_{ij} \in \{0, 1\}$

takes the value 1 if the pair of API calls $API_i$ and $API_j$ belongs to the same package.
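The three 0/1 matrices can be built from extracted relations along the following lines. This is a sketch under assumptions: the input shapes (`apk_apis` as a mapping from APK name to a set of APIs, the pair sets for code blocks and packages) and the function name `build_matrices` are hypothetical, and the diagonals of the API-to-API matrices are simply left at 0.

```python
def build_matrices(apk_apis, block_pairs, package_pairs):
    """Return (apks, apis, M_A, M_B, M_P) as nested lists of 0/1 values."""
    apks = sorted(apk_apis)
    apis = sorted({api for used in apk_apis.values() for api in used})
    index = {api: j for j, api in enumerate(apis)}

    # M_A[i][j] = 1 if APK i contains a call to API j.
    m_a = [[1 if api in apk_apis[apk] else 0 for api in apis] for apk in apks]

    def symmetric(pairs):
        # API-by-API matrix with 1 where the pair is related (diagonal left at 0).
        matrix = [[0] * len(apis) for _ in apis]
        for a, b in pairs:
            if a in index and b in index:
                matrix[index[a]][index[b]] = 1
                matrix[index[b]][index[a]] = 1
        return matrix

    return apks, apis, m_a, symmetric(block_pairs), symmetric(package_pairs)


# Hypothetical toy input: two APKs and three APIs named x, y, z.
APKS, APIS, M_A, M_B, M_P = build_matrices(
    {"apk1": {"x", "y"}, "apk2": {"y", "z"}},
    block_pairs={("x", "y")},
    package_pairs={("y", "z")},
)
```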

The random walk uses the Matrix2Vec random walk sampling method: given the application sequence and the walk rule sequence as input, it performs rule-guided random walks for every application in the application sequence and finally outputs the random walk sample sequences of each application;

for the resulting random walk sample sequences, a Skip-gram model is used to maximize the conditional probability of each application node: for each application $a_i$, the context $N_t(a_i)$ formed by its walk sequences is used to maximize the conditional probability, which yields a learned representation of the application node $a_i$ in the Android heterogeneous network.

The Matrix2Vec random walk sampling method comprises the following operations:

1) the application sequence $a_1, a_2, \dots, a_i, \dots, a_t, \dots, a_n$ is taken as input, and the number of walks m is set for each application;

2) a walk rule sequence $R = R_1 \circ R_2 \circ \dots \circ R_{l-1}$ is input;

for application $a_i$ in the feature matrix $M_A$, the walk randomly moves to a row-column entry $M_{ij}$, where the value of $M_{ij}$ indicates whether the corresponding relation exists for application $a_i$ in the matrix M: 1 means it exists and 0 means it does not;

3) if and only if the application sequence and the walk rule sequence have both been exhausted, the m · n random walk sample sequences are output.
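A rule-guided walk over 0/1 matrices could look like the sketch below. It is an interpretation rather than the Matrix2Vec procedure itself: the representation of each rule as a `(node_type, matrix)` pair, the early stop when a row has no entry equal to 1, the fixed seed and the name `matrix2vec_walks` are assumptions made here.

```python
import random


def matrix2vec_walks(apps, rules, m, seed=0):
    """For each application row index in apps, sample m walks that follow rules in order.

    rules is a list of (node_type, matrix) pairs; at each step the walk moves from the
    current row to a randomly chosen column whose entry is 1.
    """
    rng = random.Random(seed)
    walks = []
    for app in apps:
        for _ in range(m):
            walk = [("apk", app)]
            current = app
            for node_type, matrix in rules:
                neighbours = [j for j, value in enumerate(matrix[current]) if value == 1]
                if not neighbours:
                    break  # no admissible move under this rule: stop this walk early
                current = rng.choice(neighbours)
                walk.append((node_type, current))
            walks.append(walk)
    return walks  # m * n sequences for n applications


# Hypothetical matrices for the meta-path (ABA): APK -> API -> API -> APK.
M_A = [[1, 1, 0], [0, 1, 1]]
M_B = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
M_A_T = [list(col) for col in zip(*M_A)]
RULES = [("api", M_A), ("api", M_B), ("apk", M_A_T)]
WALKS = matrix2vec_walks(apps=[0, 1], rules=RULES, m=3)
```

Because every step is restricted to entries equal to 1, each sampled sequence is a valid instance of the chosen meta-path or a prefix of one.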

Maximizing the conditional probability means that, for a given θ, Softmax regression defines the predicted conditional probability of class c as $p(c_t \mid v; \theta)$, where $c_t$ ranges over the c categories in θ and v denotes the current vertex, i.e. the starting point of the walk sequence;

the maximum conditional probability calculation formula is as follows:

Malware is classified and detected with the SVM method as follows:

the embeddings produced by the Skip-gram model are projected onto a two-dimensional plane and classified with an SVM (support vector machine), which separates benign software from malware on the basis of their software features.

Compared with the prior art, the invention has the following beneficial technical effects:

String-form features are easily tampered with by obfuscation techniques. To make the features more robust, the method extracts them according to the node and edge relations of a heterogeneous network. Further, by analyzing how often normal semantics and obfuscated semantics occur in the features, the BM25 method is applied to filter out obfuscated semantics and select a more accurate set of features. To address the high cost, time and memory consumption of storing high-dimensional features, different storage methods are provided according to the characteristics of the different matrices, and the new matrix storage improves the model to a certain extent.

Walk models cannot identify semantics, and the corpora they build add to the burden of the model. The invention therefore adopts Matrix2Vec, an improved meta-path-based random walk method, which walks over the matrices along predefined meta-paths and builds node sequences that carry rich semantics; embeddings are then generated with the Skip-gram method from the walk sequences of different application nodes, so that malware is detected and classified.

Drawings

FIG. 1 is a diagram of heterogeneous network modeling based on APK and API;

FIG. 2 is a schematic view of the detection process of the present invention;

FIG. 3 is a flow chart of a random walk sampling method of the present invention;

FIG. 4 is a graph of detection accuracy of the present invention;

FIG. 5 is a graph showing the comparison of the model efficiencies of the present invention and the HinDroid method.

Detailed Description

The present invention will now be described in further detail with reference to the following examples, which are intended to be illustrative, but not limiting, of the invention.

Referring to fig. 1 and 2, a heterogeneous network-based Android malware detection method includes the following operations:

Each step is specifically described below.

1. Decoding and decompiling

Android applications use API calls to access operating system functions and system resources, so API calls can serve as a representation of application behavior. To extract the API calls, the Android application is first unpacked to obtain its DEX file, which is then decompiled into Smali code with the reverse engineering tool Apktool.

The compiling module decompiles all test samples with Apktool as follows:

all test samples are decompiled with Apktool, only files with the .smali suffix are kept, and the other redundant files are deleted.
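The decompile-and-prune step can be scripted roughly as follows. This is a sketch that assumes Apktool is installed and available on the PATH as `apktool`; the function names and the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def apktool_command(apk_path, out_dir):
    """Build the Apktool decode command (d = decode, -f = overwrite, -o = output dir)."""
    return ["apktool", "d", "-f", "-o", str(out_dir), str(apk_path)]


def keep_only_smali(out_dir):
    """Delete every file under out_dir whose suffix is not .smali; return the kept files."""
    kept = []
    for path in sorted(Path(out_dir).rglob("*")):
        if path.is_file():
            if path.suffix == ".smali":
                kept.append(path)
            else:
                path.unlink()
    return kept


def decompile(apk_path, out_dir):
    """Decode one APK with Apktool, then prune all non-Smali files."""
    subprocess.run(apktool_command(apk_path, out_dir), check=True)
    return keep_only_smali(out_dir)
```

Calling `decompile("sample.apk", "out/sample")` for each test sample would leave only the `.smali` files behind; empty directories are not removed in this sketch.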

2. Feature extraction

Taking the heterogeneous network formalized in fig. 1 as an example, there are four APK nodes and four API nodes (all drawn as circles); the APIs used in an APK are indicated by ellipses (inclusion relation), and code block and package relations exist between APIs. The APKs and APIs are defined as nodes, and the three forms of inclusion, code block and package are defined as relations, which can also be called edges.

Two objects in a heterogeneous network can be connected through different meta-paths. A meta-path essentially extracts a substructure of the heterogeneous network and embodies the rich semantic information contained in the path, so meta-paths have become a basic means of capturing semantics in heterogeneous network analysis.

Therefore, the feature construction module extracts, from the Smali code, the APIs and the inclusion relation, code block relation or package relation that exists between APKs and APIs, and characterizes the connection paths of APKs in the heterogeneous network by meta-paths;

if the inclusion, code block and package relations are abbreviated as the A relation, B relation and P relation, a concrete instance of the meta-path (ABA) is the path $(APK_1, API_i, API_j, APK_2)$, which indicates that $APK_1$ contains a call to $API_i$, that $APK_2$ contains a call to $API_j$, and that $API_i$ and $API_j$ are linked by the middle relation of the meta-path, here the B relation.
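To make the meta-path concrete, the number of path instances between two APKs can be counted with a matrix product, a standard construction in heterogeneous network analysis. The sketch below is illustrative only and is not stated in the text; the helper names and the toy matrices are assumptions.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def metapath_counts(m_a, middle):
    """Count instances of APK -> API -> API -> APK, i.e. M_A * middle * M_A^T."""
    return matmul(matmul(m_a, middle), transpose(m_a))


# Hypothetical toy data: 2 APKs, 3 APIs; APIs 0 and 1 share a code block.
M_A = [[1, 1, 0], [0, 1, 1]]
M_B = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
COUNTS = metapath_counts(M_A, M_B)
```

Here `COUNTS[0][1]` is 1: the only (ABA) instance linking the two APKs goes from the first APK through API 0 and API 1 to the second APK.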

3. Feature screening

The converted Smali code can be analyzed to extract API calls. Because the generated API calls are extracted directly, no check is made as to whether an API is encrypted or obfuscated. Inspection of the extraction table shows that the Smali files of only 10 decompiled APKs contain 8 million API calls (including duplicates), among them encrypted or obfuscated calls of the form "La/j;->i", which are unreasonable for the subsequent matrix construction and its time overhead.

To solve this problem, the BM25 method is used to reduce the dimension of the features and to filter out encrypted or obfuscated calls, for example API calls with special characters or incomplete names such as "La/j$a;-><init>" and "La/j;->i".

Based on the BM25 algorithm, the screening steps are as follows:

1) The aggregated APIs extracted from all decompiled APKs form the document set D, the APIs extracted from each APK form an individual document d in D, and a single API in d is regarded as a morpheme q.

2) The morpheme is segmented into words t, and the inverse document frequency is computed from d and t to obtain the weight of the morpheme q.

3) The relevance score between the morpheme q and the document d is computed, until every word of the current morpheme has been processed.

4) Steps 2 and 3 are repeated until the relevance score of every morpheme in every document d has been computed, with the scores of all words combined by weighted summation; the several APIs with the highest sums are taken as the influential API call information of the individual document d.

4. Matrix generation

After feature screening the dimension of the features is greatly reduced, so matrices are used to store them, defined as follows.

For the Contain information, a Contain matrix, denoted $M_A$, is generated, where each element

$M_{A_{ij}} = a_{ij} \in \{0, 1\}$

takes the value 1 if $APK_i$ contains a call to $API_j$, and 0 otherwise.

For the CodeBlock information, a CodeBlock matrix, denoted $M_B$, is generated, where each element

$M_{B_{ij}} = b_{ij} \in \{0, 1\}$

takes the value 1 if the pair of API calls $API_i$ and $API_j$ belongs to the same code block.

For the Package information, a Package matrix, denoted $M_P$, is generated, where each element

$M_{P_{ij}} = p_{ij} \in \{0, 1\}$

takes the value 1 if the pair of API calls $API_i$ and $API_j$ belongs to the same package.

Constructing these matrices yields a feature representation that contains rich information.

5. Random walk

The random walk uses the Matrix2Vec random walk sampling method: given the application sequence and the walk rule sequence as input, it performs rule-guided random walks for every application in the application sequence and finally outputs the walk sequences of each application. The method changes the walk value of the random walk algorithm from a random value to a determinable range, which makes the algorithm more stable and widens the range of application of the original method. The flow chart of the method is shown in FIG. 3, and its steps are briefly as follows:

1) The application sequence $a_1, a_2, \dots, a_i, \dots, a_t, \dots, a_n$ is taken as input, and the number of walks m is set for each application.

2) A walk rule sequence $R = R_1 \circ R_2 \circ \dots \circ R_{l-1}$ is input.

For application $a_i$ in the matrix $M_A$, the walk randomly moves to a row-column entry $M_{ij}$, where the value of $M_{ij}$ indicates whether the corresponding relation exists for application $a_i$ in the matrix M: 1 means it exists and 0 means it does not.

For the resulting m · n random walk sample sequences, a Skip-gram model is used to maximize the conditional probability of each application node: for each application $a_i$, the context $N_t(a_i)$ formed by its walk sequences is used to maximize the network likelihood, which yields a learned representation of the application node $a_i$ in the Android heterogeneous network.
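How walk sequences feed a Skip-gram model can be sketched in two small functions: one that turns a walk into (center, context) training pairs, and one that evaluates a softmax probability from embedding vectors. The window size, the toy embeddings and the function names are assumptions; a real model would learn the embeddings by gradient ascent on the log-probabilities.

```python
import math


def skipgram_pairs(walks, window=2):
    """Return (center, context) pairs for every node and its neighbours within window."""
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            start = max(0, i - window)
            stop = min(len(walk), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((center, walk[j]))
    return pairs


def softmax_prob(context, center, embeddings):
    """p(context | center) as a softmax over dot products with every node embedding."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    scores = {node: math.exp(dot(vec, embeddings[center])) for node, vec in embeddings.items()}
    return scores[context] / sum(scores.values())


# Hypothetical walk and embeddings used only for illustration.
PAIRS = skipgram_pairs([["apk1", "api_x", "api_y", "apk2"]], window=1)
EMB = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
TOTAL = sum(softmax_prob(node, "a", EMB) for node in EMB)
```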

Softmax regression, also called multiclass logistic regression, generalizes logistic regression to multi-class problems.

With the Softmax function, for a multi-class model the conditional probability that Softmax regression predicts for class c is defined as $p(c_t \mid v; \theta)$, where $c_t$ ranges over the c categories in θ and v denotes the current vertex.

The maximum conditional probability formula is as follows:
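As a reference point, the softmax form commonly used with heterogeneous Skip-gram models over meta-path walks is sketched below. Whether the invention uses exactly this form is an assumption; the embedding vectors $X_v$, the node set $V$, the set of node types $T_V$ and the neighbourhood $N_t(v)$ are notation introduced here for illustration.

```latex
p(c_t \mid v; \theta) = \frac{\exp(X_{c_t} \cdot X_v)}{\sum_{u \in V} \exp(X_u \cdot X_v)}

\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta)
```

The first expression is the probability of observing the context node $c_t$ given the vertex $v$; the second sums its logarithm over all vertices and their typed neighbourhoods to give the objective that is maximized.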

6. Detection and classification

The embeddings generated by the Skip-gram model are classified with an SVM (support vector machine), so that different types of malware are detected and classified in a supervised manner.

An SVM training algorithm builds a model that assigns each new instance to one of two classes, making it a non-probabilistic binary linear classifier; its generalization error is low, its computational cost is small, and its results are easy to interpret.

SVMs can analyze data, recognize patterns, and perform classification and regression analysis. Each training sample is labeled as belonging to one of two categories. Because the embeddings are obtained with Skip-gram and the features of different software differ, projecting the embeddings onto a two-dimensional plane allows the SVM to separate benign software from malware well.
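The classification step can be illustrated with a tiny linear SVM trained by subgradient descent on the hinge loss. This is a sketch under assumptions: the two-dimensional points stand in for projected embeddings, the hyperparameters are arbitrary, and the experiments described below rely on library implementations rather than this code.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularised hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]  # shrink step from the regulariser
            if margin < 1:  # point is inside the margin: move the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (here: malware) or -1 (here: benign) for one embedded sample."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical 2-D embeddings: +1 = malware, -1 = benign.
POINTS = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0), (-2.0, -2.0), (-3.0, -1.5), (-1.5, -3.0)]
LABELS = [1, 1, 1, -1, -1, -1]
W, B = train_linear_svm(POINTS, LABELS)
PREDICTIONS = [predict(W, B, x) for x in POINTS]
```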

Specific examples are given below.

The experiments were run on an Intel Core i5-4210U CPU @ 1.70 GHz with 8.00 GB of memory under Windows 10 Home. API call sequence information was extracted from the samples by the Apktool decompilation module. Feature extraction, storage, embedding and classification were implemented with functions from the Pandas, NumPy and Sklearn libraries.

The malware data set used in the experiments is CICMalDroid2020, which collects more than 17,341 Android software samples covering five categories: Adware, banking malware (Banking), SMS malware (SMS), Riskware and benign software (Benign). The characteristics of each category are briefly introduced below:

Adware refers to advertising content that is typically hidden inside legitimate applications which have been infected by malware (in third-party markets). Adware keeps popping up advertisements even if the victim tries to force-close the application, because the advertisement library used by the malware repeats a series of steps to keep the advertisements running. Adware can infect the device, force it to download specific types of adware, and allow an attacker to steal personal information.

Banking malware aims to gain access to the user's online banking accounts by mimicking the original banking application or the banking web interface. Most mobile banking malware is Trojan-based and is designed to infiltrate devices, steal sensitive details (e.g., bank login names and passwords) and send the stolen information to a command and control (C&C) server.

SMS malware uses the SMS service as its medium of operation and intercepts SMS payloads to carry out attacks. The attacker first uploads the malware to a hosting site that is linked from the SMS. A C&C server is used to control the attack instructions, i.e. sending malicious SMS, intercepting SMS and stealing data.

Riskware refers to legitimate programs that may cause damage if malicious users exploit them. It can therefore turn into any other form of malware, such as Adware or ransomware, and can extend its functionality by installing newly infected applications. Notably, this category has only one variant, commonly labeled "Riskware" by VirusTotal.

All other applications that do not fall into the above categories are considered Benign, and all Benign samples were scanned with VirusTotal to verify that they are safe. Compared with other publicly available data sets, CICMalDroid2020 contains the most complete captured static and dynamic features. The following table lists the sample types in the data set, the total number of each type, and the number selected.

Sample type	Total (samples)	Selected (samples)
Benign	3,638	500
Adware	1,515	500
Banking	2,506	500
SMS	4,822	500
Riskware	2,546	0

To fully verify the effectiveness and practicality of the random walk embedding, Matrix2Vec performs 100 random walk samplings (without repetition) on each test unit, which amounts to providing a 100-dimensional feature for each test sample. Because the test samples are drawn at random from the data set and have no particular association with one another, 100 samplings was found in testing to be a feature count that reflects the software functions well.

To test the classification accuracy of the invention, 5 classification experiments were carried out. The accuracy for Banking, Adware and SMS malware was tested in the Matrix2Vec walk model under the meta-paths 'AA', 'ABA', 'APA', 'APBPA' and 'ABPBA', as shown in FIG. 4. Across all meta-paths, detection accuracy is highest for banking malware and lowest for SMS malware; with a single meta-path, the highest accuracy reaches 92.7%; and accuracy improves as the meta-path becomes more complex, which indicates that more complex meta-paths characterize malware better. FIG. 5 compares the model efficiency of the invention with that of the HinDroid method: for the same number of tested applications, the number of APIs used, the memory consumption and the running time are all significantly reduced.

The embodiments given above are preferred examples of implementing the invention, and the invention is not limited to them. Any non-essential additions or substitutions made by a person skilled in the art on the basis of the technical features of the technical solution of the invention fall within the protection scope of the invention.

Claims

1. A heterogeneous network-based Android malicious software detection method is characterized by comprising the following operations:

2. The heterogeneous network-based Android malware detection method of claim 1, wherein the compiling module decodes and decompiles all test samples with Apktool, keeps only files with the .smali suffix, and deletes the remaining files.

3. The heterogeneous network-based Android malware detection method of claim 1, wherein the feature construction module extracts, from the Smali code, the APIs and the inclusion relation, code block relation or package relation that exists between APKs and APIs, and characterizes connection paths in the heterogeneous network by meta-paths;

when the inclusion relation, the code block relation and the package relation are abbreviated as the A relation, B relation and P relation respectively, a concrete instance of the meta-path (ABA) is the path $(APK_1, API_i, API_j, APK_2)$. It indicates that $APK_1$ contains a call to $API_i$, that $APK_2$ contains a call to $API_j$, and that $API_i$ and $API_j$ are linked by the middle relation of the meta-path, here the B relation.

4. The heterogeneous network-based Android malware detection method of claim 1, wherein the BM25 method performs dimension reduction on the extracted API call information, and filters and removes encryption or confusion calls in the features, namely, removes API call information containing special characters or incomplete characters;

5. The heterogeneous network-based Android malware detection method of claim 1 or 3, wherein the feature matrix is generated by:

when the value is 1, API call exists in the APK, otherwise, the value is 0;

a value of 1 indicates that the pair of API calls belong to the same packet.

6. The heterogeneous network-based Android malware detection method of claim 1, wherein the random walk is performed by a Matrix2Vec random walk sampling method, and by inputting an application sequence and a walk rule sequence, the random walk according to rules for all applications in the application sequence is completed, and finally a random walk sampling sequence of each application is output;

7. The heterogeneous network-based Android malware detection method of claim 6, wherein the Matrix2Vec random walk sampling method comprises the following operations:

1) the application sequence $a_1, a_2, \dots, a_i, \dots, a_t, \dots, a_n$ is taken as input, and the number of walks m is set for each application;

2) a walk rule sequence $R = R_1 \circ R_2 \circ \dots \circ R_{l-1}$ is input; for application $a_i$ in the feature matrix $M_A$, the walk randomly moves to a row-column entry $M_{ij}$, where the value of $M_{ij}$ indicates whether the corresponding relation exists for application $a_i$ in the matrix M: 1 means it exists and 0 means it does not;

8. The heterogeneous network-based Android malware detection method of claim 6, wherein maximizing the conditional probability means that, for a given θ, Softmax regression defines the predicted conditional probability of class c as $p(c_t \mid v; \theta)$, where $c_t$ ranges over the c categories in θ and v denotes the current vertex, i.e. the starting point of the walk sequence;

the maximum conditional probability calculation formula is as follows:

9. The heterogeneous network-based Android malware detection method of claim 1 or 6, wherein malware is classified and detected with the SVM method as follows: