CN113901465A - Heterogeneous network-based Android malicious software detection method - Google Patents

Info

Publication number
CN113901465A
Authority
CN
China
Prior art keywords
api
matrix
heterogeneous network
random walk
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111034077.7A
Other languages
Chinese (zh)
Inventor
崔艳鹏
胡建伟
于昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xidian Network Security Research Institute
Xi'an Humen Network Technology Co ltd
Original Assignee
Chengdu Xidian Network Security Research Institute
Xi'an Humen Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xidian Network Security Research Institute and Xi'an Humen Network Technology Co ltd
Priority to CN202111034077.7A
Publication of CN113901465A
Priority to CN202211074492.XA (CN116010947A)
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking

Abstract

The invention discloses a heterogeneous network-based Android malware detection method. A compiling module decompiles all test samples with Apktool. Multiple kinds of API call information are extracted as matrix construction elements, based on the characteristics of node and edge relations in a heterogeneous network. The BM25 method reduces the dimensionality of the API call information, after which the matrices are constructed. A sampling and classification module performs random walks on the matrices, uses the node sequences obtained from the walks as input features for a Skip-gram model to produce embedded representations, and classifies and detects malware with an SVM algorithm. The invention performs random walks on the matrices along predefined meta-paths and constructs node sequences rich in semantics; embedded values are generated with the Skip-gram method from the multiple walk sequences of the different application nodes, so that malware is detected and classified.

Description

Heterogeneous network-based Android malicious software detection method
Technical Field
The invention belongs to the technical field of network security, and relates to an Android malicious software detection method based on a heterogeneous network.
Background
Android is the most widely used smart mobile device platform in the world and is an open platform built on Linux. Since Google introduced the first-generation Android operating system in 2007, its share of the mobile device market has grown rapidly over roughly a decade. As smart devices have become more popular, more and more applications (apps) have come online. Apps with a wide range of functions meet many daily needs, but the security risks have grown as well.
Android malware was more active in 2020 than had been expected at the end of 2019, and detections of several known types of malware rose markedly, including Trojans, fake advertising and fake banking software. Such software may first request permission to display over other applications and may also request permission to install applications from unknown sources. Once the user grants these permissions, the malware can display advertisements over other applications and install malware from third-party app stores. Within minutes of obtaining permission, advertisements may appear in various forms: opening the default web browser on an advertising website, popping up advertisements in notifications, or even forging the message notification bar so that the user opens an advertisement by tapping it unknowingly.
Most existing work models information networks as homogeneous information networks (homogeneous networks for short), that is, networks containing only one type of object and one type of link, such as social networks and friend circles. Homogeneous modeling usually extracts only part of the information in the actual interacting system and does not distinguish the heterogeneity of objects and of the relations between them, which causes irreversible information loss. Traditional Android behavior modeling methods consider only the interconnections among API calls and ignore the rich semantics among them, such as the package names of the APIs and the code blocks in which the APIs appear.
In recent years, more researchers have modeled interconnected networked data of multiple types as heterogeneous information networks (heterogeneous networks for short), which gives a more complete and natural abstraction of the real world. When this information is combined with API call sequences, a program modeled as a heterogeneous graph carries richer semantics, so Android malware detection based on heterogeneous graph modeling has become a research hotspot. Ye et al. proposed HinDroid, which was the first to construct a heterogeneous graph to model the complex relations between APIs and Android applications. The method sets APKs and APIs as nodes and the various relations between them as edges. Using heterogeneous network modeling, it organizes the structural information among API calls into three relations: the inclusion relation (A), the code block relation (B) and the package relation (P). In an example with four APK nodes and four API nodes, the APIs used in an APK are represented by ellipses, and code block and package relations exist between APIs. Defining APKs and APIs as nodes and the three relations (inclusion, code block and package) as edges yields a heterogeneous graph for Android software.
Based on the constructed Android heterogeneous graph, a Skip-gram model is used to maximize the conditional probability of each application node: for the multiple walk sequences t(a_i) of each application a_i, the representation of application node a_i in the Android heterogeneous network is learned by maximizing the conditional probability of the network likelihood. Finally, the embedded values generated by the Skip-gram model are classified with an SVM method, and different types of malware are detected and classified in a supervised manner. However, because obfuscation is not distinguished, a great deal of overhead is incurred.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a heterogeneous network-based Android malware detection method that reduces detection overhead and improves detection accuracy.
The invention is realized by the following technical scheme:
A heterogeneous network-based Android malware detection method comprises the following operations:
1) a compiling module decodes and decompiles the samples under test to extract the Smali code of every test sample;
2) a feature construction module identifies and organizes the features in the Smali code and extracts multiple kinds of API call information from the test samples, based on the characteristics of node and edge relations in a heterogeneous network;
the BM25 method is then applied to reduce the dimensionality of the extracted API call information and filter out obfuscated semantics, the influential API call information is selected, and this information is used as the elements from which the feature matrix is generated;
3) a detection and classification module performs random walks on the feature matrix along defined meta-paths that express semantic features, uses the walk sequences obtained as input features for the Skip-gram model to produce embedded representations, and then applies an SVM method to the embeddings to detect malware.
The compiling module decodes and decompiles all test samples with Apktool, keeps only the files with the .smali suffix, and deletes the remaining files.
The feature construction module extracts features from the Smali code as follows: it extracts the APIs from the Smali code together with the inclusion, code block and package relations that exist between APKs and APIs, and characterizes the connection paths of APKs in the heterogeneous network by meta-paths;
the inclusion relation is: API calls beginning with invoke in the decompiled Smali code of a single APK;
the code block relation is: a pair of API calls that appear between the same pair of ".method" and ".end method" in the Smali code;
the package relation is: a pair of API calls in the Smali code that share the same package name;
with the inclusion, code block and package relations abbreviated as the A, B and P relations respectively, a specific path of the meta-path (ABA) is P = (APK1, APIi, APIj, APK2); it indicates that APK1 calls APIi, APK2 calls APIj, and a B relation (the middle relation of the meta-path) exists between APIi and APIj.
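As an illustration of the three relations defined above, the following is a minimal sketch that scans the text of one Smali file. The function name `extract_relations`, the regular expression, and the choice to identify an API as `Lclass;->method` are assumptions made for this example; the patent does not specify an implementation.

```python
import re
from itertools import combinations

# Matches lines such as:
#   invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
INVOKE_RE = re.compile(r"^\s*invoke-[\w/-]+\s+\{[^}]*\},\s*(L[^;]+;)->([^(\s]+)\(")


def package_of(api):
    """Return the package part of 'Lpkg/sub/Class;->method', e.g. 'Lpkg/sub'."""
    return api.split(";")[0].rsplit("/", 1)[0]


def extract_relations(smali_text):
    """Return (apis, code_block_pairs, package_pairs) for one Smali file."""
    apis, block_pairs = set(), set()
    current_block = None  # APIs seen inside the current .method ... .end method
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            current_block = set()
        elif stripped.startswith(".end method"):
            if current_block:
                # B relation: every pair of APIs invoked in the same method body.
                block_pairs.update(combinations(sorted(current_block), 2))
            current_block = None
        else:
            match = INVOKE_RE.match(line)
            if match:
                api = f"{match.group(1)}->{match.group(2)}"
                apis.add(api)  # A relation: this APK invokes the API
                if current_block is not None:
                    current_block.add(api)
    # P relation: two distinct APIs whose classes share a package name.
    package_pairs = {
        (a, b)
        for a, b in combinations(sorted(apis), 2)
        if package_of(a) == package_of(b)
    }
    return apis, block_pairs, package_pairs
```

Pairs are stored as sorted tuples so that each unordered pair appears once.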
The BM25 method reduces the dimensionality of the extracted API call information by screening out encrypted or obfuscated calls in the features, that is, by removing API call information that contains special characters or incomplete characters;
the influential API call information is selected by the following operations:
101) all API documents aggregated from the decompiled APKs are taken as the document set D, the APIs extracted from each APK in D are taken as an individual document d, and a single API in d is treated as a morpheme q;
102) the morpheme is segmented into words t, and the inverse document frequency is computed from d and t to obtain the weight of the morpheme q;
103) the relevance score between the morpheme q and the document d is computed until every word in the current morpheme has been processed;
104) steps 102) and 103) are repeated until the relevance score of every morpheme in every document d has been computed, and the scores of all words are summed with their weights; the several APIs with the highest sums are taken as the influential API call information of the individual document d.
The feature matrices are generated as follows:
a Contain matrix, denoted M_A, is generated for the Contain information in the features, where each element is
$M_{A,ij} = a_{ij} \in \{0, 1\}$
a value of 1 indicates that APK i contains a call to API j, and a value of 0 indicates that it does not;
a CodeBlock matrix, denoted M_B, is generated for the CodeBlock information in the features, where each element is
$M_{B,ij} = b_{ij} \in \{0, 1\}$
a value of 1 indicates that the pair of API calls belongs to the same code block;
a Package matrix, denoted M_P, is generated for the Package information in the features, where each element is
$M_{P,ij} = p_{ij} \in \{0, 1\}$
a value of 1 indicates that the pair of API calls belongs to the same package.
The random walk uses the Matrix2Vec random walk sampling method: given the application sequence and the walk rule sequence as input, it performs rule-guided random walks for every application in the application sequence and finally outputs the random walk sampling sequences of each application;
for the random walk sampling sequences finally generated, a Skip-gram model is used to maximize the conditional probability of each application node: for the multiple walk sequences t(a_i) of each application a_i, the representation of application node a_i in the Android heterogeneous network is learned by maximizing the conditional probability.
The Matrix2Vec random walk sampling method comprises the following operations:
1) the application sequence a_1, a_2, …, a_i, …, a_t, …, a_n is taken as input, and the number of walks m is set for each application;
2) the walk rule sequence is input:
Figure BDA0003246415170000051
for application a_i in the feature matrix M_A, the walk moves at random to a corresponding row-column entry M_ij, where the value of M_ij indicates whether the corresponding relation exists for application a_i in matrix M: a value of 1 means it exists and a value of 0 means it does not;
3) once both the application sequence and the walk rule sequence have been fully consumed (both are empty), the m·n random walk sampling sequences are output.
The conditional probability to be maximized is defined through Softmax regression: given θ, the predicted conditional probability of belonging to class c is p(c_t | v; θ), where c_t indicates that θ has c categories and v denotes the current vertex, that is, the starting point of the walk sequence;
the maximum conditional probability calculation formula is as follows:
Figure BDA0003246415170000052
The SVM-based classification and detection of malware is as follows:
the embedded values produced by the Skip-gram model are projected onto a two-dimensional plane and classified with an SVM (support vector machine), which separates benign software from malware on the basis of software features.
Compared with the prior art, the invention has the following beneficial technical effects:
String-form features are easily tampered with by obfuscation techniques. To make the features more robust, the method extracts features based on the characteristics of node and edge relations in the heterogeneous network. Further, by analyzing how often normal semantics and obfuscated semantics occur in the features, it applies the BM25 method to filter out the obfuscated semantics and select a more accurate set of features. To address the high cost, time consumption and memory consumption of storing high-dimensional features, different storage methods are provided according to the characteristics of the different matrices, and the new matrix storage improves the model's performance to some extent.
Walk models cannot identify semantics, and the corpus they construct adds to the burden of the model. To address these problems, the invention adopts Matrix2Vec, an improved meta-path-based random walk method, which performs random walks on the matrices along predefined meta-paths and constructs node sequences rich in semantics; embedded values are generated with the Skip-gram method from the multiple walk sequences of the different application nodes, so that malware is detected and classified.
Drawings
FIG. 1 is a diagram of heterogeneous network modeling based on APK and API;
FIG. 2 is a schematic view of the detection process of the present invention;
FIG. 3 is a flow chart of a random walk sampling method of the present invention;
FIG. 4 is a graph of detection accuracy of the present invention;
FIG. 5 is a graph showing the comparison of the model efficiencies of the present invention and the HinDroid method.
Detailed Description
The present invention will now be described in further detail with reference to the following examples, which are intended to be illustrative, but not limiting, of the invention.
Referring to FIG. 1 and FIG. 2, a heterogeneous network-based Android malware detection method comprises the following operations:
1) a compiling module decodes and decompiles the samples under test to extract the Smali code of every test sample;
2) a feature construction module identifies and organizes the features in the Smali code and extracts multiple kinds of API call information from the test samples, based on the characteristics of node and edge relations in a heterogeneous network;
the BM25 method is then applied to reduce the dimensionality of the extracted API call information and filter out obfuscated semantics, the influential API call information is selected, and this information is used as the elements from which the feature matrix is generated;
3) a detection and classification module performs random walks on the feature matrix along defined meta-paths that express semantic features, uses the walk sequences obtained as input features for the Skip-gram model to produce embedded representations, and then applies an SVM method to the embeddings to detect malware.
Each step is specifically described below.
1. Decoding and decompiling
Android applications use API calls to access operating system functions and system resources, so API calls can serve as a representation of an Android application's behavior. To extract the API calls, the Android application is first decompressed to obtain its DEX file, which is then decompiled into Smali code with the reverse engineering tool Apktool.
The compiling module decompiles all test samples with Apktool as follows:
all test samples are decompiled with Apktool, only the files with the .smali suffix are kept, and the other redundant files are deleted.
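A minimal sketch of this step is given below. It assumes that the `apktool` command-line tool is installed and on the PATH; the function names `decompile_apk` and `keep_only_smali` and the directory layout are hypothetical and chosen only for illustration.

```python
import subprocess
from pathlib import Path


def decompile_apk(apk_path: Path, out_dir: Path) -> None:
    """Decode one APK with Apktool (assumed to be installed and on PATH)."""
    # "apktool d <apk> -o <dir> -f" decodes the APK into out_dir,
    # overwriting the directory if it already exists.
    subprocess.run(
        ["apktool", "d", str(apk_path), "-o", str(out_dir), "-f"],
        check=True,
    )


def keep_only_smali(out_dir: Path) -> int:
    """Delete every file that is not a .smali file; return how many were kept."""
    kept = 0
    # Reverse-sorted order visits children before their parent directories,
    # so directories can be removed once they have been emptied.
    for path in sorted(out_dir.rglob("*"), reverse=True):
        if path.is_file():
            if path.suffix == ".smali":
                kept += 1
            else:
                path.unlink()
        elif path.is_dir() and not any(path.iterdir()):
            path.rmdir()
    return kept
```

A caller would typically run `decompile_apk(apk, out)` followed by `keep_only_smali(out)` for each sample.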
2. Feature extraction
Take the heterogeneous network formalized in FIG. 1 as an example. It has four APK nodes and four API nodes (all drawn as circles). The APIs used in an APK are indicated by ellipses (the inclusion relation), and code block relations and package relations exist between APIs. APKs and APIs are defined as nodes, and the three forms of inclusion, code block and package are defined as relations, also called edges.
Figure BDA0003246415170000071
Two objects in a heterogeneous network can be connected through different meta-paths. A meta-path essentially extracts a substructure of the heterogeneous network and reflects the rich semantic information contained in the path, which makes it a basic means of capturing semantics in heterogeneous network analysis.
Therefore, the feature construction module extracts features from the Smali code by extracting the APIs from the Smali code together with the inclusion, code block and package relations that exist between APKs and APIs, and characterizes the connection paths of APKs in the heterogeneous network by meta-paths.
With the inclusion, code block and package relations abbreviated as the A, B and P relations respectively, a specific path of the meta-path (ABA) is P = (APK1, APIi, APIj, APK2), which indicates that APK1 calls APIi, APK2 calls APIj, and a B relation (the middle relation of the meta-path) exists between APIi and APIj.
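To make the (ABA) meta-path concrete, the sketch below enumerates its path instances from small in-memory relations. The function name `aba_paths` and the dictionary and set data layout are assumptions made for this illustration, not part of the patent.

```python
def aba_paths(contains, code_block_pairs):
    """Enumerate (apk1, api_i, api_j, apk2) instances of the meta-path ABA.

    contains: dict mapping an APK name to the set of APIs it calls (A relation).
    code_block_pairs: set of frozensets {api_i, api_j} (B relation).
    """
    paths = []
    for apk1, apis1 in contains.items():
        for apk2, apis2 in contains.items():
            if apk1 == apk2:
                continue  # only connect two different applications
            for api_i in sorted(apis1):
                for api_j in sorted(apis2):
                    # The middle hop must be a code block (B) relation.
                    if frozenset((api_i, api_j)) in code_block_pairs:
                        paths.append((apk1, api_i, api_j, apk2))
    return paths
```

Two applications that are linked by many such path instances share co-occurring API behavior, which is the semantic signal the meta-path is meant to capture.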
3. Feature screening
The converted Smali code can be analyzed to extract API calls. Because the API calls are extracted directly, without checking whether an API has been encrypted or obfuscated, inspection of the extraction table shows that the Smali code files of only 10 decompiled APKs contain 8 million API calls (with duplicates), including encrypted or obfuscated calls of the form "La/j; i". Such calls are unsuitable for the subsequent matrix construction and add time overhead.
To solve these problems, the BM25 method is used to reduce the dimensionality of the features, and encrypted or obfuscated calls in the features are screened out, for example API calls with special characters or incomplete characters such as La/j$a;-><init> and La/j; i.
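For intuition about what such obfuscated calls look like, the following sketch flags them with a simple pattern check. This is a swapped-in heuristic, not the BM25 screening the patent describes: the function name `looks_obfuscated`, the "$" rule and the rule that every name segment has at most two letters are all assumptions.

```python
import re

# Assumption: a name segment of one or two letters (e.g. "a", "j") suggests obfuscation.
SHORT_SEGMENT = re.compile(r"^[A-Za-z]{1,2}$")


def looks_obfuscated(api):
    """Heuristically flag API strings such as 'La/j$a;-><init>' or 'La/j; i'."""
    class_part = api.split(";", 1)[0]
    if class_part.startswith("L"):
        class_part = class_part[1:]  # drop the Smali class-descriptor prefix
    if "$" in class_part:
        # Special character. Note that this also flags legitimate inner classes.
        return True
    # Flag the call only if every package/class segment is very short.
    return all(SHORT_SEGMENT.match(segment) for segment in class_part.split("/"))
```

A rule of this kind is crude, which is one motivation for a frequency-based method such as BM25.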
Based on the BM25 algorithm, the screening steps are as follows:
1) All API documents aggregated from the decompiled APKs are taken as the document set D, the APIs extracted from each APK in D are taken as an individual document d, and a single API in d is treated as a morpheme q.
2) The morpheme is segmented into words t, and the inverse document frequency is computed from d and t to obtain the weight of the morpheme q.
3) The relevance score between the morpheme q and the document d is computed until every word in the current morpheme has been processed.
4) Steps 2) and 3) are repeated until the relevance score of every morpheme in every document d has been computed, and the scores of all words are summed with their weights; the several APIs with the highest sums are taken as the influential API call information of the individual document d.
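The screening steps above can be sketched with the standard Okapi BM25 formula. The tokenization rule, the parameter values k1 = 1.5 and b = 0.75, and the names `tokenize` and `bm25_top_apis` are assumptions for this example; the patent does not state which BM25 variant or parameters it uses.

```python
import math
import re
from collections import Counter


def tokenize(api):
    """Split 'Landroid/net/Uri;->parse' into the words Landroid, net, Uri, parse."""
    return [tok for tok in re.split(r";->|->|[/;]", api) if tok]


def bm25_top_apis(documents, top_k=2, k1=1.5, b=0.75):
    """For each document (the API list of one APK), return its top_k APIs by BM25 score."""
    docs_tokens = [[t for api in doc for t in tokenize(api)] for doc in documents]
    n_docs = len(documents)
    avg_len = sum(len(toks) for toks in docs_tokens) / n_docs
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(t for toks in docs_tokens for t in set(toks))

    def idf(term):
        n_t = doc_freq[term]
        return math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)

    results = []
    for doc, toks in zip(documents, docs_tokens):
        term_freq = Counter(toks)
        norm = k1 * (1.0 - b + b * len(toks) / avg_len)  # length normalization
        scores = {
            api: sum(
                idf(t) * term_freq[t] * (k1 + 1.0) / (term_freq[t] + norm)
                for t in tokenize(api)
            )
            for api in set(doc)
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda a: (-scores[a], a))
        results.append(ranked[:top_k])
    return results
```

In this sketch an API made of words that occur in every document receives a low score, while an API whose words are specific to a few documents ranks higher.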
4. Matrix generation
After feature screening, the dimensionality of the features is greatly reduced. For storing the features, the following matrix definitions are used.
For the Contain information, a Contain matrix is generated and denoted M_A, where each element is
$M_{A,ij} = a_{ij} \in \{0, 1\}$
A value of 1 indicates that APK i contains a call to API j; a value of 0 indicates that it does not.
For the CodeBlock information, a CodeBlock matrix is generated and denoted M_B, where each element is
$M_{B,ij} = b_{ij} \in \{0, 1\}$
A value of 1 indicates that the pair of API calls belongs to the same code block.
For the Package information, a Package matrix is generated and denoted M_P, where each element is
$M_{P,ij} = p_{ij} \in \{0, 1\}$
A value of 1 indicates that the pair of API calls belongs to the same package.
Constructing these matrices yields a feature representation that contains rich information.
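A minimal NumPy sketch of the three matrices follows. The function name `build_matrices`, the input layout (a dict of API sets plus two sets of API pairs) and the decision to leave the diagonals of M_B and M_P at 0 are assumptions made for illustration.

```python
import numpy as np


def build_matrices(apk_apis, code_block_pairs, package_pairs):
    """Build M_A (APK x API) and M_B, M_P (API x API) as 0/1 arrays."""
    apks = sorted(apk_apis)
    apis = sorted({api for used in apk_apis.values() for api in used})
    col = {api: j for j, api in enumerate(apis)}

    # M_A[i, j] = 1 when APK i contains a call to API j.
    m_a = np.zeros((len(apks), len(apis)), dtype=np.uint8)
    for i, apk in enumerate(apks):
        for api in apk_apis[apk]:
            m_a[i, col[api]] = 1

    def pair_matrix(pairs):
        # Symmetric API x API relation; self-pairs are left at 0 in this sketch.
        m = np.zeros((len(apis), len(apis)), dtype=np.uint8)
        for a, b in pairs:
            m[col[a], col[b]] = 1
            m[col[b], col[a]] = 1
        return m

    return apks, apis, m_a, pair_matrix(code_block_pairs), pair_matrix(package_pairs)
```

For large API vocabularies a sparse representation would use far less memory than these dense arrays.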
5. Random walk
Random walks are performed with the Matrix2Vec random walk sampling method. Given the application sequence and the walk rule sequence as input, the method performs rule-guided random walks for every application in the application sequence and finally outputs the walk sequences of each application. It changes the value taken at each step of the random walk algorithm from an arbitrary random value to one drawn from a determinable range, which makes the random walk algorithm more stable and widens the applicability of the original method. The flow of the method is shown in FIG. 3, and its steps are briefly as follows:
1) The application sequence a_1, a_2, …, a_i, …, a_t, …, a_n is taken as input, and the number of walks m is set for each application.
2) The walk rule sequence is input:
Figure BDA0003246415170000101
For application a_i in matrix M_A, the walk moves at random to a corresponding row-column entry M_ij, where the value of M_ij indicates whether the corresponding relation exists for application a_i in matrix M: a value of 1 means it exists and a value of 0 means it does not.
3) Once both the application sequence and the walk rule sequence have been fully consumed (both are empty), the m·n random walk sampling sequences are output.
For the m·n random walk sampling sequences finally generated, a Skip-gram model is used to maximize the conditional probability of each application node: for the multiple walk sequences t(a_i) of each application a_i, the representation of application node a_i in the Android heterogeneous network is learned by maximizing the conditional probability of the network likelihood.
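The steps above leave several details open, so the following is only one possible reading of Matrix2Vec, written as a sketch. It assumes the walk rule is a meta-path string such as "ABA", that each step picks uniformly among the matrix entries equal to 1, and that a walk stops early at a dead end; the function name `matrix_walks` and the node encoding are hypothetical.

```python
import random


def matrix_walks(m_a, relation_matrices, rule, walks_per_app, seed=0):
    """Sample walks_per_app rule-guided walks for every APK row of m_a.

    m_a: APK x API 0/1 matrix given as a list of rows.
    relation_matrices: dict such as {"B": m_b, "P": m_p}, each API x API.
    rule: walk rule such as "ABA"; it must start with "A".
    """
    rng = random.Random(seed)
    walks = []
    for start in range(len(m_a)):
        for _ in range(walks_per_app):
            walk, node, on_apk = [("apk", start)], start, True
            for step in rule:
                if step == "A" and on_apk:   # APK -> one of the APIs it calls
                    options = [j for j, v in enumerate(m_a[node]) if v == 1]
                    kind, on_apk = "api", False
                elif step == "A":            # API -> an APK that calls it
                    options = [i for i, row in enumerate(m_a) if row[node] == 1]
                    kind, on_apk = "apk", True
                else:                        # API -> API through B or P
                    row = relation_matrices[step][node]
                    options = [j for j, v in enumerate(row) if v == 1]
                    kind = "api"
                if not options:              # dead end: stop this walk early
                    break
                node = rng.choice(options)
                walk.append((kind, node))
            walks.append(walk)
    return walks
```

With n applications and m walks each, the function returns m·n sequences, matching the count in step 3).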
Softmax regression, also called multiclass logistic regression, generalizes logistic regression to multi-class problems.
With the Softmax function, for a multi-class model the conditional probability that Softmax regression predicts for membership of class c is defined as p(c_t | v; θ), where c_t indicates that θ has c categories.
The maximum conditional probability formula is as follows:
Figure BDA0003246415170000102
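As a numerical illustration, the sketch below evaluates a conditional probability of the usual Skip-gram softmax form, exp(x_c · x_v) divided by the sum of exp(x_u · x_v) over all nodes u. That this matches the patent's formula exactly is an assumption, as is the simplification of using a single embedding table; the function name `softmax_probability` is hypothetical.

```python
import math


def softmax_probability(context, vertex, embeddings):
    """p(context | vertex): softmax over dot products with every node embedding."""

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    scores = [dot(e, embeddings[vertex]) for e in embeddings]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    return exps[context] / sum(exps)
```

Training a Skip-gram model adjusts the embeddings so that nodes appearing together in walk sequences receive a higher probability under this expression.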
6. Detection and classification
The embedded values generated by the Skip-gram model are classified with an SVM (support vector machine), and different types of malware are detected and classified in a supervised manner.
An SVM training algorithm builds a model that assigns a new instance to one class or the other, which makes it a non-probabilistic binary linear classifier; its generalization error rate is low, its computational cost is small, and its results are easy to interpret.
SVMs can analyze data, recognize patterns, and be used for classification and regression analysis. Given a set of training samples, each labeled as belonging to one of two categories, the embedded values obtained with the Skip-gram method are projected onto a two-dimensional plane; because different software has different features, the SVM method can separate benign software from malware well.
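The classification step can be sketched with scikit-learn, which the experiments below name as Sklearn. The linear kernel, the 75/25 train/test split and the function name `train_svm` are assumptions for this example rather than settings reported in the patent.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def train_svm(embeddings, labels, seed=0):
    """Fit a linear SVM on embedding vectors and return (model, test accuracy)."""
    x_train, x_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    clf = SVC(kernel="linear")  # binary linear classifier: benign vs. malicious
    clf.fit(x_train, y_train)
    return clf, clf.score(x_test, y_test)
```

Here `embeddings` would be one Skip-gram vector per application and `labels` the corresponding benign or malicious tags.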
Specific examples are given below.
The experiments were run on an Intel Core i5-4210U CPU @ 1.70 GHz with 8.00 GB of memory under Windows 10 Home. The API call sequence information in the samples was extracted by the Apktool decompilation module. Feature extraction, storage, embedding and classification were implemented with functions from the Pandas, NumPy and Sklearn libraries.
The malicious application dataset used in this experiment is CICMalDroid2020, which collects more than 17,341 Android samples covering five categories: adware (Adware), banking malware (Banking), SMS malware (SMS), riskware (Riskware) and benign software (Benign). The characteristics of the different categories are briefly introduced below:
Adware refers to advertising content that is typically hidden inside legitimate applications that have been infected by malware (in third-party markets). Adware keeps popping up advertisements even if the victim tries to force-close the application, because the advertisement library used by the malware repeats a series of steps to keep the advertisements on screen. Adware can infect a device, force it to download specific types of adware, and allow an attacker to steal personal information.
Banking malware aims to gain access to a user's online banking account by mimicking the original banking application or the banking web interface. Most mobile banking malware is Trojan-based and is designed to infiltrate devices, steal sensitive details (for example bank login names and passwords) and send the stolen information to a command and control (C&C) server.
SMS malware uses the SMS service as its operating medium and intercepts SMS payloads to carry out attacks. The attacker first uploads the malware to a hosting site that is linked from an SMS message, and then uses a C&C server to control attack instructions such as sending malicious SMS messages, intercepting SMS messages and stealing data.
Riskware refers to legitimate programs that can cause damage if malicious users exploit them. It can therefore turn into any other form of malware, such as adware or ransomware, and can extend its functionality by installing newly infected applications. Notably, this category has only one variant, which VirusTotal commonly labels "Riskware".
All other applications that fall outside the above categories are considered benign, and all benign samples were scanned with VirusTotal to verify that they are safe. Compared with other publicly available datasets, this dataset contains the most complete captured static and dynamic features. The following table lists the sample types in the dataset, the total number of each type, and the number selected.
Sample type    Total samples    Samples selected
Benign         3,638            500
Adware         1,515            500
Banking        2,506            500
SMS            4,822            500
Riskware       2,546            0
To fully verify the effectiveness and practicality of the random-walk-based embedding, Matrix2Vec performs 100 random walk samplings (without repetition) on each test unit, which amounts to providing 100-dimensional features for each test sample. Each test sample is drawn at random from the dataset with no specific association between samples, and testing showed that 100 samplings give a number of features that reflects the software's functions well.
To test the classification accuracy of the invention, 5 classification experiments were first carried out. The accuracy for banking malware, adware and SMS malware was tested in the Matrix2Vec walk model under the meta-paths 'AA', 'ABA', 'APA', 'APBPA' and 'ABPBA', as shown in FIG. 4. Across all meta-paths, detection accuracy is highest for banking malware and lowest for SMS malware; for a single meta-path, the highest accuracy reaches 92.7%. Accuracy also rises as the meta-path becomes more complex, which suggests that more complex meta-paths characterize malware better. FIG. 5 compares the model efficiency of the invention with that of the HinDroid method: for the same number of tested applications, the number of APIs used, the runtime memory consumption and the time consumption are all significantly lower.
The embodiments given above are preferred examples of implementing the invention, and the invention is not limited to them. Any non-essential addition or substitution made by a person skilled in the art on the basis of the technical features of the technical solution of the invention falls within the protection scope of the invention.

Claims (9)

1. A heterogeneous network-based Android malicious software detection method is characterized by comprising the following operations:
1) decoding and decompiling the samples to be tested through a compiling module so as to extract the Smali code of all test samples;
2) identifying and sorting the features in the Smali code through a feature construction module, and extracting the various kinds of API call information from the test samples based on the node and edge relations of a heterogeneous network;
then reducing the dimension of the extracted API call information by the BM25 method, filtering out obfuscated semantics, screening out the influential API call information, and using that information as elements to generate feature matrices;
3) through a detection and classification module, applying defined meta-paths capable of expressing semantic features to perform random walks on the feature matrices, using the walk sequences obtained by the random walks as the input features of a Skip-gram model for embedded representation, and then using an SVM method to detect malware from the embedded representation produced by the Skip-gram model.
2. The heterogeneous network-based Android malware detection method of claim 1, wherein the compiling module decodes and decompiles all test samples with the Apktool tool, keeps only the files with the suffix .smali, and deletes all remaining files.
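The file-filtering step of claim 2 can be sketched as below. This is a minimal illustration, assuming the sample has already been decoded into a directory (for example with an Apktool decode command run beforehand); the function name `keep_only_smali` is hypothetical and does not come from the patent.

```python
import os


def keep_only_smali(root):
    """Delete every file under `root` that is not a .smali file.

    Returns the sorted list of .smali paths that were kept.
    Hypothetical helper: the patent only states that non-.smali files are deleted.
    """
    kept = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".smali"):
                kept.append(path)
            else:
                os.remove(path)  # resources, manifest, etc. are discarded
    return sorted(kept)
```

Empty directories are left in place, since only files matter for the later feature extraction.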
3. The heterogeneous network-based Android malware detection method of claim 1, wherein the feature construction module extracts, from the Smali code, the APIs together with the inclusion relationship, code block relationship, or package relationship existing between the APKs and the APIs, and characterizes a connection path in the heterogeneous network by a meta-path;
the inclusion relationship is as follows: API calls initiated with invoke in single APK decompiled Smali code;
the code block relationship is as follows: a pair of API calls that occur between a pair of ". method" and ". end method" in the Smali code;
the package relationship is: a pair of API calls in the Smali code that occur with the same package name;
when the inclusion relationship, the code block relationship, and the package relationship are abbreviated as the A relation, the B relation, and the P relation, respectively, the specific path for the meta-path P = (ABA) is P(APK_1, API_i, API_j, APK_2); it indicates that a call to API_i exists in APK_1, a call to API_j exists in APK_2, and a P relation exists between API_i and API_j.
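The three relationships defined in claim 3 can be illustrated with a short extraction sketch. The regular expression, the `extract_relations` function, and the sample Smali fragment are assumptions made for illustration; the patent does not specify how the Smali code is parsed.

```python
import re
from itertools import combinations

# Matches e.g. "invoke-virtual {v0, v1}, Landroid/util/Log;->d(...)I"
# and captures "Landroid/util/Log;->d" (class plus method name).
INVOKE_RE = re.compile(r"^\s*invoke-\S+\s+\{[^}]*\},\s*(L[^;]+;->[^\s(]+)")


def package_of(api):
    """Return the package part of an API, e.g. 'android/util'."""
    cls = api.split(";->")[0]          # "Landroid/util/Log"
    return cls[1:].rsplit("/", 1)[0]   # drop leading 'L' and the class name


def extract_relations(smali_text):
    """Return (apis, code_block_pairs, package_pairs) for one APK's Smali text."""
    apis, block_pairs, current = set(), set(), None
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            current = set()                       # a new code block starts
        elif stripped.startswith(".end method"):
            if current:
                block_pairs.update(combinations(sorted(current), 2))
            current = None
        else:
            match = INVOKE_RE.match(line)
            if match:
                apis.add(match.group(1))          # inclusion relationship
                if current is not None:
                    current.add(match.group(1))
    package_pairs = {
        (a, b) for a, b in combinations(sorted(apis), 2)
        if package_of(a) == package_of(b)
    }
    return apis, block_pairs, package_pairs


SAMPLE = """
.method public send()V
    invoke-static {}, Landroid/telephony/SmsManager;->getDefault()Landroid/telephony/SmsManager;
    invoke-virtual {v0, v1}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
.end method
.method public log()V
    invoke-static {v0, v1}, Landroid/util/Log;->d(Ljava/lang/String;Ljava/lang/String;)I
.end method
"""

apis, block_pairs, package_pairs = extract_relations(SAMPLE)
```

In this sample the two telephony calls share both a code block and a package, while the logging call is related to neither.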
4. The heterogeneous network-based Android malware detection method of claim 1, wherein the BM25 method performs dimension reduction on the extracted API call information and filters out encrypted or obfuscated calls in the features, i.e., removes API call information containing special characters or incomplete characters;
the screening of the influential API call information comprises the following operations:
101) taking the collection of all API documents obtained by APK decompilation as a document set D, taking the APIs extracted from each APK in D as an individual document d, and regarding a single API in d as a morpheme q;
102) segmenting the morpheme to obtain words t, and calculating the inverse document frequency from d and t to obtain the weight of the morpheme q;
103) calculating the relevance score between the morpheme q and the document d, continuing until every word in the current morpheme has been processed;
104) repeating 102) and 103) until the relevance score of every morpheme in each document d has been calculated, with the scores of all its words combined by weighted summation; the several APIs with the highest summed scores are taken as the influential API call information of the individual document d.
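Steps 101) to 104) can be sketched with a standard BM25 scoring function. The tokenizer, the parameter values k1 = 1.5 and b = 0.75, the smoothed IDF variant, and the name `bm25_top_apis` are assumptions; the patent does not state which BM25 variant or parameters are used, and the filtering of obfuscated calls is not shown here.

```python
import math
import re
from collections import Counter


def tokenize(api):
    """Split one API (morpheme q) into lower-case words t. Assumed scheme."""
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", api) if w]


def bm25_top_apis(documents, top_k=2, k1=1.5, b=0.75):
    """For each document d (the API list of one APK), return its top_k APIs by BM25."""
    doc_tokens = [[t for api in doc for t in tokenize(api)] for doc in documents]
    n_docs = len(documents)
    avg_len = sum(len(toks) for toks in doc_tokens) / n_docs
    doc_freq = Counter()
    for toks in doc_tokens:
        doc_freq.update(set(toks))
    # Smoothed inverse document frequency (never negative).
    idf = {t: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0) for t, n in doc_freq.items()}

    selected = []
    for doc, toks in zip(documents, doc_tokens):
        tf = Counter(toks)
        norm = k1 * (1 - b + b * len(toks) / avg_len)
        scores = {}
        for api in set(doc):
            # Sum the per-word relevance over the words of this API.
            scores[api] = sum(
                idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm) for t in set(tokenize(api))
            )
        ranked = sorted(scores, key=lambda a: (-scores[a], a))
        selected.append(ranked[:top_k])
    return selected


DOCS = [
    ["Ljava/lang/String;->valueOf", "Landroid/telephony/SmsManager;->sendTextMessage"],
    ["Ljava/lang/String;->valueOf", "Landroid/util/Log;->d"],
    ["Ljava/lang/String;->valueOf", "Landroid/net/Uri;->parse"],
]
top_apis = bm25_top_apis(DOCS, top_k=1)
```

Because `String;->valueOf` occurs in every document, its words receive a low IDF, so the rarer API in each document is ranked first.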
5. The heterogeneous network-based Android malware detection method of claim 1 or 3, wherein the feature matrix is generated by:
generating a Contain matrix from the Contain information in the features, denoted the matrix M_A, wherein each element is:
Figure FDA0003246415160000021
a value of 1 indicates that the API call exists in the APK; otherwise the value is 0;
generating a CodeBlock matrix from the CodeBlock information in the features, denoted the matrix M_B, wherein each element is:
Figure FDA0003246415160000022
a value of 1 indicates that the pair of API calls belongs to the same code block;
generating a Package matrix from the Package information in the features, denoted the matrix M_P, wherein each element is:
Figure FDA0003246415160000031
a value of 1 indicates that the pair of API calls belongs to the same package.
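The Contain, CodeBlock, and Package matrices of claim 5 can be built as plain 0/1 tables. The sketch below is illustrative only: the function name `build_matrices`, the short API labels, the symmetric treatment of API pairs, and the zero diagonal are assumptions, since the element formulas appear only as images in the source.

```python
def build_matrices(apk_apis, block_pairs, package_pairs):
    """Build the Contain (APK x API), CodeBlock and Package (API x API) matrices.

    apk_apis:      list of API sets, one per APK
    block_pairs:   iterable of (api, api) pairs sharing a code block
    package_pairs: iterable of (api, api) pairs sharing a package
    """
    api_index = sorted({api for apis in apk_apis for api in apis})
    pos = {api: j for j, api in enumerate(api_index)}

    # Contain matrix: 1 when the APK contains a call to the API.
    contain = [[1 if api in apis else 0 for api in api_index] for apis in apk_apis]

    def pair_matrix(pairs):
        size = len(api_index)
        matrix = [[0] * size for _ in range(size)]
        for a, b in pairs:
            if a in pos and b in pos:
                matrix[pos[a]][pos[b]] = 1
                matrix[pos[b]][pos[a]] = 1   # assumed symmetric relation
        return matrix

    return api_index, contain, pair_matrix(block_pairs), pair_matrix(package_pairs)


api_index, contain, code_block, package = build_matrices(
    apk_apis=[{"sms.send", "tel.id"}, {"log.d", "tel.id"}],
    block_pairs={("sms.send", "tel.id")},
    package_pairs={("log.d", "sms.send")},
)
```

Each row of the Contain matrix corresponds to one APK, and rows and columns of the other two matrices follow the order of `api_index`.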
6. The heterogeneous network-based Android malware detection method of claim 1, wherein the random walk is performed by the Matrix2Vec random walk sampling method: an application sequence and a walk rule sequence are input, a rule-based random walk is completed for every application in the application sequence, and finally a random walk sampling sequence is output for each application;
the Skip-gram model is then applied to the finally generated random walk sampling sequences to maximize the conditional probability of each application node: by maximizing the conditional probability over the set of walk sequences associated with each application a_i (written with the subscript t(a_i)), the representation of the application node a_i in the Android heterogeneous network is learned.
7. The heterogeneous network-based Android malware detection method of claim 6, wherein the Matrix2Vec random walk sampling method comprises the following operations:
1) taking the application sequence a_1, a_2, …, a_i, …, a_t, …, a_n as input, and setting the number of walks m for each application;
2) inputting a walk rule sequence R = R_1 ∘ R_2 ∘ … ∘ R_{l-1}; for an application a_i in the feature matrix M_A, randomly walking to a corresponding row-and-column value M_ij, wherein the value of M_ij indicates whether the corresponding relation exists for the application a_i in the matrix M: the relation exists when the value is 1 and does not exist when the value is 0;
3) once the application sequence and the walk rule sequence have both been exhausted (i.e., both are empty), outputting the m · n random walk sampling sequences.
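The walk described in steps 1) to 3) can be sketched as follows. This is an interpretation under stated assumptions, not the patented Matrix2Vec implementation: the rule sequence is modelled as a list of (node type, 0/1 matrix) steps, each step moves uniformly at random to a column whose entry is 1, a walk stops early when no such column exists, and the "no repeated sampling" condition mentioned in the description is not enforced.

```python
import random


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matrix_walks(n_apps, rules, walks_per_app, seed=0):
    """Return n_apps * walks_per_app walk sequences of (node_type, index) pairs.

    rules: list of (target_node_type, matrix); matrix[current][j] == 1 means
           the walk may move from `current` to node j of the target type.
    """
    rng = random.Random(seed)
    sequences = []
    for app in range(n_apps):
        for _ in range(walks_per_app):
            current = app
            walk = [("APK", app)]
            for node_type, matrix in rules:
                candidates = [j for j, value in enumerate(matrix[current]) if value == 1]
                if not candidates:
                    break                      # dead end: keep the partial walk
                current = rng.choice(candidates)
                walk.append((node_type, current))
            sequences.append(walk)
    return sequences


# Toy matrices: 2 APKs, 3 APIs (illustrative values only).
contain = [[0, 1, 1], [1, 0, 1]]
code_block = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]

# Meta-path ABA: APK -> API (contain), API -> API (code block), API -> APK.
aba_rules = [("API", contain), ("API", code_block), ("APK", transpose(contain))]
walks = matrix_walks(n_apps=2, rules=aba_rules, walks_per_app=3)
```

With n = 2 applications and m = 3 walks each, the function returns m · n = 6 sequences, which could then be fed to a Skip-gram model as sentences.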
8. The heterogeneous network-based Android malware detection method of claim 6, wherein maximizing the conditional probability means defining a Softmax regression function that predicts, for a given θ, the conditional probability p(c_t | v; θ) of category c, wherein c_t denotes the category c in θ, and v denotes the current vertex, i.e., the starting point of the walk sequence;
the conditional probability to be maximized is calculated as follows:
Figure FDA0003246415160000041
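The formula of claim 8 is an image in the source and is not reproduced here. As a rough illustration only, the softmax form commonly used with Skip-gram models can be sketched as below; the embedding vectors, the node names, the dot-product scoring, and the function name `conditional_probability` are all assumptions and may differ from the claimed formula.

```python
import math


def conditional_probability(context, vertex, embeddings):
    """Softmax probability of `context` given `vertex` over all known nodes.

    embeddings: dict mapping node name -> embedding vector (list of floats).
    Scores are dot products with the vertex embedding (assumed form).
    """
    v = embeddings[vertex]
    logits = {
        node: sum(x * y for x, y in zip(vec, v)) for node, vec in embeddings.items()
    }
    peak = max(logits.values())                      # subtract max for stability
    total = sum(math.exp(score - peak) for score in logits.values())
    return math.exp(logits[context] - peak) / total


# Hypothetical 2-dimensional embeddings for one application and two APIs.
EMB = {"a1": [1.0, 0.0], "api_sms": [0.9, 0.1], "api_log": [-0.5, 0.8]}
p_sms = conditional_probability("api_sms", "a1", EMB)
p_log = conditional_probability("api_log", "a1", EMB)
```

Training a Skip-gram model amounts to adjusting the embeddings so that nodes co-occurring in the walk sequences receive high probabilities of this kind.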
9. The heterogeneous network-based Android malware detection method of claim 1 or 6, wherein classifying and detecting malware using the SVM method comprises:
projecting the embedded values represented by the Skip-gram model onto a two-dimensional plane, classifying the embedded values generated by the Skip-gram model with an SVM (support vector machine), and distinguishing benign software from malicious software based on software characteristics.
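The classification step of claim 9 can be illustrated with a very small linear SVM. The patent does not say which SVM solver or kernel is used, so the sketch below swaps in a simple full-batch subgradient descent on the regularized hinge loss; the 2-D points, the labels (+1 malicious, -1 benign), and all hyperparameters are made-up illustration values.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b for a linear SVM on 2-D points by subgradient descent.

    Minimizes lam/2 * |w|^2 + mean(hinge loss); labels must be +1 or -1.
    """
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [lam * w[0], lam * w[1]]
        grad_b = 0.0
        for (x0, x1), y in zip(points, labels):
            margin = y * (w[0] * x0 + w[1] * x1 + b)
            if 1 > margin:                     # point violates the margin
                grad_w[0] -= y * x0 / n
                grad_w[1] -= y * x1 / n
                grad_b -= y / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


def predict(w, b, point):
    """Return +1 (malicious) or -1 (benign) for one 2-D embedding."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1


# Hypothetical 2-D projections of application embeddings.
POINTS = [(-2.0, -1.0), (-1.5, -2.0), (-1.0, -1.5), (2.0, 1.0), (1.5, 2.0), (1.0, 1.5)]
LABELS = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(POINTS, LABELS)
predictions = [predict(w, b, p) for p in POINTS]
```

In practice a library SVM with a tuned kernel would normally replace this hand-written trainer; the point here is only to show embeddings going in and benign/malicious labels coming out.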
CN202111034077.7A 2021-09-03 2021-09-03 Heterogeneous network-based Android malicious software detection method Withdrawn CN113901465A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111034077.7A CN113901465A (en) 2021-09-03 2021-09-03 Heterogeneous network-based Android malicious software detection method
CN202211074492.XA CN116010947A (en) 2021-09-03 2022-09-03 Android malicious software detection method based on heterogeneous network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111034077.7A CN113901465A (en) 2021-09-03 2021-09-03 Heterogeneous network-based Android malicious software detection method

Publications (1)

Publication Number Publication Date
CN113901465A true CN113901465A (en) 2022-01-07

Family

ID=79188600

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111034077.7A Withdrawn CN113901465A (en) 2021-09-03 2021-09-03 Heterogeneous network-based Android malicious software detection method
CN202211074492.XA Pending CN116010947A (en) 2021-09-03 2022-09-03 Android malicious software detection method based on heterogeneous network

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202211074492.XA Pending CN116010947A (en) 2021-09-03 2022-09-03 Android malicious software detection method based on heterogeneous network

Country Status (1)

Country Link
CN (2) CN113901465A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114662105A (en) * 2022-03-17 2022-06-24 电子科技大学 Method and system for identifying Android malicious software based on graph node relationship and graph compression
CN114756860A (en) * 2022-02-22 2022-07-15 广州大学 Malicious software detection method based on meta-path

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290238B (en) * 2023-10-10 2024-04-09 湖北大学 Software defect prediction method and system based on heterogeneous relational graph neural network

Also Published As

Publication number Publication date
CN116010947A (en) 2023-04-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220107
