CN109271788B - Android malicious software detection method based on deep learning - Google Patents

Info

Publication number
CN109271788B
Authority
CN
China
Prior art keywords
file
software
android
malicious
apk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810963774.2A
Other languages
Chinese (zh)
Other versions
CN109271788A (en)
Inventor
罗森林
张寒青
潘丽敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810963774.2A priority Critical patent/CN109271788B/en
Publication of CN109271788A publication Critical patent/CN109271788A/en
Application granted granted Critical
Publication of CN109271788B publication Critical patent/CN109271788B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/563Static detection by source code analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to a deep learning-based method for detecting Android malware and belongs to the technical field of computers and information science. The method first extracts features from Android application software: the application files are decompressed and decompiled to extract security-related features. The extracted features cover three aspects: file structure features, security experience features, and N-gram statistical features built from the Dalvik instruction set. The extracted features are then converted to numerical form to construct feature vectors. Finally, a deep neural network (DNN) model is built on the extracted features, and new Android software is classified and identified with the constructed model. Because the method incorporates instruction-set analysis, it is resistant to malware obfuscation. Because detection is based on a deep model, feature learning is enhanced, the rich intrinsic information of large data sets is well represented, and the method adapts more readily to continuously evolving malware.

Description

Android malicious software detection method based on deep learning
Technical Field
The invention relates to an Android malicious software detection method based on deep learning, and belongs to the technical field of computers and information science.
Background
With the continuous development of the mobile internet, smart terminals have become an important part of everyone's life. Android is the most widely used mobile operating system, and its open and flexible ecosystem has led to a flood of malware. How to detect Android malware effectively is therefore a research topic of considerable value. Current mainstream Android malicious code detection methods fall roughly into static detection methods and dynamic detection methods.
1. Dynamic detection method
Dynamic detection and analysis extracts features and performs detection after running the program under test. The method mainly runs the Android application on an Android device and then analyzes the software by collecting data produced while it runs, such as API (application programming interface) call sequences and resource usage. Dynamic analysis has the advantage of not being affected by code packing and obfuscation, but in practice it suffers from difficult data collection and extraction, a high cost of running the software, low code coverage, and the risk that malware detects the runtime environment and evades analysis. Dynamic analysis is therefore used less often in practice.
2. Static detection method
The static detection method mainly scans and analyzes the Android application file and extracts security-related sensitive information and features from it, such as sensitive permissions, system actions and sensitive system calls. These features are then analyzed and generalized to judge whether the application is malware. Compared with dynamic analysis, static analysis has higher code coverage and lower time overhead, and it generally achieves better detection accuracy. It is also the mainstream detection method of current antivirus software. In real environments, however, Android developers often obfuscate and encrypt their code to protect it, and under these conditions static analysis struggles to extract effective features, which leads to misjudgments. Meanwhile, malware evolves rapidly every year, and conventional detection methods have difficulty adapting to the new malware that keeps emerging.
In view of these problems, the present invention proposes a method for classifying malware based on deep learning. On the one hand, static features common to malware are extracted by analyzing the Android application file. On the other hand, Smali source code is obtained by decompiling the Android application file, Dalvik opcodes are extracted from the Smali source code, the instructions are abstracted into classes, and N-gram sequence features are extracted. Finally, the extracted features are normalized and modeled with a deep learning algorithm to identify malware. Because the detection system is based on instruction-set analysis, it is resistant to malware obfuscation. Because detection is based on a deep model, feature learning is enhanced, the rich intrinsic information of large data sets is well represented, and the method adapts more readily to continuously evolving malware.
Disclosure of Invention
The invention aims to solve the problems that conventional Android malware detection methods have low detection accuracy, a limited range of application, and difficulty adapting to newly emerging software, and provides a malware detection method based on deep learning.
The design principle of the invention is as follows. First, features are extracted from the Android application software: the application files are decompressed and decompiled to extract security-related features. The extracted features cover three aspects: file structure features, security experience features, and N-gram statistical features built from the Dalvik instruction set. The extracted features are then converted to numerical form to construct feature vectors. Finally, a deep neural network (DNN) model is built on the extracted features, and new Android software is classified and identified with the constructed model.
The technical scheme of the invention is realized by the following steps:
Step 1, obtain Android positive and negative sample files, and then preprocess the files.
Step 1.1, 2452 malicious Android software libraries are obtained from http://amd.arguslab.org/behaviors, and 21000 normal software libraries are obtained from the Android market.
Step 1.2, decompress each application and extract files such as AndroidManifest.xml.
Step 1.3, decompile the classes.dex file.
Step 2, extract the features of the Android application file.
Step 2.1, extract features from the files obtained in step 1. The extracted features comprise file structure features, security experience features, and N-gram features abstracted from the Dalvik instruction set.
Step 2.2, digitize the extracted features and normalize them to obtain a feature vector for each application.
Step 3, construct a neural network classifier and use it to identify and detect software.
Step 3.1, divide the feature vectors of the database into a training set and a test set, and train a neural network.
Step 3.2, preprocess new software and extract its features, classify it with the constructed neural network classifier, and output the classification result to complete the detection.
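As an illustration of the file structure features named in step 2.1, the following minimal Python sketch flags requested permissions and counts the four component types in a manifest that has already been decoded to plain XML text. The manifest inside an APK is stored as binary XML, so a decoding step is assumed here. The permission list, the function name `structural_features` and the resulting vector length are hypothetical; the patent does not enumerate its 63 structural dimensions, and system actions are omitted.

```python
import xml.etree.ElementTree as ET

# Namespace that ElementTree attaches to android:* attributes.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# HYPOTHETICAL subset of "sensitive" permissions, for illustration only.
SENSITIVE_PERMISSIONS = [
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.RECORD_AUDIO",
]

COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def structural_features(manifest_xml: str) -> list:
    """Return permission flags followed by component counts."""
    root = ET.fromstring(manifest_xml)
    requested = {
        element.get(ANDROID_NS + "name")
        for element in root.iter("uses-permission")
    }
    permission_flags = [1 if p in requested else 0 for p in SENSITIVE_PERMISSIONS]
    component_counts = [len(list(root.iter(tag))) for tag in COMPONENT_TAGS]
    return permission_flags + component_counts
```

The binary flags and raw counts are kept unscaled here on the assumption that normalization happens later, in step 2.2.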
Advantageous effects
Compared with traditional static analysis methods, the method extracts richer software features, including file structure features, experience features and N-gram statistical features based on the Dalvik instruction set. These features characterize Android software more comprehensively and thus yield higher detection accuracy.
Compared with traditional machine learning classification algorithms, deep learning enhances feature learning, represents the rich intrinsic information of large data sets well, and adapts more readily to continuously evolving malware.
Drawings
FIG. 1 is a schematic diagram of an Android malware detection method based on deep learning.
Detailed Description
In order to better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.
The specific process is as follows:
Step 1, obtain Android positive and negative sample files, and then preprocess the files.
Step 1.1, 2452 malicious Android software libraries are obtained from http://amd.arguslab.org/behaviors, and 21000 normal software libraries are obtained from the Android market.
Step 1.2, the Androguard tool is used to extract from each application the files needed for subsequent analysis, such as the AndroidManifest.xml file, the res files and the classes.dex file.
Step 1.3, the classes.dex file is decompiled with the Androguard tool, and the Dalvik opcodes of each Smali file are then extracted.
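The unpacking part of step 1 can be sketched with the Python standard library alone, since an APK is a ZIP archive. This is a minimal illustration, not the patent's Androguard-based procedure: it only groups archive entries by role and does not decode the manifest or decompile classes.dex, which would need a tool such as Androguard or baksmali. The function names and the toy archive are assumptions made for the example.

```python
import io
import zipfile


def list_apk_entries(apk_bytes: bytes) -> dict:
    """Group the entries of an APK (a ZIP archive) by their role in the analysis."""
    groups = {"manifest": [], "dex": [], "res": [], "assets": [], "other": []}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                groups["manifest"].append(name)
            elif "/" not in name and name.endswith(".dex"):
                # classes.dex, classes2.dex, ... sit at the archive root
                groups["dex"].append(name)
            elif name.startswith("res/"):
                groups["res"].append(name)
            elif name.startswith("assets/"):
                groups["assets"].append(name)
            else:
                groups["other"].append(name)
    return groups


def make_toy_apk() -> bytes:
    """Build a tiny in-memory archive that mimics an APK layout (toy data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"<manifest/>")
        archive.writestr("classes.dex", b"dex\n035\x00")
        archive.writestr("res/drawable/icon.png", b"\x89PNG")
        archive.writestr("assets/payload.apk", b"PK")
    return buffer.getvalue()


if __name__ == "__main__":
    for role, names in list_apk_entries(make_toy_apk()).items():
        print(role, names)
```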
Step 2, extract the features of the Android application file.
Step 2.1, extract features from the files obtained in step 1. The first type is structural features: 63 dimensions in total, including the sensitive permissions requested by the APK, the system actions it contains, and the numbers of Activity, Service, Broadcast Receiver and Content Provider components in the application.
The second type is experience features, which are summarized from long-term experience in detecting and analyzing malicious APKs. They comprise whether the resource files contain an executable file, whether the assets folder contains an APK file, the number of image files among the resource files in the APK, and the number of functions with more than 20 parameters. When an APK contains additional executable files beyond the ordinary installation files, malicious code is very likely hidden in them. Malware also tends to contain fewer image files, and its functions with malicious intent tend to take more parameters in order to evade detection. The experience features amount to 4 dimensions.
The third type is N-gram features abstracted from the Dalvik instruction set. Analysis of malware shows that the code implementing its malicious intent is concentrated in particular files, so N-gram features are counted with a single file as the unit, and the per-file counts are then weighted to form the final feature vector. The Dalvik instructions are first divided into 10 classes according to their function, as shown in Table 1. The symbol sequence obtained by abstracting the Dalvik instructions is then collected for each Smali file in the APK. N-gram processing is then applied to the sequence, with 3-grams selected here. With 10 symbol classes there are 10^3 = 1000 possible 3-grams.
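The four experience features described above can be sketched as follows. This is a minimal illustration under stated assumptions: "resource files" is taken to mean entries under `res/` and `assets/`, an executable is recognized only by the ELF magic bytes, the image extension list is a guess, and the Smali sources are assumed to be available as text from an earlier decompilation step. The names `experience_features` and `count_params` are hypothetical.

```python
import io
import re
import zipfile

IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".webp")  # assumed list
ELF_MAGIC = b"\x7fELF"
# Captures the descriptor, e.g. "(ILjava/lang/String;)V", of each .method line.
METHOD_RE = re.compile(r"^\.method\b.*?(\([^)\n]*\)\S+)\s*$", re.MULTILINE)


def count_params(descriptor: str) -> int:
    """Count parameters in a Dalvik method descriptor such as '(I[BLa/B;)V'."""
    params = descriptor[descriptor.index("(") + 1:descriptor.index(")")]
    count, i = 0, 0
    while i < len(params):
        while params[i] == "[":       # array dimensions prefix the element type
            i += 1
        if params[i] == "L":          # object types run up to the next ';'
            i = params.index(";", i)
        i += 1
        count += 1
    return count


def experience_features(apk_bytes, smali_sources, max_params=20):
    """Return [has_executable, assets_has_apk, image_count, big_function_count]."""
    has_executable, assets_has_apk, image_count = 0, 0, 0
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            lower = name.lower()
            if name.startswith(("res/", "assets/")):
                if apk.read(name)[:4] == ELF_MAGIC:
                    has_executable = 1
            if name.startswith("assets/") and lower.endswith(".apk"):
                assets_has_apk = 1
            if name.startswith("res/") and lower.endswith(IMAGE_EXTS):
                image_count += 1
    big_function_count = sum(
        1
        for source in smali_sources
        for descriptor in METHOD_RE.findall(source)
        if count_params(descriptor) > max_params
    )
    return [has_executable, assets_has_apk, image_count, big_function_count]
```

The threshold of 20 parameters comes from the text; everything else about how the features are detected is an illustrative choice.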
For example, if an APK file contains n Smali files, each file has a 1000-dimensional statistical feature vector, denoted $F_n$ and given by formula (1), where $f_{nk}$ is the k-th feature statistic of the n-th file.
$F_n = [f_{n0}, f_{n1}, f_{n2}, \ldots, f_{n999}]$ (1)
The n 1000-dimensional vectors are then weighted and normalized to obtain a new 1000-dimensional feature $F$, which is the final Dalvik bytecode N-gram statistical feature, given by formula (2).
$F = [k_0, k_1, \ldots, k_m, \ldots, k_{999}]$ (2)
Figure BDA0001774430330000041
In the formula, $k_m$ is the m-th feature value of the new statistical feature.
Table 1. Definition of the instruction symbols
Figure BDA0001774430330000042
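The per-file 3-gram counting of formula (1) and the combination step of formula (2) can be sketched as below. Two parts are assumptions, because the corresponding images are not reproduced in the text: the mapping from Dalvik opcodes to 10 symbol classes stands in for Table 1 and is hypothetical, and the combination uses a plain weighted sum (uniform weights by default) followed by division by the total, which may differ from the patent's weighting formula.

```python
from itertools import product

# HYPOTHETICAL stand-in for Table 1: opcode prefix -> class symbol.
PREFIX_TO_SYMBOL = [
    ("move", "M"), ("return", "R"), ("goto", "G"), ("if-", "I"),
    ("iget", "T"), ("sget", "T"), ("aget", "T"),
    ("iput", "P"), ("sput", "P"), ("aput", "P"),
    ("invoke", "V"), ("const", "C"), ("new-", "N"),
]
SYMBOLS = "MRGITPVCNX"  # 10 classes; "X" collects every other opcode

# Fixed position of each of the 10**3 = 1000 possible 3-grams.
TRIGRAM_INDEX = {
    "".join(gram): i for i, gram in enumerate(product(SYMBOLS, repeat=3))
}


def abstract_opcode(opcode: str) -> str:
    """Map one Dalvik opcode (e.g. 'invoke-virtual') to its class symbol."""
    for prefix, symbol in PREFIX_TO_SYMBOL:
        if opcode.startswith(prefix):
            return symbol
    return "X"


def trigram_counts(opcodes) -> list:
    """Per-file vector F_n: counts of each 3-gram of abstracted symbols."""
    symbols = [abstract_opcode(op) for op in opcodes]
    counts = [0] * len(TRIGRAM_INDEX)
    for i in range(len(symbols) - 2):
        counts[TRIGRAM_INDEX["".join(symbols[i:i + 3])]] += 1
    return counts


def combine_files(per_file_counts, weights=None) -> list:
    """Combine per-file vectors into F (ASSUMED: weighted sum, then sum-to-one)."""
    if weights is None:
        weights = [1.0] * len(per_file_counts)
    total = [0.0] * len(TRIGRAM_INDEX)
    for weight, vector in zip(weights, per_file_counts):
        for m, value in enumerate(vector):
            total[m] += weight * value
    norm = sum(total)
    return [value / norm for value in total] if norm else total
```

For instance, the opcode sequence `const/4, invoke-virtual, move-result, return-void` abstracts to `C V M R` under this mapping and contributes one count each to the 3-grams `CVM` and `VMR`.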
Step 2.2, digitize the features extracted in step 2.1 and normalize them to obtain a feature vector for each application. Suppose there is a database D; it can be represented by the following matrix.
Figure BDA0001774430330000051
The database D has n samples; each sample has p attribute dimensions and a target value Y. The target value is 0 or 1, where 1 denotes a positive sample and 0 denotes a negative sample. In this method the feature dimension of each sample is 1067 (63 structural, 4 experience and 1000 N-gram dimensions), and n is 45120.
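A small sketch of how the rows of D might be assembled and normalized follows. The patent does not say which normalization it uses, so per-column min-max scaling is an assumption, as are the function names. The code only shows that the three feature groups concatenate to 63 + 4 + 1000 = 1067 dimensions and that every column ends up in the range [0, 1].

```python
def build_sample(structural, experience, ngram) -> list:
    """Concatenate the three feature groups into one 1067-dimensional row."""
    if (len(structural), len(experience), len(ngram)) != (63, 4, 1000):
        raise ValueError("expected 63 structural, 4 experience, 1000 N-gram values")
    return list(structural) + list(experience) + list(ngram)


def min_max_normalize(rows) -> list:
    """Scale each column to [0, 1]; constant columns map to 0.0 (ASSUMED scheme)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


if __name__ == "__main__":
    sample_a = build_sample([0] * 63, [0, 0, 3, 1], [0.0] * 1000)
    sample_b = build_sample([1] * 63, [1, 0, 9, 1], [0.0] * 1000)
    matrix = min_max_normalize([sample_a, sample_b])
    labels = [0, 1]  # target values Y, one per row
    print(len(matrix), len(matrix[0]), labels)
```

In a cross-validation setting the minimum and maximum would normally be computed on the training portion only and reused for the test portion.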
Step 3, construct a neural network classifier and use it to identify and detect software.
Step 3.1, divide the feature vectors of the database into a training set and a test set, and train a neural network.
Step 3.2, preprocess new software and extract its features, classify it with the constructed neural network classifier, and output the classification result to complete the detection.
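To make the classification in step 3.2 concrete, the sketch below shows only the forward pass of a fully connected network that maps a 1067-dimensional feature vector to a malware probability. The patent does not state layer sizes, activation functions or the training procedure, so the hidden sizes, the ReLU and sigmoid activations, the random initialization and the function names are all assumptions. Training is omitted; in practice it would be done with a deep learning framework.

```python
import math
import random


def init_layers(sizes, seed=0):
    """Create (weights, biases) per layer; sizes such as (1067, 64, 1) are ASSUMED."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def sigmoid(value: float) -> float:
    """Numerically stable logistic function."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)


def forward(layers, features) -> float:
    """Return the predicted probability that the sample is a positive sample."""
    activations = list(features)
    for index, (weights, biases) in enumerate(layers):
        pre = [
            sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, biases)
        ]
        if index == len(layers) - 1:
            activations = [sigmoid(v) for v in pre]     # output layer
        else:
            activations = [max(0.0, v) for v in pre]    # hidden layers (ReLU)
    return activations[0]


if __name__ == "__main__":
    network = init_layers((1067, 64, 1))
    probability = forward(network, [0.5] * 1067)
    print("positive" if probability >= 0.5 else "negative", probability)
```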
Test results: a total of 45120 positive and negative samples were used in the experiment (some files were damaged and could not be extracted), of which 23511 were malicious samples and 21609 were normal software samples. 5-fold cross-validation was then performed on the entire data set with the constructed neural network model. The average accuracy, average recall and average F-value for the positive and negative samples are shown in Table 2. The experimental data in the table show that the method achieves high detection accuracy and a good detection effect.
Table 2. Test results
Figure BDA0001774430330000052
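The evaluation protocol can be sketched as follows: a 5-fold split of sample indices and per-class precision, recall and F1. Treating the reported F-value as F1, and the reported accuracy as per-class precision, are assumptions; the values of Table 2 are not reproduced in the text and are not reconstructed here. The function names are hypothetical, and a real run would shuffle the samples before splitting.

```python
def k_fold_indices(n_samples: int, k: int = 5):
    """Yield (train_indices, test_indices); fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # With the 45120 samples mentioned in the text, each test fold has 9024 samples.
    for train, test in k_fold_indices(45120, k=5):
        print(len(train), len(test))
```

Averaging the three metrics over the five folds, once with `positive=1` and once with `positive=0`, gives per-class averages of the kind the text describes.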
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (1)

1. A deep learning-based Android malicious software detection method is characterized by comprising the following steps:
step 1, obtaining Android positive and negative sample files and preprocessing them, comprising: selecting positive and negative Android application software, and decompressing each APK file to obtain all the files it contains; decompiling the classes.dex file, and extracting the Dalvik opcodes from each Smali file in each APK;
step 2, extracting features from the files obtained in step 1 to obtain a feature vector for each piece of software, comprising: processing the files obtained in step 1 to obtain the experience features and structural features of the APK file and the N-gram statistical features of the Dalvik instruction set, and digitizing and normalizing these features to obtain the feature vector of each piece of software, wherein the structural features comprise 63 dimensions covering the sensitive permissions requested by the APK and the numbers of Activity, Service, Broadcast Receiver and Content Provider components contained in the application; the experience features are mainly features summarized from long-term experience in detecting and analyzing malicious APKs, comprising whether the resource files contain an executable file, whether the assets folder contains an APK file, the number of image files among the resource files in the APK file, and the number of functions with more than 20 parameters, 4 dimensions in total; the N-gram statistical features of the Dalvik opcodes have 1000 dimensions, and, considering that the functional code by which malware realizes its malicious intent is concentrated in particular malicious files, a single Smali file is taken as the unit when counting N-gram features, and the N-gram features counted for each file are then weighted and normalized to form the final feature vector;
step 3, constructing a classification model from the data extracted in step 2, evaluating the model on the data set by 5-fold cross-validation during construction, and finally constructing a neural network classifier and using it to identify and detect software, wherein a fully connected deep neural network is adopted for the malware classification model, so that the model is suited to high-dimensional input data; in addition, deep learning enhances feature learning: the 1067-dimensional features extracted from the APK are combined and transformed during model learning, and deep relationships among the features are mined automatically, so as to adapt to continuously evolving malware and achieve higher malware detection accuracy.
CN201810963774.2A 2018-08-23 2018-08-23 Android malicious software detection method based on deep learning Active CN109271788B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810963774.2A CN109271788B (en) 2018-08-23 2018-08-23 Android malicious software detection method based on deep learning

Publications (2)

Publication Number Publication Date
CN109271788A CN109271788A (en) 2019-01-25
CN109271788B true CN109271788B (en) 2021-10-12

Family

ID=65154347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810963774.2A Active CN109271788B (en) 2018-08-23 2018-08-23 Android malicious software detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN109271788B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069927A (en) * 2019-04-22 2019-07-30 中国民航大学 Malice APK detection method, system, data storage device and detection program
CN110245493A (en) * 2019-05-22 2019-09-17 中国人民公安大学 A method of the Android malware detection based on depth confidence network
CN110363003B (en) * 2019-07-25 2022-08-02 哈尔滨工业大学 Android virus static detection method based on deep learning
CN110717182A (en) * 2019-10-14 2020-01-21 杭州安恒信息技术股份有限公司 Webpage Trojan horse detection method, device and equipment and readable storage medium
CN111460452B (en) * 2020-03-30 2022-09-09 中国人民解放军国防科技大学 Android malicious software detection method based on frequency fingerprint extraction
CN112966272B (en) * 2021-03-31 2022-09-09 国网河南省电力公司电力科学研究院 Internet of things Android malicious software detection method based on countermeasure network
CN113139189B (en) * 2021-04-29 2021-10-26 广州大学 Method, system and storage medium for identifying mining malicious software
CN113656308A (en) * 2021-08-18 2021-11-16 福建卫联科技有限公司 Computer software analysis system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102938040A (en) * 2012-09-29 2013-02-20 中兴通讯股份有限公司 Malicious Android application program detection method, system and device
US8826439B1 (en) * 2011-01-26 2014-09-02 Symantec Corporation Encoding machine code instructions for static feature based malware clustering
CN104376262A (en) * 2014-12-08 2015-02-25 中国科学院深圳先进技术研究院 Android malware detecting method based on Dalvik command and authority combination
CN105205396A (en) * 2015-10-15 2015-12-30 上海交通大学 Detecting system for Android malicious code based on deep learning and method thereof
CN106096405A (en) * 2016-04-26 2016-11-09 浙江工业大学 A kind of Android malicious code detecting method abstract based on Dalvik instruction
CN107169354A (en) * 2017-04-21 2017-09-15 北京理工大学 Multi-layer android system malicious act monitoring method
CN108304720A (en) * 2018-02-06 2018-07-20 恒安嘉新(北京)科技股份公司 A kind of Android malware detection methods based on machine learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100478953C (en) * 2006-09-28 2009-04-15 北京理工大学 Static feature based web page malicious scenarios detection method
CN107577942B (en) * 2017-08-22 2020-09-15 中国民航大学 Mixed feature screening method for Android malicious software detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Investigating the android intents and permissions for malware detection;Fauzia Idrees 等;《2014 Seventh International Workshop on Selected Topics in Mobile and Wireless Computing》;20141010;第354-358页 *

Also Published As

Publication number Publication date
CN109271788A (en) 2019-01-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant