CN109508545B - Android Malware classification method based on sparse representation and model fusion

Info

Publication number
CN109508545B
CN109508545B CN201811331646.2A CN201811331646A CN109508545B CN 109508545 B CN109508545 B CN 109508545B CN 201811331646 A CN201811331646 A CN 201811331646A CN 109508545 B CN109508545 B CN 109508545B
Authority
CN
China
Prior art keywords
model
android
sparse representation
malicious
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811331646.2A
Other languages
Chinese (zh)
Other versions
CN109508545A (en)
Inventor
Wen Weiping (文伟平)
Hu Haoran (胡浩然)
Wang Zilong (汪子龙)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201811331646.2A
Publication of CN109508545A
Application granted
Publication of CN109508545B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Debugging And Monitoring (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an Android Malware classification method based on sparse representation and model fusion. The behavior characteristics of Android malicious programs are represented with a sparse representation method, and classification prediction is then performed with a Stacking model fusion method to improve the prediction performance of the model. The method applies sparse representation to the original features extracted from a program to obtain more essential features of the malicious program; the fusion model is fitted on top of the base models, which yields a model with higher generalization capability and improves the accuracy of Android malware classification.

Description

Android Malware classification method based on sparse representation and model fusion
Technical Field
The invention belongs to the technical field of information security, relates to a malicious software detection technology, and particularly relates to an Android Malware (Android malicious software) classification method based on sparse representation and model fusion.
Background
In the field of mobile security, the popularity and openness of the Android system have made it the target of many hacker attacks, and Android malware has become a huge threat to the Android system.
According to a 2017 Newzoo report, the number of Android mobile phone users worldwide reached 2.3 billion. According to a 2017 Tencent report, the number of users infected with Android mobile phone viruses reached 188 million. With so many Android phone users and phone viruses, manual detection alone is inadequate, and automatic classification and detection of malicious programs is urgently needed. With machine learning, the behavior characteristics of Android malicious programs can be learned so that a model classifies malicious programs automatically. However, existing machine learning techniques for malicious code detection mainly feed the extracted features directly into a single model for training, and their detection performance in practice is very limited.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an Android malicious program classification method based on sparse representation and model fusion. It classifies Android malicious programs efficiently and accurately with machine learning, moves malicious program identification toward automation, and better meets the practical need to classify Android malicious programs.
For convenience, the following abbreviations are used herein, each followed by its full name:
RF: Random Forest;
ET: Extremely Randomized Trees;
AB: AdaBoost;
GBDT: Gradient Boosting Decision Tree;
XgBoost: Extreme Gradient Boosting.
The method learns the behavior characteristics of Android programs. Whereas the prior art learns directly from the extracted features, the invention adopts sparse representation: the K-SVD algorithm (a classical dictionary training algorithm) is used to sparsely represent the original features extracted from a malicious program, so that more essential features can be found. In addition, whereas the prior art usually trains a single model on the features, the invention adopts a Stacking model fusion method to improve the overall prediction performance of the model.
The technical scheme provided by the invention is as follows:
An Android Malware classification method based on sparse representation and model fusion, characterized in that more essential behavior characteristics of Android malicious code are mined with a sparse representation method, and a Stacking model fusion method is adopted to improve the prediction performance of the model; the method comprises the following steps:
A. extracting the behavior characteristics of Android malicious programs by performing the following operations:
A1. downloading and installing the open-source QEMU emulator;
A2. for a data set of Android malicious programs, running each program in the data set on the QEMU emulator and detecting the system APIs (application program interfaces) that it calls;
A3. obtaining the time-ordered API call sequence and its related information (including class name, function name and function parameters), labeling the virus type, and storing the result in a virus library;
B. sparsely representing the API call sequence data as the behavior characteristics of the malicious program by performing the following operations:
B1. letting F be an n × p matrix of malicious program behavior characteristics, where n is the number of samples and p is the dimension of the behavior characteristics extracted from the malicious program code;
B2. training with the K-SVD algorithm, whose objective function is:
$$(D, X) = \arg\min_{D, X} \|X\|_0 \quad \text{s.t.} \quad \|F - D X\|_2 \le \varepsilon$$
where D is the dictionary learned from the data set, X is the sparse representation of the data set, and ε is the maximum allowed error in reconstructing the feature matrix;
B3. learning the sparse representation X' of the behavior characteristic matrix of the malicious code;
C. performing Stacking fusion of the models by performing the following operations:
C1. selecting {RF, ET, AB, GBDT} as the base models of the first layer, which make predictions on X' and output the probability of each class;
C2. taking XgBoost as the fusion model of the second layer, which takes the predictions of the first-layer base models as input and outputs the final classification result, namely the malicious program type;
through the above steps, Android Malware classification based on sparse representation and model fusion is achieved.
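The text does not say how a recorded API call sequence becomes a row of the n × p matrix F. The following is a minimal sketch under one possible assumption, a bag-of-API-calls count vector. The `ApiCall` record, the `build_feature_matrix` helper and the sample sequences are hypothetical illustrations, not part of the disclosed method.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    """One captured call: class name, function name, function parameters."""
    class_name: str
    function_name: str
    params: tuple = ()


def build_feature_matrix(sequences):
    """Turn n call sequences into an n x p count matrix F (assumed encoding)."""
    # p is the number of distinct (class, function) pairs seen in the data set.
    vocab = sorted({(c.class_name, c.function_name) for seq in sequences for c in seq})
    column = {api: j for j, api in enumerate(vocab)}
    F = []
    for seq in sequences:
        counts = Counter((c.class_name, c.function_name) for c in seq)
        row = [0] * len(vocab)
        for api, count in counts.items():
            row[column[api]] = count
        F.append(row)
    return F, vocab


# Two illustrative sequences, one per sample.
sample_a = [
    ApiCall("android.telephony.SmsManager", "sendTextMessage", ("10086", "hi")),
    ApiCall("java.net.URL", "openConnection"),
    ApiCall("android.telephony.SmsManager", "sendTextMessage", ("10086", "again")),
]
sample_b = [ApiCall("java.net.URL", "openConnection")]

F, vocab = build_feature_matrix([sample_a, sample_b])
print(F)  # [[2, 1], [0, 1]]
```

A count vector discards call order; an encoding that keeps the time order (for example n-grams of consecutive calls) would be an equally plausible reading of "time sequence".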
Compared with the prior art, the invention has the beneficial effects that:
The invention provides an Android Malware classification method based on sparse representation and model fusion, which uses a sparse representation method to describe more essential characteristics of the code and a Stacking model fusion method to improve the prediction performance of the model. The method applies sparse representation to the original features extracted from a program to obtain more essential features of the Android malicious program; the fusion model is fitted on top of the base models, which yields a model with higher generalization capability and improves the accuracy of Android malware classification.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
Fig. 1 shows the implementation flow of the Android Malware classification method based on sparse representation and model fusion provided by the invention. A specific embodiment of the invention is as follows:
A. collecting data sets of various Android malware in the following manner:
A1. applying to download Android malicious program data from open-source organizations (for example, the Android Malware Dataset, http://amd.arguslab.org/), and placing the downloaded Android malicious program apps in one folder to build a custom Android malicious program sample set;
B. extracting the behavior characteristics of the malicious programs in the following manner:
B1. downloading and installing the open-source QEMU emulator, and running each malicious program sample in the QEMU environment;
B2. identifying the main program (process) of the sample, since different types of malicious programs have different main processes;
B3. enumerating from the current main process to obtain the list of monitored processes;
B4. for each monitored process, capturing API calls by comparing the process's PC pointer with each API called by the main program, which yields the time-ordered API call sequence and its related information (class name, function name and function parameters);
B5. storing this information (the time-ordered API call sequence and its related information) in a virus library and labeling the virus type.
C. performing sparse representation of the behavior characteristics in the following manner:
training with the K-SVD algorithm to learn, from the data set, a dictionary D and a sparse representation X of the data set:
C1. randomly initializing the dictionary D;
C2. keeping the dictionary D fixed and solving for the sparse code X of each sample under the objective function:
$$(D, X) = \arg\min_{D, X} \|X\|_0 \quad \text{s.t.} \quad \|F - D X\|_2 \le \varepsilon$$
where D is the dictionary learned from the data set, X is the sparse representation of the data set, and ε is the maximum allowed error in reconstructing the feature matrix; F is the n × p matrix of behavior characteristics, n is the number of samples, and p is the sample dimension;
C3. updating the dictionary D and the corresponding non-zero codes in X;
C4. repeating steps C2 and C3 until convergence, which yields the sparse representation X' of the behavior characteristics;
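Steps C1 to C4 can be sketched in NumPy as below. This is a simplified illustration, not the patented implementation, and it rests on several assumptions. Samples are stored as columns, so the code factorizes the transpose of F as D X. The sparse coding step uses a small orthogonal matching pursuit with a fixed number k of non-zero coefficients per sample, which stands in for the error bound ε in the objective. The loop runs a fixed number of iterations instead of testing convergence. The names `omp`, `ksvd`, `n_atoms` and `k` are chosen for this example only.

```python
import numpy as np


def omp(D, y, k):
    """Greedy orthogonal matching pursuit with at most k non-zero coefficients."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Refit all selected atoms jointly, then recompute the residual.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x


def ksvd(F, n_atoms, k, n_iter=10, seed=0):
    """Learn a dictionary D (p x n_atoms) and sparse codes (n x n_atoms) for F (n x p)."""
    rng = np.random.default_rng(seed)
    Y = F.T                                    # columns are samples
    p, n = Y.shape
    D = rng.standard_normal((p, n_atoms))      # C1: random initial dictionary
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    X = np.zeros((n_atoms, n))
    for _ in range(n_iter):                    # C4: repeat C2 and C3
        # C2: dictionary fixed, solve the sparse code of each sample.
        X = np.column_stack([omp(D, Y[:, i], k) for i in range(n)])
        # C3: update each atom and the non-zero codes that use it.
        for j in range(n_atoms):
            users = np.flatnonzero(X[j])
            if users.size == 0:
                continue
            X[j, users] = 0.0
            E = Y[:, users] - D @ X[:, users]  # error without atom j
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]
            X[j, users] = s[0] * Vt[0]
    return D, X.T                              # one sparse row per sample


rng = np.random.default_rng(1)
F_demo = rng.standard_normal((20, 8))          # synthetic stand-in for F
D_demo, X_prime = ksvd(F_demo, n_atoms=12, k=3)
print(D_demo.shape, X_prime.shape)             # shapes: (8, 12) and (20, 12)
```

Each row of `X_prime` plays the role of the sparse representation X' that is passed to the classifiers in the next step.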
D. training and fusing the models in the following manner:
D1. calling RF, ET, AB and GBDT from the Python library sklearn.ensemble as the base models of the first layer, and calling GridSearchCV from sklearn.model_selection to train them on X' with automatic parameter tuning;
D2. dividing X' into a training set and a test set and using 5-fold cross-validation: in each round, 1 fold is held out and the other 4 folds are used for training, and the trained model predicts on both the held-out fold and the test set. This is repeated 5 times, the predictions (the probability of each class) are saved, and the whole procedure is repeated for each base model;
D3. since the output is the distribution of a discrete random variable, taking the arithmetic mean of the 5 predictions on the test set as the final test-set prediction;
D4. taking the predictions of the first-layer base models as the input of the second-layer model, which uses XgBoost as the Stacking fusion model, and training XgBoost to obtain the trained second-layer model;
D5. taking the first-layer test-set prediction obtained in step D3 as the input of the second-layer model and predicting the final result.
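Steps D2 to D5 can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the authors' code. The data is synthetic and stands in for the sparse codes X'. The hyperparameters are placeholders. To keep the example dependent on scikit-learn only, the second layer uses `GradientBoostingClassifier` as a stand-in for XgBoost; a classifier from the xgboost package could be swapped in at that line. The helper names `out_of_fold` and `stack_predict` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, train_test_split

# First-layer base models (RF, ET, AB, GBDT); hyperparameters are placeholders.
BASE_MODELS = {
    "RF": lambda: RandomForestClassifier(n_estimators=30, random_state=0),
    "ET": lambda: ExtraTreesClassifier(n_estimators=30, random_state=0),
    "AB": lambda: AdaBoostClassifier(n_estimators=30, random_state=0),
    "GBDT": lambda: GradientBoostingClassifier(n_estimators=30, random_state=0),
}


def out_of_fold(make_model, X_train, y_train, X_test, n_splits=5):
    """D2-D3: held-out probabilities for training rows, fold-averaged ones for the test set."""
    n_classes = len(np.unique(y_train))
    oof = np.zeros((len(X_train), n_classes))
    test_mean = np.zeros((len(X_test), n_classes))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, hold_idx in folds.split(X_train, y_train):
        model = make_model().fit(X_train[fit_idx], y_train[fit_idx])
        oof[hold_idx] = model.predict_proba(X_train[hold_idx])
        test_mean += model.predict_proba(X_test) / n_splits  # arithmetic mean over folds
    return oof, test_mean


def stack_predict(X_train, y_train, X_test):
    """D4-D5: train the second layer on the stacked probabilities and predict."""
    parts = [out_of_fold(m, X_train, y_train, X_test) for m in BASE_MODELS.values()]
    meta_train = np.hstack([oof for oof, _ in parts])
    meta_test = np.hstack([avg for _, avg in parts])
    # Stand-in for XgBoost (assumption made so the sketch needs only scikit-learn).
    meta_model = GradientBoostingClassifier(random_state=0).fit(meta_train, y_train)
    return meta_train, meta_test, meta_model.predict(meta_test)


# Synthetic stand-in for the sparse codes X' and their malware-type labels.
X_sparse, labels = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sparse, labels, test_size=0.25, stratify=labels, random_state=0
)
meta_train, meta_test, predicted = stack_predict(X_tr, y_tr, X_te)
print("test accuracy:", (predicted == y_te).mean())
```

Training the second layer only on held-out predictions keeps it from seeing probabilities that a base model produced for its own training rows, which is the usual reason for the 5-fold scheme in Stacking.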
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.
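The automatic parameter tuning mentioned in step D1 can be illustrated with GridSearchCV from sklearn.model_selection. The sketch below tunes only the Random Forest base model; the parameter grid, the scoring metric, the synthetic data and the helper name `tune_random_forest` are assumptions made for this example, since the text does not list the values that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def tune_random_forest(X, y):
    """Search a small, illustrative grid with 5-fold cross-validation."""
    param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        cv=5,
        scoring="accuracy",
    )
    search.fit(X, y)
    return search


# Synthetic stand-in for the sparse codes X' and their labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=3, random_state=0
)
search = tune_random_forest(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The fitted `search.best_estimator_` exposes `predict_proba`, so a tuned model can be used directly as a first-layer base model.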

Claims (6)

1. An Android Malware classification method based on sparse representation and model fusion, characterized in that the behavior characteristics of Android malicious programs are represented with a sparse representation method, and classification prediction is performed with a Stacking model fusion method so that the prediction performance of the model is improved; the method comprises the following steps:
A. extracting the behavior characteristics of Android malicious programs by performing the following operations:
A1. downloading and installing the QEMU emulator;
A2. for a data set of Android malicious programs, running each program in the data set on the QEMU emulator and detecting the system APIs (application program interfaces) that it calls;
A3. obtaining the time-ordered API call sequence and its related information, labeling the virus type, and storing the result in a virus library;
B. sparsely representing the API call sequence data as the behavior characteristics of the malicious program by performing the following operations:
B1. letting F be an n × p matrix of malicious program behavior characteristics, where n is the number of malicious programs and p is the dimension of the behavior characteristics extracted from the malicious programs;
B2. training with the K-SVD algorithm, whose objective function is:
$$(D, X) = \arg\min_{D, X} \|X\|_0 \quad \text{s.t.} \quad \|F - D X\|_2 \le \varepsilon$$
where D is the dictionary learned from the data set; X is the sparse representation of the data set; ε is the maximum allowed error in reconstructing the feature matrix;
B3. learning the sparse representation X' of the behavior characteristic matrix of the malicious program;
C. performing Stacking fusion of the models by performing the following operations:
C1. selecting {RF, ET, AdaBoost, GBDT} as the base models of the first layer, which make predictions on X' and output the probability of each class;
C2. taking XgBoost as the fusion model of the second layer, which takes the predictions of the first-layer base models as input and outputs the final classification result, namely the malicious program type;
through the above steps, Android Malware classification based on sparse representation and model fusion is achieved.
2. The classification method as claimed in claim 1, wherein the data set of Android malicious programs is obtained by network download.
3. The classification method according to claim 1, wherein step A3 specifically comprises, for each monitored process, capturing API calls by comparing the process's PC pointer with each API called by the main program to obtain the time-ordered API call sequence and its related information; the related information includes the class name, function name and function parameters.
4. The classification method according to claim 1, wherein in step C, RF, ET, AdaBoost and GBDT are called from the Python library sklearn.ensemble as the base models of the first layer, and GridSearchCV from sklearn.model_selection is called to train them on X' with automatic parameter tuning.
5. The classification method as claimed in claim 4, wherein, for each base model, X' is divided into a training set and a test set and 5-fold cross-validation is adopted: in each round, 1 fold of data is held out and the other 4 folds are used for training; the trained model predicts on the held-out data and on the test set respectively; the prediction process comprises multiple rounds, and the result of each prediction is the probability of each class.
6. The classification method according to claim 5, wherein the results of the multiple predictions on the test set are arithmetically averaged and the average is used as the prediction result.
CN201811331646.2A 2018-11-09 2018-11-09 Android Malware classification method based on sparse representation and model fusion Active CN109508545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811331646.2A CN109508545B (en) 2018-11-09 2018-11-09 Android Malware classification method based on sparse representation and model fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811331646.2A CN109508545B (en) 2018-11-09 2018-11-09 Android Malware classification method based on sparse representation and model fusion

Publications (2)

Publication Number Publication Date
CN109508545A CN109508545A (en) 2019-03-22
CN109508545B true CN109508545B (en) 2021-06-04

Family

ID=65748013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811331646.2A Active CN109508545B (en) 2018-11-09 2018-11-09 Android Malware classification method based on sparse representation and model fusion

Country Status (1)

Country Link
CN (1) CN109508545B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814147A (en) * 2020-06-03 2020-10-23 武汉科技大学 Android malicious software detection method based on model library
CN112000954B (en) * 2020-08-25 2024-01-30 华侨大学 Malicious software detection method based on feature sequence mining and simplification
CN113378156B (en) * 2021-07-01 2023-07-11 上海观安信息技术股份有限公司 API-based malicious file detection method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130007848A1 (en) * 2011-07-01 2013-01-03 Airtight Networks, Inc. Monitoring of smart mobile devices in the wireless access networks
CN104102879B (en) * 2013-04-15 2016-08-17 腾讯科技(深圳)有限公司 The extracting method of a kind of message format and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376262A (en) * 2014-12-08 2015-02-25 中国科学院深圳先进技术研究院 Android malware detecting method based on Dalvik command and authority combination
CN105989287A (en) * 2015-12-30 2016-10-05 武汉安天信息技术有限责任公司 Method and system for judging homology of massive malicious samples
CN105893848A (en) * 2016-04-27 2016-08-24 南京邮电大学 Precaution method for Android malicious application program based on code behavior similarity matching
CN107194251A (en) * 2017-04-01 2017-09-22 中国科学院信息工程研究所 Android platform malicious application detection method and device
CN108717511A (en) * 2018-05-14 2018-10-30 中国科学院信息工程研究所 A kind of Android applications Threat assessment models method for building up, appraisal procedure and system
CN108737443A (en) * 2018-06-14 2018-11-02 北京大学 A kind of concealment mail address method based on cryptographic algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
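Malware Detection using Windows Api Sequence and Machine Learning; Chandrasekar Ravi; International Journal of Computer Applications; 2012-04-30; Vol. 43 (No. 17); pp. 12-16 *
A method and implementation for detecting sensitive APP behavior based on the Android kernel (一种基于Android内核的APP敏感行为检测方法及实现); Wen Weiping (文伟平); Netinfo Security (《信息网络安全》); 2016-08-10 (No. 8); pp. 18-23 *
Research on an Android malicious code detection method based on API call sequences (基于API调用序列的Android恶意代码检测方法研究); Chen Tieming (陈铁明); Journal of Zhejiang University of Technology (《浙江工业大学学报》); 2018-04-09; Vol. 46 (No. 2); pp. 147-154 *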

Also Published As

Publication number Publication date
CN109508545A (en) 2019-03-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant