CN109508545B

CN109508545B - Android Malware classification method based on sparse representation and model fusion

Info

Publication number: CN109508545B
Application number: CN201811331646.2A
Authority: CN
Inventors: 文伟平; 胡浩然; 汪子龙
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2018-11-09
Filing date: 2018-11-09
Publication date: 2021-06-04
Anticipated expiration: 2038-11-09
Also published as: CN109508545A

Abstract

The invention discloses a method for classifying Android malware based on sparse representation and model fusion. The behavior features of Android malicious programs are represented with a sparse representation method, and classification is then performed with a Stacking model fusion method, which improves the prediction performance of the model. The method applies sparse representation to the original features extracted from the program to obtain more essential features of the malicious program; the fusion model is fitted on top of the base models, which yields a model with stronger generalization capability and improves the accuracy of Android malware classification.

Description

Android Malware classification method based on sparse representation and model fusion

Technical Field

The invention belongs to the technical field of information security, relates to a malicious software detection technology, and particularly relates to an Android Malware (Android malicious software) classification method based on sparse representation and model fusion.

Background

In the field of mobile security, the popularity and openness of the Android system have made it the target of many hacker attacks, and Android malware has become a major threat to the system.

According to a 2017 Newzoo report, the number of Android phone users worldwide reached 2.3 billion. According to a 2017 Tencent report, the number of users infected with Android phone viruses reached 188 million. With so many Android phone users and phone viruses, manual detection alone is inadequate, and automatic classification and detection of malicious programs is urgently needed. With machine learning, the behavior features of Android malicious programs can be learned so that a model classifies malicious programs automatically. However, existing machine-learning approaches to malicious code detection mainly feed the extracted features directly into a single model for training, and their detection performance in practice is very limited.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides an Android malicious program classification method based on sparse representation and model fusion. It classifies Android malicious programs efficiently and accurately with machine learning, so that malicious program identification moves toward automation and the practical need for classifying Android malicious programs is better met.

For brevity, the following abbreviations are used, each given with its full name:

RF: Random Forest;

ET: Extremely Randomized Trees (Extra Trees);

AB: AdaBoost;

GBDT: Gradient Boosting Decision Tree;

XgBoost: Extreme Gradient Boosting.

The method learns the behavior features of Android programs. Whereas the prior art learns directly from the extracted features, the invention adopts sparse representation: the K-SVD algorithm (a classical dictionary training algorithm) is used to sparsely represent the original features extracted from the malicious program, so that more essential features can be found. Likewise, whereas the prior art usually trains a single model on the features, the invention adopts a Stacking model fusion method to improve the overall prediction performance of the model.

The technical scheme provided by the invention is as follows:

An Android malware classification method based on sparse representation and model fusion, characterized in that more essential behavior features of Android malicious code are mined with a sparse representation method, and a Stacking model fusion method is adopted to improve the prediction performance of the model; the method comprises the following steps:

A. Extracting the behavior features of the Android malicious programs by performing the following operations:

A1. downloading and installing the open-source QEMU emulator;

A2. for a data set of Android malicious programs, running each program in the data set on the QEMU emulator and monitoring the system API calls it makes;

A3. obtaining the time-ordered API call sequence and its related information (including class name, function name and function parameters), labeling the virus type, and storing the result in a virus library;

B. Sparsely representing the API call sequence data, which serve as the behavior features of the malicious programs, by performing the following operations:

B1. letting F be an n × p matrix of malicious program/code behavior features, where n is the number of samples (malicious program/code behavior feature records) and p is the dimension of the behavior features extracted from the malicious program code;

B2. training and learning by using a K-SVD algorithm, wherein an objective function is as follows:

D, X = argmin ||X||₀  s.t.  ||F - D*X||₂ ≤ ε

where D is the dictionary set learned from the data set, X is the sparse representation of the data set, and ε is the maximum value of the error allowed to reconstruct the feature matrix.

B3. obtaining, by this learning, the sparse representation X' of the behavior feature matrix of the malicious code;

C. Performing Stacking fusion of the models by the following operations:

C1. selecting {RF, ET, AB, GBDT} as the base models of the first layer, which make predictions on X' and output the probability of each class;

C2. taking XgBoost as the fusion model of the second layer, which takes the predictions of the first-layer base models as input and outputs the final classification result, namely the malicious program type;

Through the above steps, Android malware classification based on sparse representation and model fusion is achieved.
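Step B1 takes a numeric n × p matrix F as given, but the text does not say how a time-ordered API call sequence becomes a row of F. The following is a minimal sketch under one possible assumption, namely that each column counts the occurrences of one API name; the function `build_feature_matrix` and the API names in `traces` are hypothetical and are not taken from the patent.

```python
from collections import Counter


def build_feature_matrix(call_sequences):
    """Encode API call sequences as an n x p count matrix F.

    Assumption (not from the patent): column j holds the number of times
    the j-th API name occurs in a sample's call sequence.
    """
    # A fixed, sorted vocabulary gives every sample the same p columns.
    vocabulary = sorted({api for seq in call_sequences for api in seq})
    rows = []
    for seq in call_sequences:
        counts = Counter(seq)
        rows.append([counts.get(api, 0) for api in vocabulary])
    return vocabulary, rows


# Hypothetical call traces for two samples.
traces = [
    ["getDeviceId", "sendTextMessage", "sendTextMessage"],
    ["getDeviceId", "openConnection"],
]
vocabulary, F = build_feature_matrix(traces)
```

Counting discards the call order; an encoding that keeps order (for example, counts of consecutive call pairs) could be substituted without changing the later steps, since they only require a fixed-width matrix.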

Compared with the prior art, the invention has the beneficial effects that:

The invention provides an Android malware classification method based on sparse representation and model fusion, which uses a sparse representation method to describe more essential features of the code and a Stacking model fusion method to improve the prediction performance of the model. The method applies sparse representation to the original features extracted from the program to obtain more essential features of the Android malicious program; the fusion model is fitted on top of the base models, which yields a model with stronger generalization capability and improves the accuracy of Android malware classification.

Drawings

FIG. 1 is a block flow diagram of the method of the present invention.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

Fig. 1 shows the implementation flow of the Android malware classification method based on sparse representation and model fusion provided by the invention. A specific embodiment of the invention is as follows:

A. Data sets of various Android malware are collected in the following manner:

A1. applying to download Android malicious program data from open-source organizations (for example, the Android Malware Dataset, http://amd.arguslab.org/), and placing the downloaded Android malicious program apps in a folder to build a custom Android malicious program sample set;

B. The behavior features of the malicious programs are extracted in the following way:

B1. downloading and installing the open-source QEMU emulator, and running the malicious program samples in the QEMU environment;

B2. identifying the main process of each sample, since different types of malicious programs have different main processes;

B3. enumerating the current main processes to obtain a list of monitored processes;

B4. for each monitored process, capturing API calls by comparing the process PC pointer with each called main-program API, to obtain the time-ordered API call sequence and its related information (class name, function name and function parameters);

B5. storing this information (the time-ordered API call sequence and its related information) in the virus library and labeling the virus type.

C. Sparse representation of the behavior features is performed in the following way:

Training and learning are carried out with the K-SVD algorithm, which learns from the data set a dictionary set D and a sparse representation X of the data set:

C1. randomly initializing the dictionary D for the K-SVD algorithm;

C2. keeping the dictionary D fixed and solving for the sparse code X of each sample, with the objective function:

D, X = argmin ||X||₀  s.t.  ||F - D*X||₂ ≤ ε

where D is the dictionary set learned from the data set, X is the sparse representation of the data set, ε is the maximum allowed error in reconstructing the feature matrix, and F is the n × p behavior feature matrix, n being the number of samples and p the sample dimension;

C3. updating the dictionary D together with the corresponding non-zero coefficients of X;

C4. repeating steps C2 and C3 until convergence, which yields the sparse representation X' of the behavior features;
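Steps C1 to C4 can be sketched in NumPy as follows. This is an illustrative sketch, not the patented implementation, and it rests on several assumptions the text does not state: Orthogonal Matching Pursuit is used as the sparse coder in step C2, a cap `n_nonzero` on the number of non-zero coefficients is applied alongside the error bound ε (`tol`), a fixed number of iterations stands in for a convergence test, and samples are stored as columns (the sketch factorizes the transpose of F). The names `omp` and `ksvd` are hypothetical.

```python
import numpy as np


def omp(D, y, n_nonzero, tol):
    """Greedy sparse code of y over dictionary D (Orthogonal Matching Pursuit)."""
    x = np.zeros(D.shape[1])
    support, coef = [], np.zeros(0)
    residual = y.copy()
    while len(support) < n_nonzero and np.linalg.norm(residual) > tol:
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Least-squares refit of y on all atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x


def ksvd(Y, n_atoms, n_nonzero, tol=1e-6, n_iter=10, seed=0):
    """K-SVD on Y (one sample per column); returns dictionary D and codes X."""
    rng = np.random.default_rng(seed)
    # C1: random initial dictionary with unit-norm columns (atoms).
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):  # C4: fixed iteration count (assumption)
        # C2: keep D fixed and sparse-code every sample.
        X = np.column_stack([omp(D, y, n_nonzero, tol) for y in Y.T])
        # C3: update each atom and its non-zero coefficients.
        for k in range(n_atoms):
            users = np.flatnonzero(X[k])  # samples that use atom k
            if users.size == 0:
                continue
            X[k, users] = 0.0
            # Reconstruction error of those samples without atom k.
            E = Y[:, users] - D @ X[:, users]
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]            # best rank-1 fit: new atom
            X[k, users] = s[0] * Vt[0]   # and its coefficients
    return D, X


rng = np.random.default_rng(0)
F = rng.standard_normal((40, 8))   # random stand-in for the n x p matrix F
D, X = ksvd(F.T, n_atoms=12, n_nonzero=3)
X_prime = X.T                      # n x n_atoms sparse representation X'
```

Each row of `X_prime` has at most `n_nonzero` non-zero entries and plays the role of X' in the model-fusion stage.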

D. Model training and model fusion are carried out in the following way:

D1. calling RF, ET, AB and GBDT from the Python library sklearn.ensemble as the base models of the first layer, and calling GridSearchCV from sklearn.model_selection to tune their parameters automatically while training on X';

D2. dividing X' into a training set and a test set and applying 5-fold cross-validation to the training set: each time, 1 fold is held out and the other 4 folds are used for training, and the trained model predicts both the held-out fold and the test set. This process is repeated 5 times, the predictions (the probability of each class) are saved, and the whole procedure is repeated for every base model;

D3. since the output is a distribution of a discrete random variable, the arithmetic mean of the 5 predictions on the test set is taken as the final test-set prediction;

D4. taking the predictions of the first-layer base models as the input of the second-layer model, which uses XgBoost as the Stacking fusion model, and training XgBoost to obtain the trained second-layer model;

D5. taking the first-layer test-set prediction obtained in step D3 as the input of the second-layer model and predicting the final result.
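Steps D1 to D5 can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions: the data come from `make_classification` as a synthetic stand-in for X' and the malware labels, the parameter grid is deliberately tiny and purely illustrative, and `GradientBoostingClassifier` is substituted for the second layer when the `xgboost` package is not installed. The alias `MetaModel` is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)

try:
    from xgboost import XGBClassifier as MetaModel  # second layer per the text
except ImportError:
    # Stand-in (assumption) so the sketch still runs without xgboost.
    from sklearn.ensemble import GradientBoostingClassifier as MetaModel

# Synthetic stand-in for the sparse representation X' and the class labels.
X, y = make_classification(n_samples=200, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
n_classes = len(np.unique(y))

base_models = {
    "RF": RandomForestClassifier(random_state=0),
    "ET": ExtraTreesClassifier(random_state=0),
    "AB": AdaBoostClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
}
grid = {"n_estimators": [10, 20]}  # illustrative grid, not from the patent

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
meta_train, meta_test = [], []
for name, model in base_models.items():
    # D1: tune each base model with GridSearchCV.
    tuned = GridSearchCV(model, grid, cv=3).fit(X_tr, y_tr).best_estimator_
    oof = np.zeros((len(X_tr), n_classes))
    test_pred = np.zeros((len(X_te), n_classes))
    # D2: train on 4 folds, predict the held-out fold and the test set.
    for fit_idx, hold_idx in folds.split(X_tr, y_tr):
        clf = clone(tuned).fit(X_tr[fit_idx], y_tr[fit_idx])
        oof[hold_idx] = clf.predict_proba(X_tr[hold_idx])
        # D3: arithmetic mean of the 5 test-set predictions.
        test_pred += clf.predict_proba(X_te) / folds.get_n_splits()
    meta_train.append(oof)
    meta_test.append(test_pred)

# D4: train the second layer on the out-of-fold class probabilities.
meta_model = MetaModel().fit(np.hstack(meta_train), y_tr)
# D5: final prediction from the averaged test-set probabilities.
y_pred = meta_model.predict(np.hstack(meta_test))
```

Training the second layer only on out-of-fold predictions means it never sees a base-model output for a sample that the base model was fitted on, which is the usual reason for the hold-out scheme in step D2.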

It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

1. An Android malware classification method based on sparse representation and model fusion, characterized in that the behavior features of Android malicious programs are represented with a sparse representation method, and classification prediction is performed with a Stacking model fusion method, so that the prediction performance of the model is improved; the method comprises the following steps:

A. extracting the behavior features of the Android malicious programs, by performing the following operations:

A1. downloading and installing a QEMU emulator;

A3. obtaining the time-ordered API call sequence and its related information, labeling the virus type, and storing the result in a virus library;

B. sparsely representing the time-ordered API call sequence data as the behavior features of the malicious programs, by specifically performing the following operations:

B1. letting F be an n × p matrix of behavior features of the malicious programs, where n is the number of malicious programs and p is the dimension of the behavior features extracted from the malicious programs;

D, X = argmin ||X||₀  s.t.  ||F - D*X||₂ ≤ ε

wherein D is the dictionary set learned from the data set; X is the sparse representation of the data set; ε is the maximum allowed error in reconstructing the feature matrix;

B3. learning to obtain the sparse representation X' of the behavior feature matrix of the malicious programs;

C1. selecting {RF, ET, AdaBoost, GBDT} as the base models of the first layer, which make predictions on X' and output the probability of each class;

C2. taking XgBoost as the fusion model of the second layer, which takes the predictions of the first-layer base models as input and outputs the final classification result, namely the malicious program type;

2. The classification method as claimed in claim 1, wherein the data set of the android malware is obtained by network download.

3. The classification method according to claim 1, wherein step A3 specifically comprises: for each monitored process, capturing API calls by comparing the process PC pointer with each called main-program API, to obtain the time-ordered API call sequence and its related information; the related information includes the class name, function name and function parameters.

4. The classification method according to claim 1, wherein step C specifically calls RF, ET, AdaBoost and GBDT from the Python library sklearn.ensemble as the base models of the first layer, and calls GridSearchCV from sklearn.model_selection to perform automatic parameter-tuning training on X'.

5. The classification method according to claim 4, wherein, for each base model, X' is divided into a training set and a test set and 5-fold cross-validation is adopted: each time, 1 fold of data is held out and the other 4 folds are used for training; the trained model then predicts both the held-out data and the test set; the prediction process is repeated over multiple cycles, and each prediction yields the probability of each class.

6. The classification method according to claim 5, wherein the multiple results obtained from the multiple predictions on the test set are arithmetically averaged to give the prediction result.