CN109508545B - Android Malware classification method based on sparse representation and model fusion - Google Patents
Android Malware classification method based on sparse representation and model fusion
- Publication number
- CN109508545B (application number CN201811331646.2A)
- Authority
- CN
- China
- Prior art keywords
- model
- android
- sparse representation
- malicious
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Virology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Debugging And Monitoring (AREA)
- Stored Programmes (AREA)
Abstract
The invention discloses a method for classifying Android malware based on sparse representation and model fusion. A sparse representation method is used to represent the behavior characteristics of Android malicious programs, and a Stacking model fusion method is then used for classification and prediction, which improves the prediction performance of the model. The method applies sparse representation to the original features extracted from a program to obtain more essential features of the malicious program; the fusion model is fitted on top of the base models, which yields a model with stronger generalization ability and improves the accuracy of Android malware classification.
Description
Technical Field
The invention belongs to the technical field of information security, relates to malware detection technology, and in particular relates to an Android malware classification method based on sparse representation and model fusion.
Background
In the field of mobile security, the popularity and openness of the Android system have made it the target of many hacker attacks, and Android malware has become a major threat to the system.
According to a 2017 Newzoo report, the number of Android phone users worldwide reached 2.3 billion. According to a 2017 Tencent report, the number of users infected with Android phone viruses reached 188 million. With so many users and so many viruses, manual detection alone is inadequate, and the need for automatic classification and detection of malicious programs is urgent. With machine learning, the behavior characteristics of Android malicious programs can be learned so that a model classifies malicious programs automatically. However, existing machine learning techniques for detecting malicious code mainly feed the extracted features directly into a single model for training, and the detection performance achieved in practice is very limited.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an Android malicious program classification method based on sparse representation and model fusion. Android malicious programs are classified efficiently and accurately by machine learning, so that malicious program identification becomes largely automated and the practical need for classifying Android malicious programs is better met.
For brevity, the following abbreviations are used herein, listed with their full names:
RF: Random Forest;
ET: Extremely Randomized Trees (Extra Trees);
AB: AdaBoost;
GBDT: Gradient Boosting Decision Tree;
XgBoost: Extreme Gradient Boosting.
The method learns the behavior characteristics of Android programs. Whereas the prior art learns directly from the extracted features, the invention adopts a sparse representation method: the K-SVD algorithm (a classical dictionary training algorithm) is used to sparsely represent the original features extracted from the malicious program, so that more essential features can be discovered. And whereas the prior art usually trains a single model on the features, the invention adopts a Stacking model fusion method to improve the overall prediction performance of the model.
The technical scheme provided by the invention is as follows:
An Android malware classification method based on sparse representation and model fusion, characterized in that more essential behavior characteristics of Android malicious code are mined by a sparse representation method, and a Stacking model fusion method is adopted to improve the prediction performance of the model; the method comprises the following steps:
A. extracting the behavior characteristics of the Android malicious programs by the following operations:
A1. downloading and installing the open-source QEMU emulator;
A2. for a data set of Android malicious programs, running each malicious program in the data set on the QEMU emulator and monitoring the system APIs (application programming interfaces) that the program calls;
A3. obtaining the time-ordered API call sequence and its related information (including class name, function name and function parameters), labeling the virus type, and storing the result in a virus library;
B. sparsely representing the API call sequence data as the behavior characteristics of the malicious programs by the following operations:
B1. letting F be an n × p matrix of malicious program/code behavior characteristics, where n is the number of samples (malicious program/code behavior characteristic data) and p is the dimension of the behavior characteristics extracted from the malicious program code;
B2. training and learning by using a K-SVD algorithm, wherein an objective function is as follows:
$$(D, X) = \arg\min_{D,X} \|X\|_0 \quad \text{s.t.} \quad \|F - DX\|_2 \le \varepsilon$$
where D is the dictionary set learned from the data set, X is the sparse representation of the data set, and ε is the maximum value of the error allowed to reconstruct the feature matrix.
B3. Learning to obtain sparse representation X' of a behavior characteristic matrix of the malicious code;
C. performing Stacking fusion of the models by the following operations:
C1. selecting {RF, ET, AB, GBDT} as the base models of the first layer, which make predictions on X' and output the probability corresponding to each class;
C2. taking XgBoost as the fusion model of the second layer, which takes the prediction results of the first-layer base models as input and outputs the final classification result, namely the malicious program type;
through the steps, the Android Malware classification based on sparse representation and model fusion is achieved.
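Step B1 presupposes an n × p matrix F, but the text does not say how a time-ordered API call sequence becomes a fixed-length row of F. The following is a minimal sketch of one possible encoding, assuming unigram and bigram counts over API names; the function `build_feature_matrix` and the API names in the example are hypothetical and are not taken from the patent.

```python
from collections import Counter

import numpy as np


def build_feature_matrix(call_sequences, vocabulary=None):
    """Turn API call sequences into an n x p count matrix F.

    Assumption: each sample is a list of API names, and the features are
    unigram and bigram counts. The source does not specify the encoding.
    """
    def ngrams(seq):
        # Unigrams keep call frequency; bigrams keep some ordering information.
        return [(a,) for a in seq] + list(zip(seq, seq[1:]))

    counts = [Counter(ngrams(seq)) for seq in call_sequences]
    if vocabulary is None:
        vocabulary = sorted({gram for c in counts for gram in c})
    index = {gram: j for j, gram in enumerate(vocabulary)}

    F = np.zeros((len(call_sequences), len(vocabulary)))
    for i, c in enumerate(counts):
        for gram, k in c.items():
            if gram in index:  # n-grams outside the vocabulary are ignored
                F[i, index[gram]] = k
    return F, vocabulary


# Hypothetical API names, for illustration only.
sequences = [
    ["getDeviceId", "openConnection", "sendTextMessage"],
    ["getDeviceId", "getDeviceId", "openConnection"],
]
F, vocab = build_feature_matrix(sequences)
```

Passing the training vocabulary back in via `vocabulary` keeps p fixed when new samples are encoded, which the dictionary learned in step B2 would require.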
Compared with the prior art, the invention has the beneficial effects that:
The invention provides an Android malware classification method based on sparse representation and model fusion, which uses a sparse representation method to describe more essential characteristics of the code and a Stacking model fusion method to improve the prediction performance of the model. The method applies sparse representation to the original features extracted from a program to obtain more essential features of the Android malicious program; the fusion model is fitted on top of the base models, which yields a model with stronger generalization ability and improves the accuracy of Android malware classification.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
Fig. 1 shows the implementation process of the Android malware classification method based on sparse representation and model fusion provided by the invention. The specific embodiment of the invention is as follows:
A. collecting data sets of various Android malware in the following manner:
A1. applying to open-source organizations for Android malicious program data (for example, the Android Malware Dataset, http://amd.arguslab.org/), and placing the downloaded and collected Android malicious program apps in one folder to build an in-house sample set of Android malicious programs;
B. extracting the behavior characteristics of the malicious program in the following way:
B1. downloading and installing the open-source QEMU emulator, and running the malicious program samples in the QEMU environment;
B2. identifying the main process of each sample; the main process differs between types of malicious programs;
B3. enumerating the current main processes to obtain a list of monitored processes;
B4. for each monitored process, comparing the process PC pointer against each called main-program API to capture the API calls, thereby obtaining the time-ordered API call sequence and its related information (class name, function name and function parameters);
B5. storing this information (the time-ordered API call sequence and its related information) in a virus library and labeling the virus type.
C. Sparse representation of the behavior characteristics is performed in the following way:
training and learning are carried out with the K-SVD algorithm, which learns a dictionary set D and a sparse representation X of the data set from the data set;
C1. randomly initializing the dictionary D for the K-SVD algorithm;
C2. keeping the dictionary D unchanged and solving for the sparse code X of each sample, with the objective function:
$$(D, X) = \arg\min_{D,X} \|X\|_0 \quad \text{s.t.} \quad \|F - DX\|_2 \le \varepsilon$$
where D is the dictionary set learned from the data set, X is the sparse representation of the data set, and ε is the maximum error allowed when reconstructing the feature matrix; F is the n × p matrix of behavior characteristics, n is the number of samples, and p is the sample dimension;
C3. updating the dictionary D and the corresponding non-zero codes in X;
C4. repeating steps C2 and C3 until convergence, thereby learning the sparse representation X' of the behavior characteristics;
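Steps C1 to C4 can be illustrated with a compact NumPy sketch of K-SVD. It rests on several assumptions that the text does not state: samples are stored as columns (so the input is the transpose of F), sparse coding uses orthogonal matching pursuit with a fixed number of non-zero coefficients instead of the error bound ε, and a fixed iteration count stands in for the convergence test. The names `omp`, `ksvd`, `n_atoms` and `n_nonzero` are illustrative.

```python
import numpy as np


def omp(D, y, n_nonzero):
    """Greedy orthogonal matching pursuit for a single sample y."""
    residual, support, coef = y.copy(), [], np.zeros(0)
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Least-squares fit of y on the atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x


def ksvd(Y, n_atoms, n_nonzero, n_iter=10, seed=0):
    """Minimal K-SVD: Y is p x n (one sample per column)."""
    rng = np.random.default_rng(seed)
    # C1: random initial dictionary with unit-norm atoms.
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):  # C4: fixed iteration budget (assumption)
        # C2: keep D fixed and sparse-code every sample.
        X = np.column_stack([omp(D, y, n_nonzero) for y in Y.T])
        # C3: update each atom together with its non-zero codes.
        for k in range(n_atoms):
            users = np.flatnonzero(X[k])
            if users.size == 0:
                continue
            X[k, users] = 0.0
            # Residual of the samples using atom k, without atom k's share.
            E = Y[:, users] - D @ X[:, users]
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, users] = s[0] * Vt[0]
    return D, X


# Hypothetical data: n = 40 samples with p = 8 behavior features.
rng = np.random.default_rng(1)
F = rng.standard_normal((40, 8))
D, X = ksvd(F.T, n_atoms=12, n_nonzero=3)
X_prime = X.T  # one sparse code per sample, used as X' downstream
```

Fixing the sparsity level rather than ε makes every row of `X_prime` have at most `n_nonzero` non-zero entries, which is convenient for a demonstration; an ε-bounded variant would instead stop the pursuit once the residual norm falls below ε.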
D. model training and model fusion, in the following way:
D1. calling RF, ET, AB and GBDT from the Python library sklearn.ensemble as the base models of the first layer, and calling GridSearchCV from sklearn.model_selection to perform automatic hyperparameter-tuning training on X';
D2. dividing X' into a training set and a test set and applying 5-fold cross-validation to the training set: in each round, 1 fold is held out and the other 4 folds are used for training, and the trained model then predicts on the held-out fold and on the test set. This process is cycled 5 times, the prediction results (the probability corresponding to each class) are saved, and the whole procedure is repeated for each base model;
D3. since the output is a distribution over a discrete random variable, taking the arithmetic mean of the 5 prediction results on the test set as the final test-set prediction of the first layer;
D4. taking the prediction results of the first-layer base models as the input of the second-layer model, which uses XgBoost as the stacking fusion model, and training XgBoost to obtain the trained second-layer model;
D5. taking the first-layer test-set prediction obtained in step D3 as the input of the second-layer model and predicting the final result.
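Steps D2 to D5 can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: synthetic data from `make_classification` stands in for X', the GridSearchCV tuning of step D1 is omitted for brevity, and if the `xgboost` package is not installed the sketch falls back to `GradientBoostingClassifier` for the second layer, which is a substitution and not the model named in the text. The helper `out_of_fold` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, train_test_split

try:
    from xgboost import XGBClassifier as MetaModel  # second layer named in the text
except ImportError:
    MetaModel = GradientBoostingClassifier  # substitution so the sketch still runs


def out_of_fold(make_model, X_tr, y_tr, X_te, n_classes, n_splits=5):
    """D2/D3: held-out probabilities for training rows, averaged ones for test rows."""
    oof = np.zeros((len(X_tr), n_classes))
    test_pred = np.zeros((len(X_te), n_classes))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, held_idx in folds.split(X_tr, y_tr):
        model = make_model().fit(X_tr[fit_idx], y_tr[fit_idx])
        oof[held_idx] = model.predict_proba(X_tr[held_idx])
        test_pred += model.predict_proba(X_te) / n_splits  # arithmetic mean
    return oof, test_pred


# Synthetic stand-in for the sparse features X' and the malware family labels.
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

base_models = [
    lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    lambda: ExtraTreesClassifier(n_estimators=50, random_state=0),
    lambda: AdaBoostClassifier(random_state=0),
    lambda: GradientBoostingClassifier(random_state=0),
]
parts = [out_of_fold(m, X_tr, y_tr, X_te, n_classes=3) for m in base_models]
meta_train = np.hstack([p[0] for p in parts])  # D4: second-layer training input
meta_test = np.hstack([p[1] for p in parts])   # D5: second-layer test input

meta = MetaModel().fit(meta_train, y_tr)
final = meta.predict(meta_test)
```

Training the second layer only on held-out predictions keeps it from seeing probabilities that a base model produced for its own training rows, which is the usual reason for the 5-fold scheme in stacking.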
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.
Claims (6)
1. An Android malware classification method based on sparse representation and model fusion, characterized in that the behavior characteristics of Android malicious programs are represented by a sparse representation method, and classification and prediction are performed by a Stacking model fusion method, so that the prediction performance of the model is improved; the method comprises the following steps:
A. extracting the behavior characteristics of the Android malicious programs by performing the following operations:
A1. downloading and installing the QEMU emulator;
A2. for a data set of Android malicious programs, running each malicious program in the data set on the QEMU emulator and monitoring the system APIs (application programming interfaces) that the program calls;
A3. obtaining the time-ordered API call sequence and its related information, labeling the virus type, and storing the result in a virus library;
B. sparsely representing the API call sequence data as the behavior characteristics of the malicious programs, specifically by performing the following operations:
B1. letting F be an n × p matrix of behavior characteristics of the malicious programs, where n is the number of malicious programs and p is the dimension of the behavior characteristics extracted from the malicious programs;
B2. training and learning by using a K-SVD algorithm, wherein an objective function is as follows:
$$(D, X) = \arg\min_{D,X} \|X\|_0 \quad \text{s.t.} \quad \|F - DX\|_2 \le \varepsilon$$
wherein D is the dictionary set learned from the data set; X is the sparse representation of the data set; ε is the maximum error allowed when reconstructing the feature matrix;
B3. learning to obtain sparse representation X' of a behavior characteristic matrix of the malicious program;
C. performing Stacking fusion of the models by the following operations:
C1. selecting {RF, ET, AdaBoost, GBDT} as the base models of the first layer, which make predictions on X' and output the probability corresponding to each class;
C2. taking XgBoost as the fusion model of the second layer, which takes the prediction results of the first-layer base models as input and outputs the final classification result, namely the malicious program type;
through the steps, the Android Malware classification based on sparse representation and model fusion is achieved.
2. The classification method according to claim 1, wherein the data set of Android malicious programs is obtained by downloading from the network.
3. The classification method according to claim 1, wherein step A3 specifically comprises, for each monitored process, capturing API calls by comparing the process PC pointer against each called main-program API, to obtain the time-ordered API call sequence and its related information; the related information includes a class name, a function name and function parameters.
4. The classification method according to claim 1, wherein step C specifically calls RF, ET, AdaBoost and GBDT from the Python library sklearn.ensemble as the base models of the first layer, and specifically calls GridSearchCV from sklearn.model_selection to perform automatic hyperparameter-tuning training on X'.
5. The classification method as claimed in claim 4, wherein, for the basic model, X' is divided into a training set and a testing set, 5-Fold cross validation is adopted, 1 part of data is set aside each time, and the other 4 parts of data are used for training; respectively predicting the reserved data and the test set by using the model obtained by training; the prediction process comprises a plurality of cycles, and the result obtained by each prediction is the probability corresponding to each class.
6. The classification method according to claim 5, wherein a plurality of results obtained by performing a plurality of predictions on the test set are arithmetically averaged to be used as the prediction result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811331646.2A CN109508545B (en) | 2018-11-09 | 2018-11-09 | Android Malware classification method based on sparse representation and model fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811331646.2A CN109508545B (en) | 2018-11-09 | 2018-11-09 | Android Malware classification method based on sparse representation and model fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109508545A CN109508545A (en) | 2019-03-22 |
CN109508545B true CN109508545B (en) | 2021-06-04 |
Family
ID=65748013
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811331646.2A Active CN109508545B (en) | 2018-11-09 | 2018-11-09 | Android Malware classification method based on sparse representation and model fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109508545B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111814147A (en) * | 2020-06-03 | 2020-10-23 | 武汉科技大学 | Android malicious software detection method based on model library |
CN112000954B (en) * | 2020-08-25 | 2024-01-30 | 华侨大学 | Malicious software detection method based on feature sequence mining and simplification |
CN113378156B (en) * | 2021-07-01 | 2023-07-11 | 上海观安信息技术股份有限公司 | API-based malicious file detection method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376262A (en) * | 2014-12-08 | 2015-02-25 | 中国科学院深圳先进技术研究院 | Android malware detecting method based on Dalvik command and authority combination |
CN105893848A (en) * | 2016-04-27 | 2016-08-24 | 南京邮电大学 | Precaution method for Android malicious application program based on code behavior similarity matching |
CN105989287A (en) * | 2015-12-30 | 2016-10-05 | 武汉安天信息技术有限责任公司 | Method and system for judging homology of massive malicious samples |
CN107194251A (en) * | 2017-04-01 | 2017-09-22 | 中国科学院信息工程研究所 | Android platform malicious application detection method and device |
CN108717511A (en) * | 2018-05-14 | 2018-10-30 | 中国科学院信息工程研究所 | A kind of Android applications Threat assessment models method for building up, appraisal procedure and system |
CN108737443A (en) * | 2018-06-14 | 2018-11-02 | 北京大学 | A kind of concealment mail address method based on cryptographic algorithm |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130007848A1 (en) * | 2011-07-01 | 2013-01-03 | Airtight Networks, Inc. | Monitoring of smart mobile devices in the wireless access networks |
CN104102879B (en) * | 2013-04-15 | 2016-08-17 | 腾讯科技(深圳)有限公司 | The extracting method of a kind of message format and device |
-
2018
- 2018-11-09 CN CN201811331646.2A patent/CN109508545B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376262A (en) * | 2014-12-08 | 2015-02-25 | 中国科学院深圳先进技术研究院 | Android malware detecting method based on Dalvik command and authority combination |
CN105989287A (en) * | 2015-12-30 | 2016-10-05 | 武汉安天信息技术有限责任公司 | Method and system for judging homology of massive malicious samples |
CN105893848A (en) * | 2016-04-27 | 2016-08-24 | 南京邮电大学 | Precaution method for Android malicious application program based on code behavior similarity matching |
CN107194251A (en) * | 2017-04-01 | 2017-09-22 | 中国科学院信息工程研究所 | Android platform malicious application detection method and device |
CN108717511A (en) * | 2018-05-14 | 2018-10-30 | 中国科学院信息工程研究所 | A kind of Android applications Threat assessment models method for building up, appraisal procedure and system |
CN108737443A (en) * | 2018-06-14 | 2018-11-02 | 北京大学 | A kind of concealment mail address method based on cryptographic algorithm |
Non-Patent Citations (3)
Title |
---|
Malware Detection using Windows Api Sequence and Machine Learning;Chandrasekar Ravi;《International Journal of Computer Applications》;20120430;第43卷(第17期);第12-16页 * |
一种基于Android内核的APP敏感行为检测方法及实现;文伟平;《信息网络安全》;20160810(第8期);第18-23页 * |
基于API调用序列的Android恶意代码检测方法研究;陈铁明;《浙江工业大学学报》;20180409;第46卷(第2期);第147-154页 * |
Also Published As
Publication number | Publication date |
---|---|
CN109508545A (en) | 2019-03-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Martín et al. | MOCDroid: multi-objective evolutionary classifier for Android malware detection | |
CN109508545B (en) | Android Malware classification method based on sparse representation and model fusion | |
US20150213365A1 (en) | Methods and systems for classification of software applications | |
CN107679403B (en) | Lesso software variety detection method based on sequence comparison algorithm | |
US11568049B2 (en) | Methods and apparatus to defend against adversarial machine learning | |
US11106801B1 (en) | Utilizing orchestration and augmented vulnerability triage for software security testing | |
Nguyen et al. | Comparison of three deep learning-based approaches for IoT malware detection | |
CN111444513B (en) | Firmware compiling optimization option identification method and device for power grid embedded terminal | |
CN109740347A (en) | A kind of identification of the fragile hash function for smart machine firmware and crack method | |
KR20190102451A (en) | Method for detecting malicious application and apparatus thereof | |
Karbab et al. | Petadroid: Adaptive android malware detection using deep learning | |
CN113626241A (en) | Application program exception handling method, device, equipment and storage medium | |
CN111382783A (en) | Malicious software identification method and device and storage medium | |
CN113221109A (en) | Intelligent malicious file analysis method based on generation countermeasure network | |
Ebrahimi et al. | Binary black-box attacks against static malware detectors with reinforcement learning in discrete action spaces | |
CN116522338A (en) | File processing method, equipment and computer readable storage medium | |
Dahl et al. | Stack-based buffer overflow detection using recurrent neural networks | |
CN114595451A (en) | Graph convolution-based android malicious application classification method | |
US10248789B2 (en) | File clustering using filters working over file attributes | |
CN110197068B (en) | Android malicious application detection method based on improved grayish wolf algorithm | |
CN112764791B (en) | Incremental update malicious software detection method and system | |
US11934533B2 (en) | Detection of supply chain-related security threats to software applications | |
Kumar et al. | A survey of deep learning techniques for malware analysis | |
CN112770323A (en) | Mobile malicious application family classification method based on network traffic space time characteristics | |
Singh et al. | Metamorphic detection of repackaged malware |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||