CN115577357A

CN115577357A - Android malicious software detection method based on stacking integration technology

Info

Publication number: CN115577357A
Application number: CN202211221244.3A
Authority: CN
Inventors: 刘红; 李娟�; 陈莉; 肖云鹏; 李暾; 李茜; 庞育才; 陈南羽; 马婧
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-10-08
Filing date: 2022-10-08
Publication date: 2023-01-06

Abstract

The invention belongs to the technical field of computer security, and particularly relates to an Android malicious software detection method based on a stacking integration technology. The method comprises: obtaining labeled Android software APK data samples from mobile terminals; equalizing the obtained data samples; extracting features from the equalized data; screening the features and reducing their dimensionality according to the information entropy gain value of each feature; establishing an AM-Stacking malicious software detection model; and detecting malicious software according to the screened features. The invention fuses several models with good classification performance by stacking integration, divides the data set using K-fold cross validation, and introduces an attention mechanism. Combined with balanced data samples and a mixed feature processing method, this allows malicious software on existing large software platforms and terminals to be detected more accurately, improving the detection precision.

Description

Android malicious software detection method based on stacking integration technology

Technical Field

The invention belongs to the technical field of computer security, and particularly relates to an Android malicious software detection method based on a stacking integration technology.

Background

The Android system is an operating system mainly for smartphones, introduced by Google. According to survey data from recent years, Android accounts for over 85% of the market share and has gradually penetrated into other manufacturing fields. However, because Android is open source, a large amount of malicious software has exploited this openness to get in, causing a series of threats, mainly privacy disclosure, data theft, and network spyware, and significant economic losses to users. Malware detection has therefore become an urgent and important issue, and a very active topic in the field of computer security.

Existing malware detection methods fall mainly into two categories: static detection and dynamic detection. Static detection identifies malware before an application is executed, while dynamic detection performs the detection task at runtime. Because dynamic detection has high time complexity, is costly, and has difficulty detecting multipath malware, most researchers have proposed static detection methods.

In recent years, traditional Android malicious software detection models have mainly used features extracted from the configuration files and code files of software as training samples. The extracted features are usually selected through manual intervention, so a large amount of the original feature information of the malicious software is lost; the model detects poorly and its detection precision cannot be improved further. In addition, traditional malware detection usually relies on a single model, so its robustness and accuracy leave room for improvement and its generalization capability is weak.

The problems of the prior art are:

1. In the field of malicious software detection, malicious software samples are relatively few, so the proportions of normal and malicious software samples are unbalanced and the detection performance of the model falls short of the expected effect.

2. Some malicious software has a large number of features, and feature redundancy can occur, making the computational complexity of the model high. Efficiently screening the features of malicious software and reducing the dimensionality of the screened features is therefore very important for detection.

3. In malicious software detection, a single detection model easily falls into a local optimum during training, so its generalization performance is weak and its detection effect is poor. How to fuse the advantages of several single models is a problem worth studying.

Disclosure of Invention

In order to solve the technical problem, the invention provides an Android malicious software detection method based on a stacking integration technology, which comprises the following steps of:

s1: acquiring Android software APK data samples with normal and malicious software labels from a mobile terminal;

s2: generating a malicious software sample by adopting a mixed sample generation method, and constructing a sample data set with the quantity balance between the malicious software sample and the normal software sample according to the acquired Android software APK data sample with normal and malicious software labels and the generated malicious software sample;

s3: analyzing each software sample in the sample data set to obtain the software characteristics of each software sample;

s4: calculating an information entropy gain value of each software feature, judging the contribution degree of each software feature to the detection of malicious software according to the size of the information entropy gain value, and extracting the top k features with the largest contribution degree;

s5: performing deep dimensionality reduction on the extracted top k features by using a principal component analysis method to obtain a key feature set, and dividing the key feature set into a training set and a test set;

s6: establishing an AM-Stacking malicious software detection model, which comprises the first-layer base learners KNN, LR and RF and the second-layer meta-learner GBDT;

s7: dividing the training set by using a K-fold cross validation method to obtain a plurality of training sets, respectively training first-layer base learners KNN, LR and RF of the AM-Stacking malicious software detection model according to the plurality of divided training sets, and obtaining training parameters of three base learners under different training sets and generated new training subsets;

s8: training a second-layer meta-learner GBDT of the AM-Stacking malicious software detection model by utilizing a new training subset generated by the base learner to obtain training parameters of the meta-learner GBDT;

s9: respectively detecting the test data sets according to training parameters of the base learner under different training sets, and performing weight division and integration on the detected results by using an attention mechanism to obtain a new test data set;

s10: performing parameter adjustment on the meta-learner GBDT according to the obtained training parameters of the meta-learner GBDT, inputting the new test data set into the parameter-adjusted meta-learner GBDT for further detection, and detecting the malicious software.

Preferably, the malicious software samples are generated by the mixed sample generation method, which is expressed as:

$$M_{new} = \{G(z_1), G(z_2), \ldots, G(z_m)\}, \qquad z = M_i + \delta\,(\hat{M}_i - M_i)$$

wherein $M_{new}$ represents the new malicious software samples, $m$ represents the number of new malicious software samples, $G$ represents the generator of the generative adversarial network, $z$ (indexed $z_1, \ldots, z_m$) represents a noise vector generated by the borderline synthetic minority oversampling technique, $M_i$ represents a boundary sample, $\hat{M}_i$ represents a K-neighbor sample of $M_i$, and $\delta$ represents a random number from 0 to 1.

Preferably, the information entropy gain value of each software feature is calculated as:

$$IG(F_i) = H(S^*) - H(S^* \mid F_i)$$

$$H(S^*) = -\sum_{j=1}^{k} P(s_j)\,\log P(s_j)$$

$$H(S^* \mid F_i) = \sum_{v \in Value(F_i)} P(F_i = v)\, H(S^* \mid F_i = v)$$

wherein $IG(F_i)$ represents the information gain value of the $i$-th feature $F_i$, $H(S^*)$ represents the information entropy of the whole prediction system, $k$ represents the number of classes of the system, $s_j$ represents a possible value of the variable $S^*$, $P(s_j)$ represents the probability of $s_j$, $H(S^* \mid F_i)$ represents the conditional information entropy when the feature $F_i$ alone is used as the classification feature, $Value(F_i)$ represents all possible values of the $i$-th feature $F_i$, and $P(F_i = v)$ denotes the probability that the $i$-th feature takes the value $v$.

Preferably, deep dimensionality reduction is performed on the extracted top k features by a principal component analysis method, which comprises:

converting the extracted top k features into vector form, centering each feature vector, constructing a covariance matrix from the centered feature vectors, performing eigenvalue decomposition on the covariance matrix, sorting the eigenvalues from largest to smallest, taking the eigenvectors corresponding to the first k eigenvalues, mapping the samples through these k eigenvectors into a k-dimensional sample feature matrix, and forming the key feature set from the sample features in the sample feature matrix.

Preferably, each feature vector is centered, which is expressed as:

$$X_i = x_i - \mu, \qquad \mu = \frac{1}{n}\sum_{j=1}^{n} x_j$$

wherein $X_i$ represents the centered feature vector, $x_i$ represents the $i$-th feature vector screened by comparing the information entropy gain values of the software features, that is, the vector composed of all values taken by the $i$-th feature, $\mu$ represents the mean of the feature vectors before centering, and $n$ represents the number of feature vectors.

Preferably, the training set is divided by a K-fold cross validation method, which is expressed as:

$$D_{train(i)} = D_{train} \setminus D_i, \qquad D_{valid(i)} = D_i$$

wherein $D_{train(i)}$ represents the training subset obtained after the $i$-th data subset $D_i$ is removed from the training set $D_{train}$, and $D_{valid(i)}$ is the validation set corresponding to the $i$-th training subset.

Preferably, the training parameters of the three base learners under the different training sets, and the new training subsets they generate, are expressed as:

$$P_{mi} = LM_m(D_{train(i)}), \quad i = 1, 2, \ldots, 5, \quad m = 1, 2, 3$$

$$\tilde{D}_{mi} = P_{mi}(D_{valid(i)})$$

wherein $P_{mi}$ represents the training parameters obtained after the $m$-th base learner is trained on the $i$-th training subset, $LM_m$ represents the $m$-th base learner, and $\tilde{D}_{mi}$ represents the new training subset obtained when the $m$-th base learner, using the training parameters $P_{mi}$, predicts the $i$-th validation set.

Preferably, the second-layer meta-learner GBDT is trained to obtain the training parameters of the meta-learner GBDT, which is expressed as:

$$P' = LM(\tilde{D}_{mi})$$

wherein $P'$ represents the training parameters obtained after the meta-learner is trained on the new training subsets $\tilde{D}_{mi}$ generated by the base learners, and $LM$ represents the meta-learner.

Preferably, S9 specifically includes:

wherein $Test_i$ represents the new test set obtained by precision-weighted summation for each base learner; each base learner predicts the same test set $D_{test}$ under its different training parameters, and these prediction results together form the data set that is weighted; $\beta_i$ denotes the attention mechanism weight score of each base learner; $q$ denotes the query vector; and attSum() represents the precision-weighted summation function of the attention mechanism.

Preferably, the new test data set is input into the parameter-adjusted meta-learner GBDT for further detection, and the malicious software is detected, which is expressed as:

$$R = P'(Test)$$

wherein $R$ represents the final detection result, $Test$ represents the new test set obtained by precision-weighted summation of the first-layer base learners, and $P'$ is the training parameters of the meta-learner.

The beneficial effects of the invention are as follows: the invention acquires data, balances the numbers of malicious and normal software samples in the acquired data, extracts the features of the equalized data, and screens the features and reduces their dimensionality according to the information entropy gain value of each feature. It fuses several models with good classification performance by stacking integration, divides the data set using K-fold cross validation, and introduces an attention mechanism. Combined with the balanced data samples and the mixed feature processing method IG-PCA, the presence of malicious software can be detected more accurately.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a schematic diagram of data sample equalization by BS-GAN according to the present invention;

FIG. 3 is a schematic diagram of key feature extraction by IG-PCA according to the present invention;

FIG. 4 is a schematic diagram of the stacking model of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

An Android malicious software detection method based on a stacking integration technology, as shown in FIG. 1, comprises steps s1 to s10 set out in the Disclosure of Invention, among them:

s7: dividing the training set by using a K-fold cross validation method to obtain a plurality of training sets, respectively training first-layer base learners KNN, LR and RF of the AM-Stacking malicious software detection model according to the divided training sets to obtain training parameters of three base learners under different training sets and generated new training subsets;

s9: respectively detecting the test data sets according to training parameters of the base learner in different training sets, and performing weight division and integration on the detected results by using an attention mechanism to obtain a new test data set;

The data may be obtained from a public data website or from publicly available network-security malware data sets. In either case, labeled Android software APK data samples from mobile terminals need to be acquired.

In malicious software detection, the numbers of normal and malicious software samples often differ greatly, so the obtained data samples are unbalanced and the model detects poorly. Taking into account the correlation among samples and their distribution, a BS-GAN mixed sample generation method is provided, which combines the borderline synthetic minority oversampling technique (Borderline-SMOTE) with a generative adversarial network (GAN). As shown in FIG. 2, Borderline-SMOTE is used to generate a noise vector, and the noise vector is input into the GAN to obtain new malicious samples, alleviating the imbalance of the data samples.

First, the invention generates noise vectors by the borderline synthetic minority oversampling technique. Assume the entire sample set is $S$, the malicious software sample set is $M$, and the normal software sample set is $T$, with $M = \{M_1, M_2, \ldots, M_i, \ldots\}$ and $T = \{T_1, T_2, \ldots, T_i, \ldots\}$.

For each malicious sample $M_i$, the K samples nearest to $M_i$ are found by the K-nearest-neighbor method, with the distance calculated as:

$$dist(M_i, Y_i) = \lVert M_i - Y_i \rVert_2$$

wherein $Y_i$ and $M_i$ are two samples in the space, and $dist(M_i, Y_i)$ represents the Euclidean distance between the two samples.

Suppose $K'$ of the K neighbors belong to normal samples, where obviously $0 \le K' \le K$. If $K/2 \le K' \le K$, the malicious sample is called a boundary sample (Danger). Because boundary samples are more easily misclassified, the algorithm synthesizes new samples only from randomly selected boundary samples, and the noise vector is obtained as:

$$z = M_i + \delta\,(\hat{M}_i - M_i)$$

wherein $M_i$ is the selected boundary sample, $\hat{M}_i$ is a K-neighbor sample of $M_i$, and $\delta$ is a random number from 0 to 1.
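As an illustration only, the boundary-sample selection and interpolation described above might be sketched in Python as follows. The function name `borderline_noise_vectors`, its parameters, and the choice to interpolate towards a neighbor drawn from the malicious set $M$ are assumptions not stated in the source; the sketch also assumes at least two malicious samples are available.

```python
import math
import random


def borderline_noise_vectors(malicious, normal, k=5, count=10, seed=0):
    """Return `count` noise vectors z = M_i + delta * (M_hat_i - M_i)."""
    rng = random.Random(seed)

    # Step 1: find boundary ("Danger") samples among the malicious samples.
    danger = []
    for i, m_i in enumerate(malicious):
        # (distance, is_normal) for every other sample in the whole set S.
        dists = [(math.dist(m_i, y), False)
                 for j, y in enumerate(malicious) if j != i]
        dists += [(math.dist(m_i, y), True) for y in normal]
        nearest = sorted(dists, key=lambda d: d[0])[:k]
        k_normal = sum(1 for _, is_normal in nearest if is_normal)  # K'
        if k / 2 <= k_normal <= k:
            danger.append(i)

    # Step 2: interpolate between a random boundary sample and one of its
    # nearest malicious neighbours (assumption: neighbours are drawn from M).
    vectors = []
    for _ in range(count if danger else 0):
        i = rng.choice(danger)
        m_i = malicious[i]
        neighbours = sorted(
            (j for j in range(len(malicious)) if j != i),
            key=lambda j: math.dist(m_i, malicious[j]),
        )[:k]
        m_hat = malicious[rng.choice(neighbours)]
        delta = rng.random()  # random number in [0, 1)
        vectors.append([a + delta * (b - a) for a, b in zip(m_i, m_hat)])
    return vectors
```

Each returned vector lies on the segment between a boundary sample and one of its malicious neighbors, so it stays inside the region already occupied by malicious samples. When no boundary sample exists, the sketch returns an empty list rather than falling back to another sampling strategy.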

Then, the invention inputs the generated noise vectors into the generative adversarial network (GAN), and generates new malicious samples $M_{new}$ through the generator $G$:

$$M_{new} = \{G(z_1), G(z_2), \ldots, G(z_m)\}$$

wherein $M_{new}$ represents the new malicious software samples, $m$ represents the number of new malicious software samples, $G$ represents the generator of the generative adversarial network, and $z$ (indexed $z_1, \ldots, z_m$) represents a noise vector generated by the borderline synthetic minority oversampling technique.

Manually selected basic features often lack expressive power, while directly using all features introduces feature redundancy. To screen out key features more effectively, the invention introduces a mixed feature processing method, IG-PCA, which learns the interrelation among features and the contribution degree of each feature to the detection of malicious software. As shown in FIG. 3, the features of a software sample are screened from several aspects to obtain a set of key features.

The information entropy gain value is a statistic that shows directly how well a given attribute or feature distinguishes a class of data samples. To calculate the information entropy gain value of a software feature, the overall information content (entropy) of the data samples is first calculated without regard to that feature; the overall information content is then calculated once the samples are partitioned by the values of that feature; and the difference between the two is the information entropy gain value of the feature. The larger the information entropy gain value, the greater the contribution of the feature to classifying the data samples, so the contribution degree of each software feature to the detection of malicious software can be judged from the size of its information entropy gain value.

First, the information gain value of each feature is calculated from the information entropy, and the top k features with the largest information gain are extracted. Suppose the equalized data sample set is $S^*$. The features of all samples in $S^*$ are merged into the overall feature set $F = \{F_1, F_2, \ldots, F_i, \ldots\}$, and the information gain of each feature is then calculated from the information entropy as follows:

The first step: calculate the information entropy $H(S^*)$ of the whole prediction system according to the information entropy formula:

$$H(S^*) = -\sum_{j=1}^{k} P(s_j)\,\log P(s_j)$$

wherein $k$ represents the number of classes of the system, $s_j$ represents a possible value of the variable $S^*$, and $P(s_j)$ represents the probability of $s_j$.

The second step: calculate the conditional information entropy when each feature alone is used as the classification feature:

$$H(S^* \mid F_i) = \sum_{v \in Value(F_i)} P(F_i = v)\, H(S^* \mid F_i = v)$$

wherein $F_i$ denotes the $i$-th feature, $Value(F_i)$ denotes all possible values of the $i$-th feature, and $P(F_i = v)$ denotes the probability that the $i$-th feature takes the value $v$.

The third step: calculate the information gain value of each feature and select the main features according to the size of the gain value. The information gain of a feature is defined as:

$$IG(F_i) = H(S^*) - H(S^* \mid F_i)$$

wherein $IG(F_i)$ represents the information gain value of the $i$-th feature $F_i$, $H(S^*)$ represents the information entropy of the whole prediction system, and $H(S^* \mid F_i)$ represents the conditional information entropy when the feature $F_i$ alone is used as the classification feature.
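The three steps above can be sketched for discrete features as follows. This is a minimal illustration, not the patented implementation: the function names, the use of base-2 logarithms, and the feature names in the usage note are assumptions introduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S*): information entropy of the class labels (base-2, assumed)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(F_i) = H(S*) - H(S* | F_i) for one discrete feature column."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        # Labels of the samples whose feature takes this value.
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Names of the k features with the largest information gain."""
    gains = {name: information_gain(values, labels)
             for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]
```

For example, with labels `[1, 1, 0, 0]`, a hypothetical feature column `[1, 1, 0, 0]` separates the classes completely and has a gain of 1 bit, while `[1, 0, 1, 0]` has a gain of 0 and would be screened out first.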

In practice, the main features extracted by the information entropy method still have very high dimensionality, which results in excessively high detection complexity of the model. Therefore, the extracted main feature set is subjected to deep dimension reduction by using a principal component analysis method.

Deep dimensionality reduction is performed on the extracted top k features by the principal component analysis method as follows:

converting the extracted top k features into vector form, centering each feature vector, constructing a covariance matrix from the centered feature vectors, performing eigenvalue decomposition on the covariance matrix, sorting the eigenvalues in descending order, taking the eigenvectors corresponding to the first k eigenvalues, mapping the samples through these k eigenvectors into a k-dimensional sample feature matrix, and forming the key feature set from the sample features in the sample feature matrix.

To eliminate the influence of differing scales on the covariance, each feature vector needs to be centered, which is expressed as:

$$X_i = x_i - \mu, \qquad \mu = \frac{1}{n}\sum_{j=1}^{n} x_j$$

wherein $X_i$ represents the centered feature vector, $x_i$ represents the $i$-th screened feature vector, $\mu$ represents the mean of the feature vectors before centering, and $n$ represents the number of feature vectors.

From the centered data samples, the covariance matrix $Q$ of the feature samples is obtained as:

$$Q = \frac{1}{n} X X^{T}$$

wherein $n$ represents the number of feature vectors, $X$ represents the centered feature vector matrix, and $X^{T}$ represents the transpose of $X$.

Eigenvalue decomposition is then performed on the covariance matrix $Q$, the eigenvalues are sorted from largest to smallest, and the eigenvectors $A = (\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_k)$ corresponding to the first $k$ eigenvalues are taken. These map the $n$-dimensional feature samples to $k$-dimensional feature samples, and the mapping to the $k$-dimensional sample feature matrix $X'$ can be expressed as:

$$X' = A^{T} X$$

wherein $X'$ represents the mapped $k$-dimensional sample feature matrix, $A^{T}$ represents the transpose of the matrix of eigenvectors corresponding to the first $k$ eigenvalues, and $X$ represents the centered feature vector matrix.
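A minimal NumPy sketch of the centering, covariance, eigenvalue decomposition, and mapping steps is given below. It is an assumption-laden illustration: the function name `pca_reduce` is hypothetical, samples are stored as rows (so the mapping is written `X_centered @ A`, the transpose of $X' = A^{T} X$), and centering subtracts the per-feature mean, which is the usual convention for this layout.

```python
import numpy as np


def pca_reduce(samples, k):
    """Project (N, n) samples onto the k leading principal components.

    Rows are software samples, columns are the screened features.
    Returns the (N, k) reduced matrix and the k largest eigenvalues.
    """
    X = np.asarray(samples, dtype=float)
    X_centered = X - X.mean(axis=0)                      # centering
    Q = X_centered.T @ X_centered / X_centered.shape[0]  # covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(Q)        # ascending order
    order = np.argsort(eigenvalues)[::-1][:k]            # largest k first
    A = eigenvectors[:, order]                           # (n, k) basis
    return X_centered @ A, eigenvalues[order]
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit re-sorting.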

Finally, the sample features in the mapped sample feature matrix $X'$ form the key feature set used to distinguish malicious software samples.

The training set is divided by a K-fold cross validation method, which is expressed as:

$$D_{train(i)} = D_{train} \setminus D_i, \qquad D_{valid(i)} = D_i$$

wherein $D_{train(i)}$ represents the training subset obtained after the $i$-th data subset $D_i$ is removed from the training set $D_{train}$, and $D_{valid(i)}$ is the validation set corresponding to the $i$-th training subset.
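The division can be illustrated with a short generator. This is a sketch under assumptions: the name `k_fold_split` is hypothetical, the folds are contiguous, and no shuffling is applied (a real pipeline would typically shuffle or stratify first).

```python
def k_fold_split(d_train, k=5):
    """Yield (D_train(i), D_valid(i)) pairs; fold i is held out each time."""
    base, extra = divmod(len(d_train), k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # spread the remainder
        d_valid = d_train[start:start + size]   # D_i, used for validation
        d_rest = d_train[:start] + d_train[start + size:]  # D_train minus D_i
        yield d_rest, d_valid
        start += size
```

Every sample appears in exactly one validation set, which is what later allows each base learner to produce a prediction for every training sample without having seen that sample during training.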

In malicious software detection, a traditional single detection model easily falls into a local optimum during training, so its generalization performance is weak and its detection effect is poor. The invention therefore fuses several models with good classification performance by stacking integration (Stacking) to detect malicious software, making up for some shortcomings of traditional detection models. The main idea of Stacking is to use one learner to integrate the classification results of different learners, with the differences between those learners ensuring diversity. The invention mainly adopts a two-layer learner to classify unknown software data sets, as shown in FIG. 4:

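In the Stacking model, the learners of the first layer are called base learners, and the learner of the second layer is called the meta-learner. The input data set is divided into a training set ($D_{train}$) and a test set ($D_{test}$). The learners of both layers must then be trained and tuned using the training set. If the divided training set were used directly to train the base learners and the meta-learner at the same time, the Stacking model would learn repeatedly from the same training feature data, creating a high risk of overfitting and making the detection of malicious software inaccurate. The training set is therefore divided by K-fold cross validation as above, and the training parameters of the three base learners under the different training subsets, together with the new training subsets they generate, are expressed as:

$$P_{mi} = LM_m(D_{train(i)}), \quad i = 1, 2, \ldots, 5, \quad m = 1, 2, 3$$

$$\tilde{D}_{mi} = P_{mi}(D_{valid(i)})$$

wherein $P_{mi}$ represents the training parameters obtained after the $m$-th base learner is trained on the $i$-th training subset, $LM_m$ represents the $m$-th base learner, and $\tilde{D}_{mi}$ represents the new training subset obtained when the $m$-th base learner, using the training parameters $P_{mi}$, predicts the $i$-th validation set.

The second-layer meta-learner GBDT is then trained on the new training subsets to obtain its training parameters:

$$P' = LM(\tilde{D}_{mi})$$

wherein $P'$ represents the training parameters obtained after the meta-learner is trained on the new training subsets $\tilde{D}_{mi}$, and $LM$ represents the meta-learner.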
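The way the base learners produce the new training subsets can be sketched as follows. Everything named here is an assumption for illustration: `MajorityLearner` and `ThresholdLearner` are deliberately trivial stand-ins for KNN, LR and RF, the interleaved fold assignment (`r % k`) is one possible division, and the meta-learner GBDT is not implemented; it would be trained on the returned `meta_X` together with the original labels.

```python
from collections import Counter


class MajorityLearner:
    """Stand-in learner: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class ThresholdLearner:
    """Stand-in learner: predicts 1 when one feature exceeds its training mean."""

    def __init__(self, index):
        self.index = index

    def fit(self, X, y):
        self.cut = sum(row[self.index] for row in X) / len(X)
        return self

    def predict(self, X):
        return [1 if row[self.index] > self.cut else 0 for row in X]


def out_of_fold_features(make_learners, X, y, k=5):
    """Return (meta_X, fitted).

    meta_X[r][m] is learner m's prediction for row r, made by the copy
    that did not see row r in training; fitted[m][i] plays the role of P_mi.
    """
    n = len(X)
    meta_X = [[None] * len(make_learners) for _ in range(n)]
    fitted = [[] for _ in make_learners]
    for i in range(k):
        valid_idx = [r for r in range(n) if r % k == i]   # D_valid(i)
        train_idx = [r for r in range(n) if r % k != i]   # D_train(i)
        X_tr = [X[r] for r in train_idx]
        y_tr = [y[r] for r in train_idx]
        for m, make in enumerate(make_learners):
            model = make().fit(X_tr, y_tr)
            fitted[m].append(model)
            preds = model.predict([X[r] for r in valid_idx])
            for r, pred in zip(valid_idx, preds):
                meta_X[r][m] = pred
    return meta_X, fitted
```

Because each row of `meta_X` is filled only by models that never trained on that row, the meta-learner does not see predictions that are optimistically biased by repeated learning on the same data, which is the overfitting risk described above.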

Preferably, S9 specifically includes:

each base learner predicts the same test set $D_{test}$ under its different training parameters, and these prediction results together form the data set that is weighted; $\beta_i$ denotes the attention mechanism weight score of each base learner; $q$ denotes the query vector; and attSum() represents the precision-weighted summation function of the attention mechanism, which produces the new test set.
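One way to picture the weighting is the sketch below. It rests on assumptions the source does not confirm: the relevance score of each prediction set is taken to be a validation accuracy, the weights $\beta_i$ are obtained with a softmax over those scores, and the names `softmax` and `att_sum` are hypothetical.

```python
import math


def softmax(scores):
    """Normalise scores into positive weights that sum to 1."""
    peak = max(scores)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def att_sum(predictions, scores):
    """Weighted sum of prediction vectors.

    predictions[i][r] is the prediction for test row r under the i-th set
    of training parameters; scores[i] is its assumed accuracy-based score.
    Returns (weighted predictions per test row, weights beta).
    """
    betas = softmax(scores)
    n_rows = len(predictions[0])
    weighted = [sum(b * p[r] for b, p in zip(betas, predictions))
                for r in range(n_rows)]
    return weighted, betas
```

With equal scores the result reduces to a plain average of the predictions; higher-scoring prediction sets otherwise contribute more to the new test set passed on to the meta-learner.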

Preferably, the new test data set is input into the parameter-adjusted meta-learner GBDT for further detection, and the malicious software is detected, which is expressed as:

$$R = P'(Test)$$

wherein $R$ represents the final detection result, $Test$ represents the new test set obtained by precision-weighted summation of the first-layer base learners, and $P'$ is the training parameters of the meta-learner.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A method for detecting Android malicious software based on a stacking integration technology is characterized by comprising the following steps:

s2: generating a malicious software sample by adopting a mixed sample generation method, and constructing a sample data set with the quantity of the malicious software sample balanced with that of a normal software sample according to the acquired Android software APK data sample with normal and malicious software labels and the generated malicious software sample;

s6: establishing an AM-Stacking malicious software detection model, wherein the AM-Stacking malicious software detection model comprises the following steps: the base learners KNN, LR and RF of the first layer and the meta-learner GBDT of the second layer;

s8: training a second-layer meta-learner GBDT of the AM-Stacking malicious software detection model by using a new training subset generated by the base learner to obtain training parameters of the meta-learner GBDT;

2. The Android malware detection method based on the stacking integration technology of claim 1, wherein the malware samples are generated by the mixed sample generation method, which is expressed as:

$$M_{new} = \{G(z_1), G(z_2), \ldots, G(z_m)\}, \qquad z = M_i + \delta\,(\hat{M}_i - M_i)$$

wherein $M_{new}$ represents the new malware samples, $m$ represents the number of new malware samples, $G$ represents the generator of the generative adversarial network, $z$ (indexed $z_1, \ldots, z_m$) represents a noise vector generated by the borderline synthetic minority oversampling technique, $M_i$ represents a boundary sample, $\hat{M}_i$ represents a K-neighbor sample of $M_i$, and $\delta$ represents a random number from 0 to 1.

3. The Android malware detection method based on stacking integration technology of claim 1, wherein the information entropy gain value of each software feature is calculated as:

$$IG(F_i) = H(S^*) - H(S^* \mid F_i)$$

$$H(S^*) = -\sum_{j=1}^{k} P(s_j)\,\log P(s_j)$$

$$H(S^* \mid F_i) = \sum_{v \in Value(F_i)} P(F_i = v)\, H(S^* \mid F_i = v)$$

wherein $IG(F_i)$ represents the information gain value of the $i$-th feature $F_i$, $H(S^*)$ represents the information entropy of the whole prediction system, $k$ represents the number of classes of the system, $s_j$ represents a possible value of the variable $S^*$, $P(s_j)$ represents the probability of $s_j$, $H(S^* \mid F_i)$ represents the conditional information entropy when the feature $F_i$ alone is used as the classification feature, $Value(F_i)$ represents all possible values of the $i$-th feature $F_i$, and $P(F_i = v)$ denotes the probability that the $i$-th feature takes the value $v$.

4. The Android malware detection method based on stacking integration technology of claim 1, wherein deep dimensionality reduction is performed on the extracted top k features by using a principal component analysis method, and is expressed as:

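5. The Android malware detection method based on stacking integration technology of claim 4, wherein each feature vector is centered, which is expressed as:

$$X_i = x_i - \mu, \qquad \mu = \frac{1}{n}\sum_{j=1}^{n} x_j$$

wherein $X_i$ represents the centered feature vector, $x_i$ represents the $i$-th screened feature vector, $\mu$ represents the mean of the feature vectors before centering, and $n$ represents the number of feature vectors.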

6. The Android malware detection method based on stacking integration technology of claim 1, wherein a training set is divided by a K-fold cross validation method, and the method is represented as follows:

7. The Android malware detection method based on the stack integration technology as claimed in claim 1, wherein training parameters of three base learners under different training sets and generated new training subsets are expressed as:

$$P_{mi} = LM_m(D_{train(i)}), \quad i = 1, 2, \ldots, 5, \quad m = 1, 2, 3$$

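8. The Android malware detection method based on the stacking integration technology as claimed in claim 1, wherein the second-layer meta-learner GBDT is trained to obtain the training parameters of the second-layer meta-learner GBDT, which is expressed as:

$$P' = LM(\tilde{D}_{mi})$$

wherein $P'$ represents the training parameters obtained after the meta-learner is trained on the new training subsets $\tilde{D}_{mi}$ generated by the base learners, and $LM$ represents the meta-learner.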

9. The Android malware detection method based on the stacking integration technology as claimed in claim 1, wherein S9 specifically includes:

wherein $Test_i$ represents the new test set obtained by precision-weighted summation for each base learner; each base learner predicts the same test set $D_{test}$ under its different training parameters; $\beta_i$ denotes the attention mechanism weight score of each base learner; and $q$ denotes the query vector.

10. The Android malware detection method based on the stacking integration technology as claimed in claim 1, wherein a new test data set is input into a meta-learner GBDT after parameter tuning for further detection, and malware is detected and represented as:

$$R = P'(Test)$$