CN115577357A - Android malicious software detection method based on stacking integration technology - Google Patents

Publication number
CN115577357A
Authority
CN
China
Prior art keywords
training
learner
software
feature
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211221244.3A
Other languages
Chinese (zh)
Inventor
刘红
李娟�
陈莉
肖云鹏
李暾
李茜
庞育才
陈南羽
马婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202211221244.3A
Publication of CN115577357A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561 Virus type analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of computer security and relates in particular to an Android malware detection method based on stacking integration technology. The method comprises: obtaining labeled Android APK data samples from mobile terminals; balancing the obtained data samples; extracting features from the balanced data; screening the features and reducing their dimensionality according to the information entropy gain value of each feature; establishing an AM-Stacking malware detection model; and detecting malware using the screened features. The invention fuses several models with good classification performance through stacking integration, divides the data set by K-fold cross validation, and introduces an attention mechanism. Combined with the balanced data samples and the hybrid feature processing method, this allows malware on existing large software platforms and terminals to be detected more accurately and effectively, improving the detection precision.

Description

Android malicious software detection method based on stacking integration technology
Technical Field
The invention belongs to the technical field of computer security, and particularly relates to an Android malicious software detection method based on a stacking integration technology.
Background
The Android system is an operating system, introduced by Google, mainly for smartphones. According to survey data from recent years, Android holds over 85% of the market and has gradually spread into other manufacturing fields. However, because Android is open source, a large amount of malware has exploited this openness, causing a series of threats, mainly privacy disclosure, data theft, and network spyware, and inflicting significant economic losses on users. Malware detection has therefore become an urgent and important problem and a very active topic in the field of computer security.
Existing malware detection methods fall mainly into two categories: static detection and dynamic detection. Static detection identifies malware before an application is executed, while dynamic detection performs the detection task at runtime. Because dynamic detection has high time complexity and cost and has difficulty detecting multipath malware, most researchers propose static detection methods.
In recent years, traditional Android malware detection models have mainly used features extracted from the configuration files and code files of the software as training samples. The extracted malware features are usually selected through manual intervention, so a large amount of the original feature information is lost, the detection effect of the model is poor, and its detection precision cannot be improved further. In addition, traditional malware detection usually relies on a single model, so the robustness and accuracy of the detection model leave room for improvement and its generalization capability is weak.
The problems of the prior art are:
1. In the field of malware detection, malware samples are relatively scarce, so the proportion of normal software samples to malware samples is imbalanced and the detection performance of the model falls short of expectations;
2. Some malware has many features, and feature redundancy can occur, which makes the model computationally complex. Efficiently screening malware features and reducing the dimensionality of the screened features is therefore very important for malware detection;
3. In malware detection, a single detection model easily falls into a local optimum during training, so its generalization is weak and its detection effect is poor. How to combine the advantages of multiple single models is a problem worth studying.
Disclosure of Invention
To solve the above technical problems, the invention provides an Android malware detection method based on stacking integration technology, comprising the following steps:
S1: acquiring Android APK data samples labeled as normal or malicious software from a mobile terminal;
S2: generating malware samples with a mixed sample generation method, and constructing, from the acquired labeled Android APK data samples and the generated malware samples, a sample data set in which the numbers of malware samples and normal software samples are balanced;
S3: analyzing each software sample in the sample data set to obtain the software features of each software sample;
S4: calculating the information entropy gain value of each software feature, judging the contribution of each software feature to malware detection from the magnitude of that value, and extracting the top k features with the largest contribution;
S5: performing deep dimensionality reduction on the extracted top k features by principal component analysis to obtain a key feature set, and dividing the key feature set into a training set and a test set;
S6: establishing an AM-Stacking malware detection model comprising the first-layer base learners KNN, LR, and RF and the second-layer meta-learner GBDT;
S7: dividing the training set by K-fold cross validation to obtain a plurality of training sets, and training the first-layer base learners KNN, LR, and RF of the AM-Stacking malware detection model on each of the divided training sets, to obtain the training parameters of the three base learners under the different training sets and the new training subsets they generate;
S8: training the second-layer meta-learner GBDT of the AM-Stacking malware detection model with the new training subsets generated by the base learners, to obtain the training parameters of the meta-learner GBDT;
S9: detecting the test data set with the training parameters of the base learners under the different training sets, and weighting and integrating the detection results with an attention mechanism to obtain a new test data set;
S10: tuning the meta-learner GBDT according to its obtained training parameters, and inputting the new test data set into the tuned meta-learner GBDT for further detection, thereby detecting malware.
Preferably, the malware samples are generated by the mixed sample generation method, expressed as:
Figure BDA0003878285440000031
where $M_{new}$ denotes the new malware samples, $m$ denotes the number of new malware samples, $G$ denotes the generative adversarial network, and $z$ denotes the noise vector generated by the borderline synthetic minority over-sampling technique,
$$z = M_i + \delta \cdot (\hat{M}_i - M_i)$$
where $M_i$ denotes a boundary sample, $\hat{M}_i$ denotes a K-nearest-neighbor sample of $M_i$, and $\delta$ denotes a random number from 0 to 1.
Preferably, the information entropy gain value of each software feature is calculated as:
$$IG(F_i) = H(S^*) - H(S^* \mid F_i)$$
where $IG(F_i)$ denotes the information gain value of the $i$-th feature $F_i$, and $H(S^*)$ denotes the information entropy of the whole prediction system,
$$H(S^*) = -\sum_{j=1}^{k} P(S^*_j) \log P(S^*_j)$$
$k$ denotes the number of classes in the system, $S^*_j$ denotes a possible value of the variable $S^*$, $P(S^*_j)$ denotes the probability of $S^*_j$, and $H(S^* \mid F_i)$ denotes the conditional information entropy when each feature is used alone as the classification feature,
$$H(S^* \mid F_i) = \sum_{v \in Value(F_i)} P(F_i = v)\, H(S^* \mid F_i = v)$$
$Value(F_i)$ denotes all possible values of the $i$-th feature $F_i$, and $P(F_i = v)$ denotes the probability that the $i$-th feature takes the value $v$.
Preferably, the deep dimensionality reduction of the extracted top k features by principal component analysis comprises:
converting the extracted top k features into vector form; centering each feature vector; constructing a covariance matrix from the centered feature vectors; performing eigenvalue decomposition on the covariance matrix; sorting the eigenvalues in descending order; taking the eigenvectors corresponding to the first k eigenvalues; using these k eigenvectors to map the features into a k-dimensional sample feature matrix; and forming the key feature set from the sample features in that matrix.
Preferably, each feature vector is centered as:
$$X_i = \tilde{X}_i - \mu$$
where $X_i$ denotes the centered feature vector, $\tilde{X}_i$ denotes the $i$-th feature vector screened by comparing the information entropy gain values of the software features, and $\mu$ denotes the mean of the feature vectors before centering,
$$\mu = \frac{1}{n} \sum_{i=1}^{n} \tilde{X}_i$$
$n$ denotes the number of feature vectors, and $\tilde{X}_i$ is the vector composed of all values that the $i$-th feature can take.
Preferably, the training set is divided by K-fold cross validation as:
$$D_{train(i)} = D_{train} \setminus D_i, \qquad D_{valid(i)} = D_i$$
where $D_{train(i)}$ denotes the training subset obtained by removing the $i$-th data subset $D_i$ from the training set $D_{train}$, and $D_{valid(i)}$ is the validation set corresponding to the $i$-th training subset.
Preferably, the training parameters of the three base learners under the different training sets and the generated new training subsets are expressed as:
$$P_{mi} = LM_m(D_{train(i)}), \quad i = 1, 2, \ldots, 5, \quad m = 1, 2, 3$$
$$\tilde{D}_{mi} = P_{mi}(D_{valid(i)})$$
where $P_{mi}$ denotes the training parameters obtained after the $m$-th base learner is trained on the $i$-th training subset, $LM_m$ denotes the $m$-th base learner, and $\tilde{D}_{mi}$ denotes the new training subset obtained when the $m$-th base learner, with training parameters $P_{mi}$, predicts on the $i$-th validation set.
Preferably, the second-layer meta-learner GBDT is trained to obtain the training parameters of the meta-learner GBDT, expressed as:
$$P' = LM(\tilde{D})$$
where $P'$ denotes the training parameters obtained after the meta-learner is trained on the new training subsets $\tilde{D}$ generated by the base learners, and $LM$ denotes the meta-learner.
Preferably, S9 specifically comprises:
Figure BDA00038782854400000410
where $Test_i$ denotes the new test set obtained by precision-weighted summation over the base learners,
Figure BDA0003878285440000051
denotes the result predicted by each base learner for the same test set $D_{test}$ under different training parameters, $\beta_i$ denotes the attention weight score of each base learner, $q$ denotes the query vector,
Figure BDA0003878285440000052
denotes the set of results obtained when each base learner detects the test set $D_{test}$ under different training parameters, and $attSum()$ denotes the precision-weighted summation function of the attention mechanism.
Preferably, the new test data set is input into the tuned meta-learner GBDT for further detection, and malware is detected, expressed as:
$$R = P'(Test)$$
where $R$ denotes the final detection result, $Test$ denotes the new test set obtained by precision-weighted summation over the first-layer base learners, and $P'$ denotes the training parameters of the meta-learner.
The beneficial effects of the invention are as follows: the invention acquires data, balances the malware samples and normal software samples in the acquired data, extracts features from the balanced data, and screens the features and reduces their dimensionality according to the information entropy gain value of each feature. By fusing several models with good classification performance through stacking integration, dividing the data set by K-fold cross validation, introducing an attention mechanism, and combining the balanced data samples with the hybrid feature processing method IG-PCA, the presence of malware can be detected more accurately.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a schematic diagram of balancing the data samples with BS-GAN according to the present invention;
FIG. 3 is a schematic diagram of extracting key features with IG-PCA according to the present invention;
FIG. 4 is a schematic diagram of the stacking model of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
An Android malware detection method based on stacking integration technology, as shown in FIG. 1, comprises:
S1: acquiring Android APK data samples labeled as normal or malicious software from a mobile terminal;
S2: generating malware samples with a mixed sample generation method, and constructing, from the acquired labeled Android APK data samples and the generated malware samples, a sample data set in which the numbers of malware samples and normal software samples are balanced;
S3: analyzing each software sample in the sample data set to obtain the software features of each software sample;
S4: calculating the information entropy gain value of each software feature, judging the contribution of each software feature to malware detection from the magnitude of that value, and extracting the top k features with the largest contribution;
S5: performing deep dimensionality reduction on the extracted top k features by principal component analysis to obtain a key feature set, and dividing the key feature set into a training set and a test set;
S6: establishing an AM-Stacking malware detection model comprising the first-layer base learners KNN, LR, and RF and the second-layer meta-learner GBDT;
S7: dividing the training set by K-fold cross validation to obtain a plurality of training sets, and training the first-layer base learners KNN, LR, and RF of the AM-Stacking malware detection model on each of the divided training sets, to obtain the training parameters of the three base learners under the different training sets and the new training subsets they generate;
S8: training the second-layer meta-learner GBDT of the AM-Stacking malware detection model with the new training subsets generated by the base learners, to obtain the training parameters of the meta-learner GBDT;
S9: detecting the test data set with the training parameters of the base learners under the different training sets, and weighting and integrating the detection results with an attention mechanism to obtain a new test data set;
S10: tuning the meta-learner GBDT according to its obtained training parameters, and inputting the new test data set into the tuned meta-learner GBDT for further detection, thereby detecting malware.
The data may be acquired from a public data website or by directly querying publicly available network security malware data sets; labeled Android APK data samples from mobile terminals must be obtained.
In malware detection, the numbers of normal software samples and malware samples often differ greatly, so the obtained data samples are imbalanced and the detection effect of the model is poor. To account for the correlation among samples and their distribution, a BS-GAN mixed sample generation method is proposed, which combines the borderline synthetic minority over-sampling technique (Borderline-SMOTE) with a generative adversarial network (GAN). As shown in FIG. 2, Borderline-SMOTE is used to generate noise vectors, which are input into the GAN to obtain new malicious samples and relieve the imbalance of the data samples.
First, the invention generates noise vectors by the borderline synthetic minority over-sampling technique. Assume the entire sample set is $S$, the malware sample set is $M$, and the normal software sample set is $T$, with $M = \{M_1, M_2, \ldots, M_i, \ldots\}$ and $T = \{T_1, T_2, \ldots, T_i, \ldots\}$.
For each malicious sample $M_i$, the K samples nearest to it are found by the K-nearest-neighbor method, using the distance:
$$dist(M_i, Y_i) = \sqrt{\sum_{j} (M_{ij} - Y_{ij})^2}$$
where $Y_i$ and $M_i$ are two samples in the space, $M_{ij}$ and $Y_{ij}$ are their $j$-th components, and $dist(M_i, Y_i)$ denotes the Euclidean distance between the two samples.
Suppose K′ of the K neighbors are normal samples; clearly 0 ≤ K′ ≤ K. If K/2 ≤ K′ ≤ K, the malicious sample is called a boundary sample (Danger). Because boundary samples are more easily misclassified, the algorithm synthesizes new samples only from randomly selected boundary samples, which gives the noise vector:
$$z = M_i + \delta \cdot (\hat{M}_i - M_i)$$
where $M_i$ is the selected boundary sample, $\hat{M}_i$ is a K-nearest-neighbor sample of $M_i$, and $\delta$ is a random number from 0 to 1.
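The boundary-sample test and the interpolation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `nearest` and `borderline_noise` are hypothetical, samples are plain lists of floats, and the neighbor $\hat{M}_i$ is assumed to be drawn from the nearest malicious samples, which the text does not state explicitly.

```python
import math
import random

def nearest(point, pool, k):
    """Return the k (vector, label) pairs in pool closest to point (Euclidean)."""
    return sorted(pool, key=lambda item: math.dist(point, item[0]))[:k]

def borderline_noise(malicious, normal, k=3, count=4, seed=0):
    """Return (indices of boundary samples, synthesized noise vectors z)."""
    rng = random.Random(seed)
    pool = [(m, 1) for m in malicious] + [(t, 0) for t in normal]
    danger = []
    for i, m in enumerate(malicious):
        others = pool[:i] + pool[i + 1:]            # every sample except M_i
        k_normal = sum(1 for _, y in nearest(m, others, k) if y == 0)  # K'
        if k / 2 <= k_normal <= k:                  # boundary condition from the text
            danger.append(i)
    noise = []
    for _ in range(count if danger else 0):
        i = rng.choice(danger)
        m = malicious[i]
        peers = [(x, 1) for j, x in enumerate(malicious) if j != i]
        # Assumption: the neighbour is one of the nearest malicious samples.
        neighbour = rng.choice(nearest(m, peers, k))[0]
        delta = rng.random()                        # random number in [0, 1)
        # z = M_i + delta * (neighbour - M_i), component by component
        noise.append([a + delta * (b - a) for a, b in zip(m, neighbour)])
    return danger, noise
```

Each returned vector lies on the segment between a boundary sample and one of its neighbors, so the synthetic points stay close to the region where malicious and normal samples meet. The sketch expects at least two malicious samples.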
Then, the invention inputs the generated noise vector into the generative adversarial network (GAN), and the generator $G$ produces the new malicious samples $M_{new}$:
Figure BDA0003878285440000074
where $M_{new}$ denotes the new malware samples, $m$ denotes the number of new malware samples, $G$ denotes the generator of the generative adversarial network, and $z$ denotes the noise vector generated by the borderline synthetic minority over-sampling technique.
Manually and independently selected basic features are known to suffer from insufficient expressive power, while directly using all features introduces feature redundancy. To screen out key features more effectively, the invention introduces the hybrid feature processing method IG-PCA, which learns the interrelations among multiple features and the contribution of each feature to malware detection. As shown in FIG. 3, the key features are selected by screening the features of the software samples from several aspects.
The information entropy gain value is a statistic that directly shows how well an attribute or feature distinguishes a class of data samples. To calculate the information entropy gain value of a software feature, the overall information content of the data samples is calculated first, then the information content that remains once the value of the feature is known, and the difference between the two is the information entropy gain value of the feature. The larger the information entropy gain value, the greater the contribution of the feature to classifying the data samples, so the contribution of each software feature to malware detection can be judged from the magnitude of its information entropy gain value.
First, the invention calculates the information gain value of each feature from the information entropy and extracts the top K features with the largest information gain. Let the balanced data samples be $S^*$, and collect all features of $S^*$ into the overall feature set $F = \{F_1, F_2, \ldots, F_i, \ldots\}$. The information gain of each feature is then calculated from the information entropy as follows:
The first step: calculate the information entropy $H(S^*)$ of the whole prediction system from the information entropy formula:
$$H(S^*) = -\sum_{j=1}^{k} P(S^*_j) \log P(S^*_j)$$
where $k$ denotes the number of classes in the system, $S^*_j$ denotes a possible value of the variable $S^*$, and $P(S^*_j)$ denotes the probability of $S^*_j$.
The second step: calculate the conditional information entropy when each feature is used alone as the classification feature:
$$H(S^* \mid F_i) = \sum_{v \in Value(F_i)} P(F_i = v)\, H(S^* \mid F_i = v)$$
where $F_i$ denotes the $i$-th feature, $Value(F_i)$ denotes all possible values of the $i$-th feature, and $P(F_i = v)$ denotes the probability that the $i$-th feature takes the value $v$.
The third step: calculate the information gain value of each feature and select the main features according to the magnitude of the gain. The information gain of a feature is defined as:
$$IG(F_i) = H(S^*) - H(S^* \mid F_i)$$
where $IG(F_i)$ denotes the information gain value of the $i$-th feature $F_i$, $H(S^*)$ denotes the information entropy of the whole prediction system, $k$ denotes the number of classes in the system, $S^*_j$ denotes a possible value of the variable $S^*$, $P(S^*_j)$ denotes the probability of $S^*_j$, $H(S^* \mid F_i)$ denotes the conditional information entropy when each feature is used alone as the classification feature, $Value(F_i)$ denotes all possible values of the $i$-th feature $F_i$, and $P(F_i = v)$ denotes the probability that the $i$-th feature takes a given value $v$ in $Value(F_i)$.
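The three steps above can be sketched as follows. The sketch assumes discrete feature values, a base-2 logarithm, and features supplied as a dictionary of columns; the helper names and the feature names in the usage note are hypothetical and are not taken from the source.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S*): entropy of the class labels, in bits (base 2 is an assumption)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(F_i) = H(S*) - H(S* | F_i) for one discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value, count in Counter(feature_values).items():
        # Labels of the samples for which the feature takes this value
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += (count / total) * entropy(subset)
    return entropy(labels) - conditional

def top_k_features(columns, labels, k):
    """Rank features by information gain and keep the k largest."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]
```

For example, with labels `[1, 1, 0, 0]`, a hypothetical feature `uses_sms = [1, 1, 0, 0]` separates the classes perfectly and has a gain of 1 bit, while `uses_net = [1, 0, 1, 0]` carries no information and has a gain of 0, so `top_k_features` with `k=1` keeps `uses_sms`.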
In practice, the main features extracted by the information entropy method still have very high dimensionality, which makes the detection complexity of the model excessive. The extracted main feature set is therefore further reduced in dimensionality by principal component analysis.
The deep dimensionality reduction of the extracted top k features by principal component analysis comprises:
converting the extracted top k features into vector form; centering each feature vector; constructing a covariance matrix from the centered feature vectors; performing eigenvalue decomposition on the covariance matrix; sorting the eigenvalues in descending order; taking the eigenvectors corresponding to the first k eigenvalues; using these k eigenvectors to map the features into a k-dimensional sample feature matrix; and forming the key feature set from the sample features in that matrix.
To eliminate the influence of differing measurement scales on the covariance, each feature vector must be centered:
$$X_i = \tilde{X}_i - \mu$$
where $X_i$ denotes the centered feature vector, $\tilde{X}_i$ denotes the $i$-th feature vector screened by comparing the information entropy gain values of the software features, and $\mu$ denotes the mean of the feature vectors before centering,
$$\mu = \frac{1}{n} \sum_{i=1}^{n} \tilde{X}_i$$
$n$ denotes the number of feature vectors, and $\tilde{X}_i$ is the vector composed of all values that the $i$-th feature can take.
From the centered data samples, the covariance matrix $Q$ of the feature samples is obtained as:
$$Q = \frac{1}{n} X X^{T}$$
where $n$ denotes the number of feature vectors, $X$ denotes the centered feature vector matrix, and $X^{T}$ denotes the transpose of $X$.
Then eigenvalue decomposition is performed on the covariance matrix $Q$, the eigenvalues are sorted in descending order, and the eigenvectors corresponding to the first k eigenvalues, $A = (\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_k)$, are taken. The n-dimensional feature samples are thereby mapped to k-dimensional feature samples, and the mapping that yields the k-dimensional sample feature matrix $X'$ can be expressed as:
$$X' = A^{T} X$$
where $X'$ denotes the mapped k-dimensional sample feature matrix, $A^{T}$ denotes the transpose of the matrix of eigenvectors corresponding to the first k eigenvalues, and $X$ denotes the centered feature vector matrix.
Finally, the key feature set for distinguishing malware samples is obtained as:
Figure BDA0003878285440000101
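The centering, covariance, and projection steps can be illustrated with a small pure-Python sketch. Several simplifications are assumptions rather than the patent's method: each row holds one feature across all samples and is centered on its own mean (the conventional choice), the covariance is divided by the number of samples (the constant does not change the eigenvectors), and only the leading eigenvector is found, by power iteration. A full top-k decomposition would normally use a linear algebra library.

```python
import math

def center(rows):
    """Subtract each feature's mean; rows[i] holds feature i across all samples."""
    return [[v - sum(r) / len(r) for v in r] for r in rows]

def covariance(rows):
    """Q = X X^T / N for centered feature rows (N = number of samples)."""
    n = len(rows[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / n for r2 in rows] for r1 in rows]

def leading_eigenvector(q, iterations=200):
    """Power iteration: approximates the eigenvector of the largest eigenvalue."""
    v = [1.0 / (i + 1) for i in range(len(q))]      # arbitrary non-zero start
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in q]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def project(rows, alpha):
    """x' = alpha^T x for every sample (one output value per sample)."""
    return [sum(a * r[s] for a, r in zip(alpha, rows)) for s in range(len(rows[0]))]
```

With two perfectly correlated features, such as `[[1, 2, 3], [2, 4, 6]]`, the leading eigenvector points along the direction (1, 2), and projecting onto it keeps all of the variance in a single dimension.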
The training set is divided by K-fold cross validation as:
$$D_{train(i)} = D_{train} \setminus D_i, \qquad D_{valid(i)} = D_i$$
where $D_{train(i)}$ denotes the training subset obtained by removing the $i$-th data subset $D_i$ from the training set $D_{train}$, and $D_{valid(i)}$ is the validation set corresponding to the $i$-th training subset.
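A minimal sketch of this division follows, assuming K = 5 (the index range used in the source formulas) and a simple interleaved assignment of samples to folds. The function name `k_fold_splits` is hypothetical, and shuffling or stratification by label is omitted.

```python
def k_fold_splits(samples, k=5):
    """Return a list of (D_train(i), D_valid(i)) pairs, one per fold."""
    folds = [samples[i::k] for i in range(k)]       # D_1 ... D_k
    splits = []
    for i in range(k):
        # D_train(i): the training set with the i-th subset removed
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((train, folds[i]))            # D_valid(i) = D_i
    return splits
```

Every sample appears in exactly one validation set, so each base learner can later produce a prediction for every training sample from a model that never saw that sample.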
In malware detection, a traditional single detection model easily falls into a local optimum during training, so its generalization is weak and its detection effect is poor. The invention therefore takes a model-integration approach, fusing several models with good classification performance through stacking integration (Stacking) to detect malware and compensate for some shortcomings of traditional detection models. The main idea of Stacking is to use one learner to integrate the classification results of different learners, with the differences among the learners ensuring diversity. The invention mainly uses a two-layer learner to classify unknown software data sets, as shown in FIG. 4:
In the Stacking model, the first-layer learners are called base learners and the second-layer learner is called the meta-learner. The input data set is divided into a training set ($D_{train}$) and a test set ($D_{test}$). The learners of both layers must then be trained and tuned with the training data set. If the divided training set were used directly to train the base learners and the meta-learner at the same time, the Stacking model would learn the same training feature data repeatedly, creating a high risk of overfitting and making malware detection inaccurate. The training set is therefore divided by K-fold cross validation as described above.
Preferably, the training parameters of the three base learners under different training sets and the generated new training subsets are expressed as:
P mi =LM m (D train(i) ),(i=1,2...,5),(m=1,2,3)
Figure BDA0003878285440000111
wherein $P_{mi}$ represents the training parameters obtained after the m-th base learner is trained on the i-th training subset, $LM_m$ represents the m-th base learner, and
Figure BDA0003878285440000112
represents the new training subset obtained when the m-th base learner, using the training parameters $P_{mi}$, predicts on the i-th validation set.
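One way to picture how a single base learner produces its new training subset is the out-of-fold loop below. It is a sketch under stated assumptions: `CentroidLearner` is a hypothetical stand-in for KNN, LR or RF, the one-dimensional toy data is invented for illustration, and folds are assigned by index modulo k so that both classes stay present in every training subset.

```python
class CentroidLearner:
    """Hypothetical stand-in base learner: nearest class centroid on 1-D inputs."""

    def fit(self, xs, ys):
        self.centroids = {}
        for label in set(ys):
            members = [x for x, y in zip(xs, ys) if y == label]
            self.centroids[label] = sum(members) / len(members)
        return self

    def predict(self, xs):
        return [min(self.centroids, key=lambda c: abs(x - self.centroids[c])) for x in xs]


def out_of_fold_predictions(make_learner, xs, ys, k=5):
    """Train one model per fold and predict only on that fold's held-out samples."""
    n = len(xs)
    oof = [None] * n          # becomes this learner's column of the new training subset
    fitted = []               # one fitted model per fold, analogous to P_m1 ... P_m5
    for i in range(k):
        valid_idx = [j for j in range(n) if j % k == i]
        train_idx = [j for j in range(n) if j % k != i]
        model = make_learner().fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        fitted.append(model)
        for j, pred in zip(valid_idx, model.predict([xs[j] for j in valid_idx])):
            oof[j] = pred
    return oof, fitted


# Toy data: label 1 (malicious) has large feature values, label 0 (normal) small ones.
xs = [0.0, 1.0, 0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.1, 0.9]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
oof, fitted = out_of_fold_predictions(CentroidLearner, xs, ys, k=5)
print(oof)
```

The fitted fold models are kept because they are reused later to predict on the test set.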
Preferably, the second-layer meta-learner GBDT is trained to obtain training parameters of the meta-learner GBDT, which are expressed as:
Figure BDA0003878285440000113
wherein $P'$ represents the training parameters obtained after the meta-learner is trained on the new training subset
Figure BDA0003878285440000114
and $LM$ represents the meta-learner.
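The second-layer training step can be illustrated as follows. This is only a sketch: `VoteTableMetaLearner` is a hypothetical, deliberately simple stand-in for GBDT, and the three out-of-fold prediction columns are invented toy values rather than results from the patent.

```python
from collections import Counter, defaultdict


class VoteTableMetaLearner:
    """Hypothetical stand-in for GBDT: majority label per pattern of base predictions."""

    def fit(self, meta_rows, labels):
        table = defaultdict(Counter)
        for row, label in zip(meta_rows, labels):
            table[tuple(row)][label] += 1
        self.table = {pattern: counts.most_common(1)[0][0] for pattern, counts in table.items()}
        # Fallback for patterns never seen during training.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, meta_rows):
        return [self.table.get(tuple(row), self.default) for row in meta_rows]


# Toy out-of-fold predictions of three base learners on the same seven training samples.
oof_knn = [1, 0, 1, 0, 1, 0, 0]
oof_lr = [1, 0, 0, 0, 1, 1, 0]
oof_rf = [1, 1, 1, 0, 0, 0, 0]
labels = [1, 0, 1, 0, 1, 0, 0]

# Each row of the new training subset holds one prediction per base learner.
meta_rows = list(zip(oof_knn, oof_lr, oof_rf))
meta = VoteTableMetaLearner().fit(meta_rows, labels)
preds = meta.predict([(1, 1, 1), (0, 0, 0), (0, 1, 1)])
print(preds)
```

The point of the sketch is the data flow: the meta-learner sees only base-learner outputs, never the original features.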
Preferably, step S9 specifically comprises:
Figure BDA0003878285440000115
wherein $Test_i$ represents the new test set obtained by precision-weighted summation for each base learner,
Figure BDA0003878285440000116
represents the result predicted by each base learner on the same test set $D_{test}$ under different training parameters, $\beta_i$ denotes the attention weight score of each base learner, $q$ denotes the query vector,
Figure BDA0003878285440000117
represents the set of results obtained when each base learner predicts on the test set $D_{test}$ under different training parameters, and attSum() represents the precision-weighted summation function of the attention mechanism.
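The weighted integration can be sketched as below. The exact scoring function of the attention mechanism is given in the equation images and is not reproduced here, so this example assumes that each prediction vector has a scalar score (for instance a validation accuracy) and that the weights are obtained by a softmax over those scores. The names `softmax` and `att_sum` are illustrative.

```python
import math


def softmax(scores):
    """Normalise raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def att_sum(predictions, scores):
    """Weighted sum of several prediction vectors for the same test set."""
    weights = softmax(scores)
    n = len(predictions[0])
    return [sum(w * preds[j] for w, preds in zip(weights, predictions)) for j in range(n)]


# Three prediction vectors for a two-sample test set, with assumed scores.
fold_predictions = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
assumed_scores = [0.9, 0.9, 0.9]
combined = att_sum(fold_predictions, assumed_scores)
print(combined)
```

With equal scores the result reduces to a plain average; a higher score shifts the combined value toward that learner's predictions.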
Preferably, the new test data set is input into the tuned meta-learner GBDT for further detection, and the malware detection result is expressed as:
$R = P'(Test)$
wherein $R$ represents the final detection result, $Test$ represents the new test set obtained by the first-layer base learners through precision-weighted summation, and $P'$ represents the training parameters of the meta-learner.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A method for detecting Android malicious software based on a stacking integration technology is characterized by comprising the following steps:
S1: acquiring Android software APK data samples labelled as normal or malicious from a mobile terminal;
S2: generating malware samples by a mixed sample generation method, and constructing, from the acquired labelled Android software APK data samples and the generated malware samples, a sample data set in which the number of malware samples is balanced with the number of normal software samples;
S3: parsing each software sample in the sample data set to obtain the software features of each software sample;
S4: calculating an information entropy gain value for each software feature, judging the contribution of each software feature to malware detection according to the magnitude of its information entropy gain value, and extracting the top k features with the greatest contribution;
S5: performing deep dimensionality reduction on the extracted top k features by principal component analysis to obtain a key feature set, and dividing the key feature set into a training set and a test set;
S6: establishing an AM-Stacking malware detection model, wherein the AM-Stacking malware detection model comprises first-layer base learners KNN, LR and RF and a second-layer meta-learner GBDT;
S7: dividing the training set by K-fold cross validation to obtain a plurality of training subsets, and training the first-layer base learners KNN, LR and RF of the AM-Stacking malware detection model on the divided training subsets to obtain the training parameters of the three base learners under the different training subsets and the generated new training subsets;
S8: training the second-layer meta-learner GBDT of the AM-Stacking malware detection model on the new training subsets generated by the base learners to obtain the training parameters of the meta-learner GBDT;
S9: detecting the test data set with the training parameters of each base learner under the different training subsets, and using an attention mechanism to weight and integrate the detection results to obtain a new test data set;
S10: tuning the meta-learner GBDT according to the obtained training parameters of the meta-learner GBDT, and inputting the new test data set into the tuned meta-learner GBDT for further detection, thereby detecting malware.
2. The Android malware detection method based on the stacking integration technology of claim 1, wherein the malware sample is generated by a mixed sample generation method, and is represented as:
Figure FDA0003878285430000021
wherein $M_{new}$ represents the new malware samples, $m$ represents the number of new malware samples, $G$ represents the generative adversarial network, and $z$ represents the noise vector generated by the borderline synthetic minority over-sampling technique (Borderline-SMOTE),
Figure FDA0003878285430000022
$M_i$ represents a boundary sample,
Figure FDA0003878285430000023
represents $M_i$, and $\delta$ represents a random number between 0 and 1.
3. The Android malware detection method based on stacking integration technology of claim 1, wherein an information entropy gain value of each software feature is calculated as:
$IG(F_i) = H(S^*) - H(S^* \mid F_i)$
wherein $IG(F_i)$ represents the information gain value of the i-th feature $F_i$, $H(S^*)$ represents the information entropy of the entire prediction system,
Figure FDA0003878285430000024
$k$ represents the number of classes in the system,
Figure FDA0003878285430000025
represents a possible value of the variable $S^*$,
Figure FDA0003878285430000026
represents the probability of the variable
Figure FDA0003878285430000027
$H(S^* \mid F_i)$ represents the conditional information entropy when each feature alone is used as the classification feature,
Figure FDA0003878285430000028
represents all possible values of the i-th feature $F_i$, and $P(Value(F_i))$ denotes the probability that the i-th feature takes a given value.
4. The Android malware detection method based on stacking integration technology of claim 1, wherein deep dimensionality reduction is performed on the extracted top k features by using a principal component analysis method, and is expressed as:
converting the extracted top k features into feature vectors, centering each feature vector, constructing a covariance matrix from the centered feature vectors, performing eigenvalue decomposition on the covariance matrix, sorting the eigenvalues in descending order, taking the eigenvectors corresponding to the first k eigenvalues, mapping the selected k eigenvectors into a k-dimensional sample feature matrix, and forming the key feature set from the sample features in the sample feature matrix.
5. The Android malware detection method based on stacking integration technology of claim 4, wherein each feature vector is centered, which is represented as:
Figure FDA0003878285430000029
wherein $X_i$ represents the centered feature vector,
Figure FDA0003878285430000031
represents the i-th feature vector selected by comparing the information entropy gain values of the software features, $\mu$ represents the mean of the feature vectors being centered,
Figure FDA0003878285430000032
$n$ represents the number of feature vectors, and
Figure FDA0003878285430000033
represents the vector composed of all values that the i-th feature can take.
6. The Android malware detection method based on stacking integration technology of claim 1, wherein a training set is divided by a K-fold cross validation method, and the method is represented as follows:
Figure FDA0003878285430000034
wherein $D_{train(i)}$ represents the training subset obtained by removing the i-th data subset $D_i$ from the training set $D_{train}$, and $D_{valid(i)}$ is the validation set corresponding to the i-th data subset $D_i$.
7. The Android malware detection method based on stacking integration technology of claim 1, wherein the training parameters of the three base learners under different training sets and the generated new training subsets are expressed as:
$P_{mi} = LM_m(D_{train(i)}),\quad i = 1, 2, \ldots, 5,\quad m = 1, 2, 3$
Figure FDA0003878285430000035
wherein $P_{mi}$ represents the training parameters obtained after the m-th base learner is trained on the i-th training subset, $LM_m$ represents the m-th base learner, and
Figure FDA0003878285430000036
represents the new training subset obtained when the m-th base learner, using the training parameters $P_{mi}$, predicts on the i-th validation set.
8. The Android malware detection method based on stacking integration technology of claim 1, wherein the second-layer meta-learner GBDT is trained to obtain the training parameters of the meta-learner GBDT, which are expressed as:
Figure FDA0003878285430000037
wherein $P'$ represents the training parameters obtained after the meta-learner is trained on the new training subset
Figure FDA0003878285430000038
and $LM$ represents the meta-learner.
9. The Android malware detection method based on stacking integration technology of claim 1, wherein step S9 specifically comprises:
Figure FDA0003878285430000041
wherein $Test_i$ represents the new test set obtained by precision-weighted summation for each base learner,
Figure FDA0003878285430000042
represents the result predicted by each base learner on the same test set $D_{test}$ under different training parameters, $\beta_i$ denotes the attention weight score of each base learner, $q$ denotes the query vector,
Figure FDA0003878285430000043
represents the set of results obtained when each base learner predicts on the test set $D_{test}$ under different training parameters, and attSum() represents the precision-weighted summation function of the attention mechanism.
10. The Android malware detection method based on stacking integration technology of claim 1, wherein the new test data set is input into the tuned meta-learner GBDT for further detection, and the malware detection result is expressed as:
$R = P'(Test)$
wherein $R$ represents the final detection result, $Test$ represents the new test set obtained by the first-layer base learners through precision-weighted summation, and $P'$ represents the training parameters of the meta-learner.
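As an illustration of the feature ranking recited in claim 3, the information gain $IG(F_i) = H(S^*) - H(S^* \mid F_i)$ can be computed for discrete features as sketched below. The function names and the toy feature names (`perm_a`, `perm_b`) are hypothetical; only the formula itself comes from the claim.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(S) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(F) = H(S) - H(S | F) for one discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == value]
        # Weight each subset's entropy by the probability of this feature value.
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


# Toy data: 1 = malicious, 0 = normal; feature names are hypothetical.
labels = [1, 1, 0, 0]
features = {
    "perm_a": [1, 1, 0, 0],  # separates the classes perfectly
    "perm_b": [1, 0, 1, 0],  # carries no information about the label
}
gains = {name: information_gain(values, labels) for name, values in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked, gains)
```

Keeping the first k entries of `ranked` corresponds to extracting the k features with the greatest contribution.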

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211221244.3A CN115577357A (en) 2022-10-08 2022-10-08 Android malicious software detection method based on stacking integration technology

Publications (1)

Publication Number Publication Date
CN115577357A true CN115577357A (en) 2023-01-06

Family

ID=84582822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211221244.3A Pending CN115577357A (en) 2022-10-08 2022-10-08 Android malicious software detection method based on stacking integration technology

Country Status (1)

Country Link
CN (1) CN115577357A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304932A (en) * 2023-05-19 2023-06-23 湖南工商大学 Sample generation method, device, terminal equipment and medium
CN116304932B (en) * 2023-05-19 2023-09-05 湖南工商大学 Sample generation method, device, terminal equipment and medium
CN117577214A (en) * 2023-05-19 2024-02-20 广东工业大学 Compound blood brain barrier permeability prediction method based on stack learning algorithm
CN117577214B (en) * 2023-05-19 2024-04-12 广东工业大学 Compound blood brain barrier permeability prediction method based on stack learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination