CN113901463B

CN113901463B - Concept drift-oriented interpretable Android malware detection method

Info

Publication number: CN113901463B
Application number: CN202111033119.5A
Authority: CN
Inventors: 张炳; 文峥; 高原; 赵旭阳; 任家东
Original assignee: Yanshan University
Current assignee: Yanshan University
Priority date: 2021-09-03
Filing date: 2021-09-03
Publication date: 2023-06-30
Anticipated expiration: 2041-09-03
Also published as: CN113901463A

Abstract

The invention discloses a concept drift-oriented interpretable Android malware detection method, belonging to the technical field of information security. The method introduces detection features from manual Android malware analysis reports, improves the traditional wrapper method of feature selection by means of an automated machine learning algorithm and an interpretable machine learning algorithm, and combines an identical-distribution test with a transfer learning algorithm. The method improves the interpretability of the Android malware detection model, which helps reverse analysts verify the model manually. It also reduces the impact of concept drift on detection accuracy, which helps the model maintain high accuracy over a long period at low cost. The method is used for detecting and analyzing malicious Android applications.

Description

Concept drift-oriented interpretable Android malware detection method

Technical Field

The invention relates to the technical field of information security, and in particular to a concept drift-oriented interpretable Android malware detection method.

Background

In the first quarter of 2021, the 360 Internet Security Center intercepted about 2.065 million newly added malicious program samples on mobile terminals, 426.5% more than in the same period of 2020, with a per-capita economic loss of 14,611 yuan. As of April 2021, the Android operating system held 76.91% of the mobile terminal market in China, and, compared with the iOS operating system, the open application ecosystem of the Android platform makes it more vulnerable to malware threats.

Existing Android malware detection techniques fall into three categories: signature-based detection, machine-learning-based static detection, and machine-learning-based application behavior detection. The sandbox mechanism of the Android system makes dynamic behavior monitoring of applications difficult on non-customized systems. Machine-learning-based static detection has become the mainstream Android malware detection method because it detects unknown malware with high accuracy and places low demands on device hardware.

However, the static detection technology based on machine learning has 3 main problems as follows:

1. The proportion of applications in application markets that request sensitive permissions is decreasing, and some malicious applications can complete an attack without requesting any new permission. A single permission feature, or a combination of features introduced without a logical rationale, is not sufficient to characterize malware.

2. As black-box machine learning algorithms reach higher and higher accuracy, malicious application detection places increasing demands on the interpretability and transparency of the model. Android malware reverse engineers need the model to provide the basis for its decisions, either to support manual analysis or to judge whether a model decision is reasonable.

3. Frequent Android version updates mean that applications built against the software development kit of each version all retain some market share. Because of concept drift, a machine learning model trained at the cost of a large number of samples detects Android malware from other periods poorly.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a concept drift-oriented interpretable Android malware detection method that improves the interpretability of the Android malware detection model and reduces the impact of concept drift on the accuracy of the detection model.

To solve the above technical problem, the invention adopts the following technical scheme:

A concept drift-oriented interpretable Android malware detection method comprises the following steps:

Step 1: collect a plurality of manual Android malware analysis reports to form a sample library of manual Android malware analysis reports;

Step 2: collect a plurality of malicious and benign Android application samples to form an initial Android application sample library, wherein the number of malicious samples equals the number of benign samples;

Step 3: extract high-frequency reverse-analysis words for Android malware from the manual analysis report sample library, wherein the top A valid words are used as the feature types of the detection model;

Step 4: based on the initial Android application sample library, use an automated machine learning algorithm to construct a screening feature vector for each feature type of the detection model and train feature component screening models, wherein the number of feature component screening models is A;

Step 5: for each feature component screening model, use an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component of the screening feature vector, wherein the top B components are used as a sub-feature vector of the detection model;

Step 6: merge all sub-feature vectors of the detection model to obtain the features used by the detection model; extract the data corresponding to these features from the initial Android application sample library to form an initial training data set;

Step 7: train an initial detection model on the initial training data set with a tree-based machine learning algorithm, and output the features used by the detection model as the basis for manual verification of the model;

Step 8: for an Android application of unknown security, extract the data corresponding to the features used by the detection model, input it into the trained initial detection model, and detect whether the application is Android malware;

Step 9: use crawler technology to collect Android malware samples from mainstream application markets and security websites at home and abroad, forming a model migration malware sample library, wherein the interval between the release date and the collection date of each malware sample is no more than C months, and the number of malware samples is D;

Step 10: extract the data corresponding to the features used by the detection model from the model migration malware sample library to form a model migration data set;

Step 11: calculate a test statistic from the model migration data set and the initial training data set with an identical-distribution test algorithm, and judge whether concept drift has occurred in Android malware;

Step 12: if concept drift has occurred in Android malware, migrate the initial detection model with a transfer-learning domain adaptation algorithm, iterating E times, and train a new detection model to replace the initial detection model;

Step 13: with the time interval of C months as the period, repeat steps 8 to 12 to update the detection model and detect Android malware.

The technical scheme of the invention is further improved as follows: in step 3, the high-frequency reverse-analysis words for Android malware are extracted with a word-frequency statistics algorithm, and the top A valid words are Android programming language keywords.

The technical scheme of the invention is further improved as follows: in step 4, the following sub-steps are included:

4.1 project one feature type used by the detection model from the initial Android application sample library;

4.2 if the feature has already been projected, select a feature type used by the detection model that has not yet been projected, and execute step 4.1;

4.3 if the feature has not been projected, use all distinct values of the feature contained in the projected data as the screening feature vector of the feature; construct a feature component screening data set, which comprises the sample feature vectors of all samples;

4.4 input the feature component screening data set into an automated machine learning algorithm, and select the most accurate of the output pipelines as the feature component screening model of the feature;

4.5 if any feature type has not yet produced a feature component screening model, execute step 4.1.

The technical scheme of the invention is further improved as follows: in step 4.4, the automated machine learning algorithm is the TPOT automated machine learning algorithm, and the selected pipeline is the most accurate output pipeline that applies a tree-based machine learning model.

The technical scheme of the invention is further improved as follows: in step 5, the sum of the mean absolute Shapley values of the top B components is not less than F times the sum of the mean absolute Shapley values of the remaining components, where F is a positive integer not less than 4.

The technical scheme of the invention is further improved as follows: in step 5, the interpretable machine learning algorithm is a SHAP algorithm.

The technical scheme of the invention is further improved as follows: in steps 6, 8 and 10, the data corresponding to the features used by the detection model is extracted as follows: the reverse-engineering tool Androguard is used to match each feature used by the detection model against the decompressed APK file of the Android application; if a feature appears in the decompressed APK file, its number of occurrences is recorded, otherwise 0 is recorded, which generates a sequence; 1 is appended to the sequence for a malicious sample and 0 otherwise, and the result is used as the sample feature vector of the detection model.

The technical scheme of the invention is further improved as follows: in step 7, the tree-based machine learning algorithm is the CatBoost algorithm.

The technical scheme of the invention is further improved as follows: in step 12, the domain adaptation algorithm is the JDA algorithm.

The technical scheme of the invention is further improved as follows: A is a positive integer not less than 4, B is a positive integer not less than 1, C is a positive integer not less than 1, D is a positive integer not less than 100, and E is a positive integer not greater than 5.

By adopting the above technical scheme, the invention achieves the following technical progress:

1. The invention extracts high-frequency words describing the attack flow from Android malware analysis reports and introduces multiple feature types at the source code level and the assembly instruction level, which makes the detection features more logical and better justified.

2. The invention combines the initial features, which keeps storage cost low and analysis fast while characterizing malware better.

3. The invention uses an automated machine learning algorithm to screen for the optimal tree-based machine learning classification model. Compared with the parameter tuning process of traditional machine learning model training, this improves the fit between the training data and the model and is more convenient and efficient.

4. The invention uses an interpretability algorithm to build an interpretation mechanism for the detection model. The screened features contribute strongly to the classification results of many training samples, which ensures that the detection model is interpretable and verifiable.

5. The invention introduces domain adaptation into the technical field of information security, specifically into Android malware detection. Using the existing data and detection model together with a small number of new Android malware samples, it keeps the detection accuracy of the model stable over time and effectively alleviates the concept drift problem in Android malware detection.

Drawings

FIG. 1 is a flow chart of the detection method of the invention;

FIG. 2 is a sub-flowchart of the construction of the feature component screening models in the invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and an embodiment:

As shown in FIG. 1, a concept drift-oriented interpretable Android malware detection method specifically includes the following steps:

Step 1: collect a sufficient number of manual Android malware analysis reports to form a sample library of manual Android malware analysis reports.

In this embodiment, Android malware analysis reports are sampled from the Kharon dataset to form the sample library of manual Android malware analysis reports. The reports are written in English and contain 4957 words in total.

Step 2: collect a sufficient number of malicious and benign Android application samples to form an initial Android application sample library, wherein the number of malicious samples equals the number of benign samples.

In this embodiment, 2900 malicious and 2900 benign Android applications are collected from the OmniDroid dataset, all in APK format. A sample is defined as malicious if more than 50% of the antivirus engine results on the VirusTotal website are positive, and as benign if 50% or more of the antivirus engine results on the VirusTotal website are negative.
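The labelling rule above can be sketched as follows. This is a minimal illustration, not the embodiment's code: the function name `label_sample` and its two integer arguments are hypothetical, and the VirusTotal engine counts are assumed to have been obtained beforehand.

```python
def label_sample(positive_engines: int, total_engines: int) -> int:
    """Return 1 (malicious) if more than half of the engines flag the sample, else 0 (benign).

    Hypothetical helper: the engine counts are assumed to come from a
    VirusTotal scan result that was fetched elsewhere.
    """
    if total_engines <= 0:
        raise ValueError("total_engines must be positive")
    # Integer comparison avoids floating-point rounding at exactly 50%.
    return 1 if 2 * positive_engines > total_engines else 0
```

With this rule a sample that is flagged by exactly half of the engines is labelled benign, which matches the "50% or more negative" definition.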

Step 3: extract high-frequency reverse-analysis words for Android malware from the manual analysis report sample library; the top A valid words, where A is a positive integer not less than 4, are used as the feature types of the detection model.

In this embodiment, a word-frequency statistics method is used to extract the high-frequency reverse-analysis words, and A is 4. The valid words used as feature types of the detection model are Android programming language keywords, extracted from the manual analysis report sample library after meaningless words such as articles, pronouns and quantifiers have been removed. The resulting feature types are permissions, API package names, intent names and Dalvik bytecode. The removed words include, but are not limited to: the, is, to, a, and, in, of, also, from.
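The word-frequency step can be sketched roughly as below. The stop-word set is the one listed above; the `KEYWORD_TO_TYPE` map, the tokenization rule and the function name are assumptions made for illustration, since the source does not say how a word is recognized as an Android programming keyword.

```python
import re
from collections import Counter

# Stop words named in the embodiment.
STOP_WORDS = {"the", "is", "to", "a", "and", "in", "of", "also", "from"}

# Hypothetical map from report vocabulary to candidate feature types.
KEYWORD_TO_TYPE = {
    "permission": "permission",
    "permissions": "permission",
    "api": "api_package",
    "intent": "intent",
    "intents": "intent",
    "bytecode": "dalvik_bytecode",
}


def top_feature_types(reports, a=4):
    """Count keyword occurrences over all reports and return the top-a feature types."""
    counts = Counter()
    for text in reports:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in STOP_WORDS:
                continue
            feature_type = KEYWORD_TO_TYPE.get(word)
            if feature_type is not None:
                counts[feature_type] += 1
    return [feature_type for feature_type, _ in counts.most_common(a)]
```

For instance, calling `top_feature_types(reports, a=4)` on a list of report strings returns at most four feature-type names ordered by frequency.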

Step 4: based on the initial Android application sample library, use an automated machine learning algorithm to construct a screening feature vector for each feature type of the detection model and train feature component screening models, wherein the number of feature component screening models is A.

As shown in FIG. 2, this step specifically comprises the following sub-steps:

4.1 Project one feature type used by the detection model from the initial Android application sample library.

4.2 If the feature has already been projected, select a feature type used by the detection model that has not yet been projected, and execute step 4.1.

4.3 If the feature has not been projected, use all distinct values of the feature contained in the projected data as the screening feature vector of the feature. Construct a feature component screening data set comprising the sample feature vectors of all samples.

In this embodiment, each sample in the initial Android application sample library contains 45 features, such as installation package name, file name, hash code, permissions, API package names, intent names and Dalvik bytecode. Projection yields the data of exactly one of the four features: permissions, API package names, intent names or Dalvik bytecode. The sample feature vector of a sample is built against the screening feature vector: when a component of the screening feature vector appears in the sample, the corresponding position is set to its number of occurrences, otherwise to 0, which generates a sequence; 1 is appended to the sequence for a malicious sample and 0 otherwise. For example, if the screening feature vector of Dalvik bytecode is ["shl-int", "long-to-int", "if-gt"] and the sample feature vector of a sample is [5, 3, 21, 1], then the sample is malicious and contains 5 "shl-int", 3 "long-to-int" and 21 "if-gt" Dalvik bytecode instructions. In this embodiment, the screening feature vectors of the four features (permissions, API package names, intent names, Dalvik bytecode) have 184, 4185, 223 and 436 dimensions, respectively.
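The construction in step 4.3 can be sketched as follows, reproducing the Dalvik bytecode example. The sample layout (a dict with one list per feature type and a boolean `malicious` flag) and both function names are assumptions for illustration only.

```python
from collections import Counter


def build_screening_vector(samples, feature_type):
    """Collect every distinct value of one feature type across all samples."""
    seen = {}
    for sample in samples:
        for value in sample[feature_type]:
            seen.setdefault(value, None)  # dict preserves first-seen order
    return list(seen)


def sample_feature_vector(sample, feature_type, screening_vector):
    """Occurrence count per screening component, followed by the 1/0 label."""
    counts = Counter(sample[feature_type])
    vector = [counts[component] for component in screening_vector]
    vector.append(1 if sample["malicious"] else 0)
    return vector


# Hypothetical sample mirroring the example in the text.
sample = {
    "dalvik": ["shl-int"] * 5 + ["long-to-int"] * 3 + ["if-gt"] * 21,
    "malicious": True,
}
screening = build_screening_vector([sample], "dalvik")
vector = sample_feature_vector(sample, "dalvik", screening)
```

Here `screening` is `["shl-int", "long-to-int", "if-gt"]` and `vector` is `[5, 3, 21, 1]`, matching the example above.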

4.4 Input the feature component screening data set into an automated machine learning algorithm, and select the most accurate of the output pipelines as the feature component screening model of the feature.

In this embodiment, the TPOT automated machine learning algorithm is used, and the selected pipeline is the most accurate output pipeline that applies a tree-based machine learning model.

In this embodiment, step 4 yields four feature component screening models, one each for permissions, API package names, intent names and Dalvik bytecode.

Step 5: for each feature component screening model, use an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component of the screening feature vector; the top B components, where B is a positive integer not less than 1, are used as a sub-feature vector of the detection model. The sum of the mean absolute Shapley values of the top B components is not less than F times the sum of the mean absolute Shapley values of the remaining components, where F is a positive integer not less than 4.

In this embodiment, B is 9 and the interpretable machine learning algorithm is the SHAP algorithm. The calculated sub-feature vector for the permission feature is ["SEND_SMS", "GET_TASKS", "READ_PHONE_STATE", "RECEIVE_BOOT_COMPLETED", "RECEIVE_SMS", "INSTALL_SHORTCUT", "GET_ACCOUNTS", "VIBRATE", "RECEIVE"]. The sub-feature vector for the API package name feature is [ "java.util.current.locks", "android.test", "android.text", "android.back", "android.text", "android.voltage", "android.content, android.input.voltage", "android.input" ]. The sub-feature vector for the intent name feature is [ "android.provider.Telephony.SMS_RECEIVED", "android.intent.action.BOOT_COMPLETED", "android.action.VIEW", "com.google.android.2 dm.RECEIVE", "android.action.CREATE_SHORTCUT", "android.action.PACKAGE_ADED", "android.action.DATA_SMS_RECEIVED", "android.action.INTION.INTION.INTITE", "Dalvine_STATE" ]. The sub-feature vector for the Dalvik bytecode feature is [ "div-at", "xor-int/li 8", "and-int/2-index-bit", "16", "sensor-factor-16", "sensor-factor-16" ].
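The selection rule of step 5 can be sketched as below. The sketch assumes the per-sample SHAP values have already been computed by a SHAP explainer and are passed in as plain lists; choosing the smallest B that satisfies the F-times condition is one possible reading of the rule, and the function names are hypothetical.

```python
def mean_abs_shap(shap_rows):
    """Mean absolute SHAP value per component; shap_rows is samples x components."""
    n_samples = len(shap_rows)
    n_components = len(shap_rows[0])
    return [
        sum(abs(row[j]) for row in shap_rows) / n_samples
        for j in range(n_components)
    ]


def select_top_components(names, importances, f=4):
    """Return the shortest top-ranked prefix whose importance sum is at
    least f times the importance sum of the remaining components."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    total = sum(importances)
    running = 0.0
    for b, (_, value) in enumerate(ranked, start=1):
        running += value
        if running >= f * (total - running):
            return [name for name, _ in ranked[:b]]
    return [name for name, _ in ranked]
```

For importances `[5.0, 1.0, 0.5]` and F = 4, the first component alone is not enough (5.0 < 4 x 1.5), while the first two are (6.0 >= 4 x 0.5), so two components are kept.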

Step 6: merge all sub-feature vectors of the detection model to obtain the features used by the detection model; extract the data corresponding to these features from the initial Android application sample library, and horizontally stitch the sub-feature vectors in any order to form the initial training data set.

The reverse-engineering tool Androguard is used to match each feature used by the detection model against the decompressed APK file of the Android application. If a feature appears in the decompressed APK file, its number of occurrences is recorded; otherwise 0 is recorded, which generates a sequence. 1 is appended to the sequence for a malicious sample and 0 otherwise, and the result is used as the sample feature vector of the detection model.

In this embodiment, all sub-feature vectors are merged by horizontal stitching, and the resulting detection model uses 36 features. The initial training data set is composed in the same way as the feature component screening data set in step 4.3.
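Horizontal stitching of the four sub-feature vectors can be sketched as follows. The component names are placeholders rather than the selected features of the embodiment, and the nested-dict layout of the per-sample counts is an assumption.

```python
def merge_sub_vectors(sub_vectors):
    """Horizontally stitch per-type sub-feature vectors into one feature list."""
    merged = []
    for feature_type, components in sub_vectors.items():
        for component in components:
            merged.append((feature_type, component))
    return merged


def training_row(sample_counts, merged_features, malicious):
    """Occurrence count for every merged feature, followed by the 1/0 label."""
    row = [
        sample_counts.get(feature_type, {}).get(component, 0)
        for feature_type, component in merged_features
    ]
    row.append(1 if malicious else 0)
    return row


# Placeholder names: four feature types with B = 9 components each.
sub_vectors = {
    feature_type: [f"{feature_type}_{i}" for i in range(9)]
    for feature_type in ("permission", "api_package", "intent", "dalvik")
}
features = merge_sub_vectors(sub_vectors)
row = training_row({"permission": {"permission_0": 2}}, features, malicious=True)
```

With A = 4 and B = 9 the merged list has 36 entries, so each training row has 36 counts plus one label.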

Step 7: train an initial detection model on the initial training data set with a tree-based machine learning algorithm, and output the features used by the detection model as the basis for manual verification of the model.

The tree-based machine learning algorithm in this embodiment is the CatBoost algorithm.

Step 8: for an Android application of unknown security, extract the data corresponding to the features used by the detection model, input it into the trained initial detection model, and detect whether the application is Android malware.

In this embodiment, the "get_permission", "get_services", "get_methods", "get_classes" and "get_instructions" commands of the Python-integrated Androguard tool are used to extract the permission, API package name, intent name and Dalvik bytecode features from the APK file, forming the sample to be tested. A sample vector to be tested is then constructed: if a feature used by the detection model appears in the sample, the corresponding component of the vector is set to its number of occurrences; otherwise it is set to 0. The initial detection model analyzes the sample vector and outputs a detection result. If the result is 1, the Android application under test is malware; if the result is 0, it is benign.

Step 9: use crawler technology to collect Android malware samples from mainstream application markets and security websites at home and abroad, forming a model migration malware sample library, wherein the interval between the release date and the collection date of each malware sample is no more than C months, C being a positive integer not less than 1, and the number of malware samples is D, D being a positive integer not less than 100.

In this embodiment, C is 12, the crawled website is GitHub, and the test years are 2019 and 2020. The model migration malware sample library contains 149 samples for 2019 and 181 samples for 2020.
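The sample window of step 9 can be sketched as below, assuming each crawled sample carries a release date. The `published` key, the function names and the choice to raise an error when fewer than D samples remain are all hypothetical.

```python
from datetime import date


def within_months(published, collected, c_months):
    """True if `published` is no more than c_months before `collected`."""
    if published > collected:
        return False
    months = (collected.year - published.year) * 12 + (collected.month - published.month)
    if months != c_months:
        return months < c_months
    # Exactly c_months apart by calendar month: compare the day of month.
    return collected.day <= published.day


def select_migration_samples(samples, collected, c_months=12, d_min=100):
    """Keep samples released within the window and require at least d_min of them."""
    recent = [s for s in samples if within_months(s["published"], collected, c_months)]
    if len(recent) < d_min:
        raise ValueError(f"only {len(recent)} samples in window, need {d_min}")
    return recent
```

A caller would pass the crawl date as `collected` and keep the defaults `c_months=12` and `d_min=100` to mirror the embodiment's C and the lower bound on D.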

Step 10: extract the data corresponding to the features used by the detection model from the model migration malware sample library to form a model migration data set.

In this embodiment, the data corresponding to the features used by the detection model is extracted in the same way as in step 8.

Step 11: calculate a test statistic from the model migration data set and the initial training data set with an identical-distribution test algorithm, and judge whether concept drift has occurred in Android malware.

In this embodiment, the identical-distribution test algorithm is the Mann-Whitney U test, and the detection threshold is 5.
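A dependency-free sketch of the Mann-Whitney U test is given below for illustration. It uses midranks for ties and a normal approximation without continuity correction. The drift rule (flag drift when any feature column has p below `alpha`) and the default `alpha=0.05` are assumptions; the source states only that the detection threshold is 5.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test; returns (U, approximate p-value)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda p: p[0])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = mid_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0  # all values tied: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


def has_concept_drift(train_columns, new_columns, alpha=0.05):
    """Assumed rule: drift if any feature column differs significantly."""
    return any(
        mann_whitney_u(a, b)[1] < alpha
        for a, b in zip(train_columns, new_columns)
    )
```

Each element of `train_columns` and `new_columns` is the list of values of one feature over the initial training data set and the model migration data set, respectively.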

Step 12: if concept drift has occurred in Android malware, migrate the initial detection model with a transfer-learning domain adaptation algorithm, iterating E times, where E is a positive integer not greater than 5; train new detection models and replace the initial detection model with the one that achieves the highest accuracy across the E iterations.

The domain adaptation algorithm used in this embodiment is the JDA algorithm, and E is 5.
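The "keep the most accurate of E iterations" logic of step 12 can be sketched as below. JDA itself is not implemented here: `adapt_once` is a hypothetical callable standing in for one adaptation iteration, and `evaluate` is a hypothetical callable returning the accuracy of a model.

```python
def adapt_and_select(initial_model, adapt_once, evaluate, e=5):
    """Run e adaptation iterations and return the most accurate model seen.

    adapt_once(model) -> model  : placeholder for one domain adaptation iteration
    evaluate(model)   -> float  : placeholder accuracy measurement
    """
    best_model, best_accuracy = None, -1.0
    model = initial_model
    for _ in range(e):
        model = adapt_once(model)
        accuracy = evaluate(model)
        if accuracy > best_accuracy:
            best_model, best_accuracy = model, accuracy
    return best_model, best_accuracy
```

Because the best model is tracked separately from the current one, a later iteration that lowers accuracy does not overwrite an earlier, better model.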

Step 13: with the time interval of C months as the period, repeat steps 8 to 12 to update the detection model and detect Android malware.

In this embodiment, data from 2018, 2019 and 2020 are used as test data. The accuracy of the initial detection model on the 2018 data is 96%. On the 2019 data, the accuracy is 34% before steps 8 to 12 are executed and rises to 80% afterwards; on the 2020 data, it is 43% before steps 8 to 12 are executed and rises to 87% afterwards.

In summary, the invention introduces detection features from manual Android malware analysis reports, improves the traditional wrapper method of feature selection by means of an automated machine learning algorithm and an interpretable machine learning algorithm, and combines an identical-distribution test with a transfer learning algorithm, thereby improving the interpretability of the Android malware detection model and reducing the impact of concept drift on the accuracy of the detection model.

Claims

1. A concept drift-oriented interpretable Android malware detection method, characterized by comprising the following steps:

2. The concept drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 3, the high-frequency reverse-analysis words for Android malware are extracted with a word-frequency statistics algorithm, and the top A valid words are Android programming language keywords.

3. The concept drift-oriented interpretable Android malware detection method of claim 1, wherein: step 4 includes the following sub-steps:

4. The concept drift-oriented interpretable Android malware detection method of claim 3, wherein: in step 4.4, the automated machine learning algorithm is the TPOT automated machine learning algorithm, and the selected pipeline is the most accurate output pipeline that applies a tree-based machine learning model.

5. The concept drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 5, the sum of the mean absolute Shapley values of the top B components is not less than F times the sum of the mean absolute Shapley values of the remaining components, where F is a positive integer not less than 4.

6. The concept drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 5, the interpretable machine learning algorithm is the SHAP algorithm.

7. The concept drift-oriented interpretable Android malware detection method of claim 1, wherein: in steps 6, 8 and 10, the data corresponding to the features used by the detection model is extracted as follows: the reverse-engineering tool Androguard is used to match each feature used by the detection model against the decompressed APK file of the Android application; if a feature appears in the decompressed APK file, its number of occurrences is recorded, otherwise 0 is recorded, which generates a sequence; 1 is appended to the sequence for a malicious sample and 0 otherwise, and the result is used as the sample feature vector of the detection model.

8. The concept drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 7, the tree-based machine learning algorithm is the CatBoost algorithm.

9. The concept drift-oriented interpretable Android malware detection method of claim 1, wherein: in step 12, the domain adaptation algorithm is the JDA algorithm.

10. The concept drift-oriented interpretable Android malware detection method of claim 1, wherein: A is a positive integer not less than 4, B is a positive integer not less than 1, C is a positive integer not less than 1, D is a positive integer not less than 100, and E is a positive integer not greater than 5.