CN113901463B - Concept drift-oriented interpretable Android malicious software detection method - Google Patents

Publication number
CN113901463B
Authority
CN
China
Prior art keywords
android
feature
detection model
model
interpretable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111033119.5A
Other languages
Chinese (zh)
Other versions
CN113901463A (en
Inventor
张炳
文峥
高原
赵旭阳
任家东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202111033119.5A priority Critical patent/CN113901463B/en
Publication of CN113901463A publication Critical patent/CN113901463A/en
Application granted granted Critical
Publication of CN113901463B publication Critical patent/CN113901463B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/563Static detection by source code analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/031Protect user input by software means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a concept drift-oriented interpretable Android malicious software detection method, which belongs to the technical field of information security. The method introduces detection features from manual Android malicious software analysis reports, improves the traditional feature wrapper method by means of an automated machine learning algorithm and an interpretable machine learning algorithm, and combines an identical-distribution test with a transfer learning algorithm. The method improves the interpretability of the Android malicious software detection model, helps reverse analysts verify the detection model manually, reduces the influence of concept drift on the accuracy of the detection model, helps the detection model maintain high accuracy over a long period at low cost, and is used for detecting and analyzing malicious Android application software.

Description

Concept drift-oriented interpretable Android malicious software detection method
Technical Field
The invention relates to the technical field of information security, in particular to a concept drift-oriented interpretable Android malicious software detection method.
Background
In the first quarter of 2021, the 360 Internet Security Center intercepted about 206.5 newly added malicious program samples on mobile terminals, 426.5% more than in the same period of 2020, causing an economic loss of 14611 yuan. As of April 2021, the Android operating system held 76.91% of the mobile terminal market in China; compared with the iOS operating system, the open application software ecosystem of the Android platform makes it more vulnerable to malicious software threats.
Existing Android malicious software detection technologies fall into three categories: signature-based detection, machine-learning-based static detection, and machine-learning-based application behavior detection. The sandbox mechanism of the Android system makes it difficult to monitor the dynamic behavior of applications on non-customized systems. Machine-learning-based static detection has become the mainstream Android malicious software detection method because it detects unknown malicious software with high accuracy and places low demands on device hardware.
However, machine-learning-based static detection has three main problems:
1. The proportion of applications in application markets that request sensitive permissions is decreasing, and some malicious applications can complete an attack without requesting new permissions. A single permission feature, or a combination of features introduced without a logical basis, is not sufficient to characterize malicious software.
2. As black-box machine learning algorithms achieve ever higher accuracy, malicious application detection places ever higher demands on the interpretability and transparency of the model. Android malicious software reverse engineers need the model to provide the basis for its decisions, either to support manual analysis or to judge whether a decision is reasonable.
3. Because Android system versions are updated frequently, Android applications developed with each version of the software development kit hold a certain market share. Owing to concept drift, a machine learning model trained at the cost of a large number of samples detects Android malicious software from other periods poorly.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a concept drift-oriented interpretable Android malicious software detection method that improves the interpretability of the Android malicious software detection model and reduces the influence of concept drift on the accuracy of the detection model.
In order to solve the above technical problem, the invention adopts the following technical scheme:
A concept drift-oriented interpretable Android malicious software detection method comprises the following steps:
step 1, collecting a plurality of manual Android malicious application software analysis reports to form an Android malicious application software manual analysis report sample library;
step 2, collecting a plurality of malicious and benign Android application software samples to form an initial Android application software sample library, wherein the number of malicious samples equals the number of benign samples;
step 3, extracting the high-frequency words of Android malicious application software reverse analysis from the manual analysis report sample library, and taking the top A valid words as the feature types used by the detection model;
step 4, based on the initial Android application software sample library, using an automated machine learning algorithm to construct a screening feature vector for each feature type used by the detection model and to train a feature component screening model for it, wherein the number of feature component screening models is A;
step 5, for each feature component screening model, using an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component of the screening feature vector, and taking the top B components as a sub-feature vector used by the detection model;
step 6, merging all sub-feature vectors used by the detection model into the features used by the detection model, and extracting the data corresponding to these features from the initial Android application software sample library to form an initial training data set;
step 7, training an initial detection model on the initial training data set with a tree-model-based machine learning algorithm, and outputting the features used by the detection model as the basis for manually verifying the detection model;
step 8, for Android application software of unknown security, extracting the data corresponding to the features used by the detection model, inputting it into the trained initial detection model, and detecting whether the application is Android malicious software;
step 9, using crawler technology to collect Android malicious software samples from mainstream application markets and security websites at home and abroad to form a model migration malicious software sample library, wherein the interval between the release time of each sample and the collection date is not more than C months, and the number of samples is D;
step 10, extracting the data corresponding to the features used by the detection model from the model migration malicious software sample library to form a model migration data set;
step 11, calculating a test statistic from the model migration data set and the initial training data set with an identical-distribution test algorithm, and judging whether the Android malicious software exhibits concept drift;
step 12, if the Android malicious software exhibits concept drift, migrating the initial detection model with a transfer learning domain adaptation algorithm, iterating E times, and training a new detection model to replace the initial detection model;
step 13, repeating steps 8 to 12 with a period of C months, updating the detection model, and detecting Android malicious software.
The technical scheme of the invention is further improved as follows: in step 3, the high-frequency words of Android malicious application software reverse analysis are extracted with a word frequency statistics algorithm, and the top A valid words are Android programming language keywords.
The technical scheme of the invention is further improved as follows: step 4 includes the following sub-steps:
4.1 projecting one feature type used by the detection model from the initial Android application software sample library;
4.2 if that feature type has already been projected, selecting a feature type used by the detection model that has not yet been projected, and executing step 4.1;
4.3 if that feature type has not been projected before, taking all distinct values of the feature contained in the projected data as the screening feature vector of the feature, and constructing a feature component screening data set that comprises the sample feature vectors of all samples;
4.4 inputting the feature component screening data set into an automated machine learning algorithm, and selecting the most accurate of the output pipelines as the feature component screening model of the feature;
4.5 if any feature type has not yet produced a feature component screening model, executing step 4.1.
The technical scheme of the invention is further improved as follows: in step 4.4, the automated machine learning algorithm is the TPOT automated machine learning algorithm, and the pipeline selected as the most accurate among the output pipelines applies a tree-based machine learning model.
The technical scheme of the invention is further improved as follows: in step 5, the sum of the mean absolute Shapley values of the top B components is not less than F times the sum of the mean absolute Shapley values of the remaining components, and F is a positive integer not less than 4.
The technical scheme of the invention is further improved as follows: in step 5, the interpretable machine learning algorithm is the SHAP algorithm.
The technical scheme of the invention is further improved as follows: in steps 6, 8 and 10, the data corresponding to the features used by the detection model is extracted as follows: according to the features used by the detection model, the reverse-engineering tool Androguard is used to match them against the decompressed files of the Android application software APK; if a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded, and otherwise 0 is recorded, generating a sequence; 1 is appended to the sequence for a malicious sample and 0 otherwise, and the result is the detection model sample feature vector.
The technical scheme of the invention is further improved as follows: in step 7, the tree-model-based machine learning algorithm is the CatBoost algorithm.
The technical scheme of the invention is further improved as follows: in step 12, the domain adaptation algorithm is the JDA algorithm.
The technical scheme of the invention is further improved as follows: A is a positive integer not less than 4, B is a positive integer not less than 1, C is a positive integer not less than 1, D is a positive integer not less than 100, and E is a positive integer not more than 5.
By adopting the above technical scheme, the invention achieves the following technical progress:
1. The invention extracts the high-frequency words of the attack flow from Android malicious software analysis reports and introduces multiple features at the source code level and the assembly instruction level, which makes the detection features more logical and reasonable.
2. The invention combines the initial features, which keeps storage cost low and analysis fast while characterizing malicious software better.
3. The invention uses an automated machine learning algorithm to select the optimal tree-based machine learning classification model; compared with the parameter tuning process of traditional machine learning model training, this improves the fit between the training data and the model and improves convenience and efficiency.
4. The invention uses an interpretability algorithm to build an interpretation mechanism for the detection model; the screened features contribute strongly to the classification results of many training samples, which ensures that the detection model is interpretable and verifiable.
5. The invention introduces domain adaptation into the technical field of information security, in particular into Android malicious software detection; using a small amount of new Android malicious software together with the existing data and detection model, it keeps the detection accuracy of the model stable over time and effectively mitigates the concept drift problem in Android malicious software detection.
Drawings
FIG. 1 is a flow chart of the detection method of the invention;
FIG. 2 is a sub-flowchart of the construction of the feature component screening model in the invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and examples.
As shown in FIG. 1, the concept drift-oriented interpretable Android malicious software detection method specifically includes the following steps:
Step 1, collecting a sufficient number of manual Android malicious application software analysis reports to form an Android malicious application software manual analysis report sample library.
In this embodiment, Android malicious software analysis reports are sampled from the Kharon data set to form the manual analysis report sample library; the reports are written in English and contain 4957 words in total.
Step 2, collecting a sufficient number of malicious and benign Android application software samples to form an initial Android application software sample library, wherein the number of malicious samples equals the number of benign samples.
In this embodiment, 2900 malicious and 2900 benign Android application software samples, all in APK format, are collected from the OmniDroid data set. A sample is defined as malicious if more than 50% of the antivirus engine detection results on the VirusTotal website are positive, and as benign if 50% or more of the antivirus engine detection results on the VirusTotal website are negative.
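The labeling rule of this embodiment can be illustrated with a minimal sketch. The function name `label_sample` and its two arguments are hypothetical; the sketch only assumes that, for each sample, the number of antivirus engines reporting a positive result and the total number of engines are already known.

```python
def label_sample(positive_engines: int, total_engines: int) -> int:
    """Return 1 (malicious) or 0 (benign) from antivirus engine results.

    Hypothetical helper: a sample is malicious when more than 50% of the
    engines are positive; otherwise at least 50% are negative and the
    sample is treated as benign.
    """
    if total_engines <= 0:
        raise ValueError("total_engines must be positive")
    return 1 if positive_engines / total_engines > 0.5 else 0


if __name__ == "__main__":
    print(label_sample(40, 60))  # more than half positive: malicious
    print(label_sample(30, 60))  # exactly half positive: benign
```

Under this reading, a sample with exactly 50% positive results falls on the benign side, which matches the two definitions given in the embodiment.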
Step 3, extracting the high-frequency words of Android malicious application software reverse analysis from the manual analysis report sample library, and taking the top A valid words as the feature types used by the detection model, where A is a positive integer not less than 4.
In this embodiment, a word frequency statistics method is used to extract the high-frequency words, and A is 4. The valid words used as feature types of the detection model are Android programming language keywords; after meaningless words such as articles, pronouns and quantifiers are removed from the manual analysis report sample library, the feature types extracted for the detection model are permissions, API package names, intent names and Dalvik bytecode. The removed words include, but are not limited to: the, is, to, a, and, in, of, also, from.
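The word frequency statistics of step 3 can be sketched as follows. This is only an illustration under stated assumptions: the function `top_keywords`, the tokenization by lowercase letters, and the `candidate_keywords` set (standing in for "Android programming language keywords") are hypothetical and not taken from the patent; only the removed-word list comes from the embodiment.

```python
import re
from collections import Counter

# Removed words listed in the embodiment (the list is not exhaustive there).
STOP_WORDS = {"the", "is", "to", "a", "and", "in", "of", "also", "from"}


def top_keywords(reports, candidate_keywords, top_a=4):
    """Count candidate keywords across all reports and return the top A.

    `candidate_keywords` is an assumed whitelist of valid words; tokens
    outside it, and all stop words, are ignored.
    """
    counts = Counter()
    for text in reports:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS and token in candidate_keywords:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_a)]


if __name__ == "__main__":
    reports = [
        "The malware requests a permission and the permission is abused via an intent.",
        "An intent triggers the API; the bytecode calls the API and the permission.",
    ]
    keywords = {"permission", "api", "intent", "bytecode", "service"}
    print(top_keywords(reports, keywords))
```

A real report library would likely also need normalization of plurals and multi-word terms, which this sketch leaves out.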
Step 4, based on the initial Android application software sample library, using an automated machine learning algorithm to construct a screening feature vector for each feature type used by the detection model and to train a feature component screening model for it, wherein the number of feature component screening models is A.
As shown in FIG. 2, step 4 specifically comprises the following sub-steps:
4.1 projecting one feature type used by the detection model from the initial Android application software sample library.
4.2 if that feature type has already been projected, selecting a feature type used by the detection model that has not yet been projected, and executing step 4.1.
4.3 if that feature type has not been projected before, taking all distinct values of the feature contained in the projected data as the screening feature vector of the feature. A feature component screening data set is constructed, comprising the sample feature vectors of all samples.
In this embodiment, each sample in the initial Android application software sample library contains 45 features, such as installation package name, file name, HASH code, permissions, API package names, intent names and Dalvik bytecode; projection yields the data of exactly one of the four features (permissions, API package names, intent names, Dalvik bytecode). In the sample feature vector of a sample, if a component of the screening feature vector appears in the sample, the corresponding position records its number of occurrences; otherwise it records 0. This generates a sequence, to which 1 is appended for a malicious sample and 0 otherwise. For example, if the screening feature vector of Dalvik bytecode is ["shl-int", "long-to-int", "if-gt"] and the sample feature vector of a sample is [5,3,21,1], the sample is malicious and contains 5 "shl-int", 3 "long-to-int" and 21 "if-gt" Dalvik bytecode instructions. In this embodiment, the screening feature vectors of the four features (permissions, API package names, intent names, Dalvik bytecode) have 184, 4185, 223 and 436 dimensions, respectively.
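The construction of a sample feature vector in step 4.3 can be sketched as below, reusing the Dalvik bytecode example of this embodiment. The function name `build_sample_vector` and the flat list of extracted items are assumptions made for illustration.

```python
from collections import Counter


def build_sample_vector(screening_vector, sample_items, is_malicious):
    """Count each screening component in the sample, then append the label.

    `sample_items` is an assumed flat list of the values extracted from one
    sample for a single feature type (for example, its Dalvik instructions).
    """
    counts = Counter(sample_items)
    vector = [counts[component] for component in screening_vector]
    vector.append(1 if is_malicious else 0)  # 1 = malicious, 0 = benign
    return vector


if __name__ == "__main__":
    screening = ["shl-int", "long-to-int", "if-gt"]
    items = ["shl-int"] * 5 + ["long-to-int"] * 3 + ["if-gt"] * 21 + ["nop"] * 2
    print(build_sample_vector(screening, items, True))  # [5, 3, 21, 1]
```

Items that are not components of the screening feature vector (here the two "nop" entries) do not contribute to the result.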
4.4 inputting the feature component screening data set into an automated machine learning algorithm, and selecting the most accurate of the output pipelines as the feature component screening model of the feature.
In this embodiment, the TPOT automated machine learning algorithm is used, and the pipeline selected as the most accurate among the output pipelines applies a tree-based machine learning model.
4.5 if any feature type has not yet produced a feature component screening model, executing step 4.1.
In this embodiment, step 4 yields four feature component screening models, one each for permissions, API package names, intent names and Dalvik bytecode.
Step 5, for each feature component screening model, using an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component of the screening feature vector, and taking the top B components as a sub-feature vector used by the detection model, where B is a positive integer not less than 1. The sum of the mean absolute Shapley values of the top B components is not less than F times the sum of the mean absolute Shapley values of the remaining components, where F is a positive integer not less than 4.
In this embodiment, B is 9 and the interpretable machine learning algorithm is the SHAP algorithm. The calculated sub-feature vector of permissions used by the detection model is [ "SEND_SMS", "GET_TASKS", "READ_PHONE_STATE", "RECEIVE_BOOT_COMPLETED", "RECEIVE_SMS", "INSTALL_SHORTCUT", "GET_ACCOUNTS", "VIBRATE", "RECEIVE" ]; the sub-feature vector of API package names is [ "java.util.concurrent.locks", "android.test", "android.text", "android.back", "android.text", "android.voltage", "android.content, android.input.voltage", "android.input" ]; the sub-feature vector of intent names is [ "android.provider.Telephony.SMS_RECEIVED", "android.intent.action.BOOT_COMPLETED", "android.action.VIEW", "com.google.android.2 dm.RECEIVE", "android.action.CREATE_SHORTCUT", "android.action.PACKAGE_ADDED", "android.action.DATA_SMS_RECEIVED", "android.action.INTION.INTION.INTITE", "Dalvine_STATE" ]; and the sub-feature vector of Dalvik bytecode is [ "div-at", "xor-int/li 8", "and-int/2-index-bit", "16", "sensor-factor-16", and "sensor-factor-16".
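The ranking in step 5 can be sketched as follows. The sketch assumes the per-sample SHAP values have already been computed (for example by a SHAP explainer applied to the feature component screening model) and are supplied as a plain matrix with one row per sample and one column per component. The function names are hypothetical, and choosing the smallest B that satisfies the F-times condition is one possible reading; the embodiment simply fixes B at 9.

```python
def mean_abs_shap(shap_matrix):
    """Mean absolute SHAP value of each component (column) over all samples."""
    n_samples = len(shap_matrix)
    n_components = len(shap_matrix[0])
    return [
        sum(abs(row[j]) for row in shap_matrix) / n_samples
        for j in range(n_components)
    ]


def select_top_components(names, shap_matrix, f=4):
    """Return the smallest top-ranked set whose summed score is >= f times the rest."""
    scores = mean_abs_shap(shap_matrix)
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    total = sum(score for _, score in ranked)
    running = 0.0
    for b, (_, score) in enumerate(ranked, start=1):
        running += score
        if running >= f * (total - running):
            return [name for name, _ in ranked[:b]]
    return [name for name, _ in ranked]


if __name__ == "__main__":
    names = ["SEND_SMS", "VIBRATE", "INTERNET"]
    shap_matrix = [[0.5, 0.4, 0.1], [-0.5, -0.4, -0.1]]  # illustrative values only
    print(select_top_components(names, shap_matrix))
```

With the illustrative values above, the first component alone does not reach four times the remainder, so the sketch keeps the top two.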
Step 6, merging all sub-feature vectors used by the detection model into the features used by the detection model; extracting the data corresponding to these features from the initial Android application software sample library, and horizontally splicing the sub-feature vectors used by the detection model in any order to form an initial training data set.
According to the features used by the detection model, the reverse-engineering tool Androguard is used to match them against the decompressed files of the Android application software APK; if a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded, and otherwise 0 is recorded, generating a sequence; 1 is appended to the sequence for a malicious sample and 0 otherwise, and the result is the detection model sample feature vector.
In this embodiment, the sub-feature vectors used by the detection model are merged by horizontal splicing, and the detection model uses 36 features. The initial training data set is composed in the same way as the feature component screening data set in step 4.3.
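The horizontal splicing of step 6 amounts to concatenating the four sub-feature vectors. A minimal sketch follows; the function name and the placeholder component names are hypothetical, and the dictionary order stands in for the "any order" allowed by the step.

```python
def merge_sub_vectors(sub_vectors):
    """Concatenate sub-feature vectors in dictionary insertion order."""
    merged = []
    for feature_type in sub_vectors:
        merged.extend(sub_vectors[feature_type])
    return merged


if __name__ == "__main__":
    # Placeholder component names; B = 9 components for each of A = 4 feature types.
    sub_vectors = {
        feature_type: [f"{feature_type}_{i}" for i in range(9)]
        for feature_type in ("permission", "api_package", "intent", "dalvik")
    }
    features = merge_sub_vectors(sub_vectors)
    print(len(features))  # 36
```

The same order must then be used when extracting sample vectors for training, detection and model migration, so that each position always refers to the same feature.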
Step 7, training an initial detection model on the initial training data set with a tree-model-based machine learning algorithm, and outputting the features used by the detection model as the basis for manually verifying the detection model.
In this embodiment, the tree-model-based machine learning algorithm is the CatBoost algorithm.
Step 8, for Android application software of unknown security, extracting the data corresponding to the features used by the detection model, inputting it into the trained initial detection model, and detecting whether the application is Android malicious software.
According to the features used by the detection model, the reverse-engineering tool Androguard is used to match them against the decompressed files of the Android application software APK; if a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded, and otherwise 0 is recorded, generating a sequence; 1 is appended to the sequence for a malicious sample and 0 otherwise, and the result is the detection model sample feature vector.
In this embodiment, the "get_permission", "get_services", "get_methods", "get_classes" and "get_instructions" commands of the Python-based Androguard tool are used to extract permission, API package name, intent name and Dalvik bytecode features from the APK file, forming the sample to be tested. A sample vector to be tested is then constructed: if a feature used by the detection model appears in the sample, the corresponding component of the vector records its number of occurrences; otherwise it records 0. The initial detection model analyzes the sample vector to be tested and outputs a detection result. If the detection result is 1, the Android application software under test is malicious software; if the detection result is 0, it is benign software.
Step 9, using crawler technology to collect Android malicious software samples from mainstream application markets and security websites at home and abroad to form a model migration malicious software sample library, wherein the interval between the release time of each sample and the collection date is not more than C months, C being a positive integer not less than 1, and the number of samples is D, D being a positive integer not less than 100.
In this embodiment, C is 12, the crawled website is GitHub, and the test years are 2019 and 2020; the model migration malicious software sample library contains 149 samples for 2019 and 181 samples for 2020.
Step 10, extracting the data corresponding to the features used by the detection model from the model migration malicious software sample library to form a model migration data set.
According to the features used by the detection model, the reverse-engineering tool Androguard is used to match them against the decompressed files of the Android application software APK; if a feature used by the detection model appears in the decompressed files, its number of occurrences is recorded, and otherwise 0 is recorded, generating a sequence; 1 is appended to the sequence for a malicious sample and 0 otherwise, and the result is the detection model sample feature vector.
In this embodiment, the data corresponding to the features used by the detection model is extracted in the same way as in step 8.
Step 11, calculating a test statistic from the model migration data set and the initial training data set with an identical-distribution test algorithm, and judging whether the Android malicious software exhibits concept drift.
In this embodiment, the identical-distribution test algorithm is the Mann-Whitney U test, and the detection threshold is 5.
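A Mann-Whitney U test on one feature column can be sketched in plain Python as below. Several points are assumptions rather than details from the patent: the test is applied to a single feature at a time, the p-value uses the normal approximation with tie correction and no continuity correction, and the drift decision compares the p-value with a hypothetical significance level `alpha`. How the embodiment's threshold of 5 maps onto such a parameter is not stated in the text.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda p: p[0])
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


def has_drift(source_column, target_column, alpha=0.05):
    """Flag drift when the two samples differ significantly (alpha is assumed)."""
    _, p_value = mann_whitney_u(source_column, target_column)
    return p_value < alpha


if __name__ == "__main__":
    old_counts = [1, 2, 3, 4, 5]
    new_counts = [6, 7, 8, 9, 10]
    print(mann_whitney_u(old_counts, new_counts)[0])
    print(has_drift(old_counts, new_counts))
```

For small samples an exact test would be more appropriate than the normal approximation; a statistics library can provide one.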
Step 12, if the Android malicious software exhibits concept drift, migrating the initial detection model with a transfer learning domain adaptation algorithm, iterating E times, where E is a positive integer not more than 5, training new detection models, and replacing the initial detection model with the most accurate detection model among the E iterations.
In this embodiment, the domain adaptation algorithm is the JDA algorithm, and E is 5.
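The selection logic of step 12 (iterate E times and keep the most accurate model) can be sketched independently of the adaptation algorithm itself. In the sketch below, JDA is not implemented: `adapt_once` and `evaluate` are hypothetical callables standing in for one adaptation iteration and for an accuracy measurement, and whether each iteration starts from the previous model (as here) or from the initial model is an assumption.

```python
def migrate_model(initial_model, adapt_once, evaluate, e=5):
    """Run e adaptation iterations and return the most accurate model seen.

    adapt_once(model) -> model and evaluate(model) -> float are assumed,
    caller-supplied callables; this function only does the bookkeeping.
    """
    best_model, best_accuracy = None, float("-inf")
    model = initial_model
    for _ in range(e):
        model = adapt_once(model)
        accuracy = evaluate(model)
        if accuracy > best_accuracy:
            best_model, best_accuracy = model, accuracy
    return best_model, best_accuracy


if __name__ == "__main__":
    # Toy stand-ins: a "model" is just an iteration counter here.
    toy_accuracy = {1: 0.60, 2: 0.80, 3: 0.70, 4: 0.75, 5: 0.79}
    print(migrate_model(0, lambda m: m + 1, toy_accuracy.get))  # (2, 0.8)
```

Keeping the best of the E candidates, rather than the last one, guards against a later iteration that adapts worse than an earlier one.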
Step 13, repeating steps 8 to 12 with a period of C months, updating the detection model, and detecting Android malicious software.
In this embodiment, data from 2018, 2019 and 2020 are used as test data. The accuracy of the initial detection model is 96% on the 2018 data. On the 2019 data, the accuracy is 34% before steps 8 to 12 are executed and rises to 80% afterwards; on the 2020 data, the accuracy is 43% before steps 8 to 12 are executed and rises to 87% afterwards.
In summary, the invention introduces detection features from manual Android malicious software analysis reports, improves the traditional feature wrapper method by means of an automated machine learning algorithm and an interpretable machine learning algorithm, and combines an identical-distribution test with a transfer learning algorithm, thereby improving the interpretability of the Android malicious software detection model and reducing the influence of concept drift on the accuracy of the detection model.

Claims (10)

1. A concept drift-oriented interpretable Android malware detection method, characterized by comprising the following steps:
step 1, collecting a plurality of manual Android malware analysis reports to form a sample library of manual Android malware analysis reports;
step 2, collecting a plurality of malicious and benign Android application samples to form an initial Android application sample library, wherein the number of malicious samples equals the number of benign samples;
step 3, extracting high-frequency words of Android malware reverse analysis from the sample library of manual Android malware analysis reports, wherein the top-A ranked valid words are used as the feature types used by the detection model;
step 4, based on the initial Android application sample library, using an automated machine learning algorithm to construct a screening feature vector for each feature type used by the detection model and to train feature component screening models, wherein the number of feature component screening models is A;
step 5, for each feature component screening model, using an interpretable machine learning algorithm to calculate the mean absolute Shapley value of every component of the screening feature vector, wherein the top-B ranked components are used as a sub-feature vector used by the detection model;
step 6, merging all sub-feature vectors used by the detection model to form the features used by the detection model; extracting the data corresponding to the features used by the detection model from the initial Android application sample library to form an initial training data set;
step 7, training an initial detection model on the initial training data set using a tree-model-based machine learning algorithm, and outputting the features used by the detection model as a basis for manual verification of the detection model;
step 8, for Android software of unknown security, extracting the data corresponding to the features used by the detection model, inputting the data into the trained initial detection model, and detecting whether the application is Android malware;
step 9, collecting Android malware samples from mainstream domestic and foreign application markets and security websites using crawler technology to form a model migration malware sample library, wherein the interval between the release time of each malware sample and the collection date is not more than C months, and the number of malware samples is D;
step 10, extracting the data corresponding to the features used by the detection model from the model migration malware sample library to form a model migration data set;
step 11, calculating a test statistic from the model migration data set and the initial training data set using an identical-distribution test algorithm, and judging whether the Android malware exhibits concept drift;
step 12, if the Android malware exhibits concept drift, migrating the initial detection model using a transfer-learning domain adaptation algorithm, iterating E times, and training a new detection model to replace the initial detection model;
step 13, repeating steps 8-12 with a period of C months, updating the detection model, and detecting Android malware.
2. The concept drift-oriented interpretable Android malware detection method of claim 1, characterized in that: in step 3, the high-frequency words of Android malware reverse analysis are extracted using a word-frequency statistics algorithm, and the top-A ranked valid words are Android programming language keywords.
3. The concept drift-oriented interpretable Android malware detection method of claim 1, characterized in that step 4 comprises the following sub-steps:
4.1 projecting, from the initial Android application sample library, the data of one feature type used by the detection model;
4.2 if the feature type has already been projected, selecting a feature type that has not yet been projected from the feature types used by the detection model, and executing step 4.1;
4.3 if the feature type has not been projected, taking all distinct values of the feature contained in the projected data as the screening feature vector of the feature; constructing a feature component screening data set, wherein the feature component screening data set comprises the sample feature vectors of all samples;
4.4 inputting the feature component screening data set into the automated machine learning algorithm, and selecting the pipeline with the highest accuracy among the output pipelines as the feature component screening model of the feature;
4.5 if any feature type has not yet produced a feature component screening model, executing step 4.1.
4. The concept drift-oriented interpretable Android malware detection method of claim 3, characterized in that: in step 4.4, the automated machine learning algorithm is the TPOT automated machine learning algorithm, and the pipeline selected as having the highest accuracy among the output pipelines applies a tree-based machine learning model.
5. The concept drift-oriented interpretable Android malware detection method of claim 1, characterized in that: in step 5, the sum of the mean absolute Shapley values of the top-B ranked components is not less than F times the sum of the mean absolute Shapley values of the remaining components, and F is a positive integer not less than 4.
6. The concept drift-oriented interpretable Android malware detection method of claim 1, characterized in that: in step 5, the interpretable machine learning algorithm is the SHAP algorithm.
7. The concept drift-oriented interpretable Android malware detection method of claim 1, characterized in that: in steps 6, 8, and 10, extracting the data corresponding to the features used by the detection model comprises: using an Android reverse-engineering tool to match the features used by the detection model against the decompressed files of the Android application APK; if a feature used by the detection model appears in the decompressed APK files, recording its number of occurrences, and otherwise recording 0, thereby generating a sequence; appending 1 to the sequence for a malicious sample and 0 otherwise; and using the result as the sample feature vector of the detection model.
8. The concept drift-oriented interpretable Android malware detection method of claim 1, characterized in that: in step 7, the tree-model-based machine learning algorithm is the CatBoost algorithm.
9. The concept drift-oriented interpretable Android malware detection method of claim 1, characterized in that: in step 12, the domain adaptation algorithm is the JDA algorithm.
10. The concept drift-oriented interpretable Android malware detection method of claim 1, characterized in that: A is a positive integer not less than 4, B is a positive integer not less than 1, C is a positive integer not less than 1, D is a positive integer not less than 100, and E is a positive integer not more than 5.
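The sample feature vector recited in claim 7 (occurrence counts followed by a label) can be illustrated with the minimal sketch below. It assumes the text of the APK's decompressed files has already been obtained with a reverse-engineering tool and is passed in as plain strings; the function name, the substring-based matching, the example feature names, and the option to omit the label for samples of unknown security are all illustrative assumptions.

```python
def build_sample_vector(features, apk_texts, is_malicious=None):
    """Count occurrences of each feature across the decompressed APK texts.

    A feature that never appears is recorded as 0. When a label is known,
    1 (malicious) or 0 (benign) is appended after the counts.
    """
    counts = [sum(text.count(feature) for text in apk_texts) for feature in features]
    if is_malicious is not None:
        counts.append(1 if is_malicious else 0)
    return counts


# Hypothetical features and file contents, for illustration only.
features = ["SEND_SMS", "getDeviceId", "DexClassLoader"]
apk_texts = ["uses SEND_SMS; SEND_SMS", "call getDeviceId()"]
vector = build_sample_vector(features, apk_texts, is_malicious=True)
```

Because the same feature list is used in steps 6, 8, and 10, vectors built this way share one column order across the initial training data set and the model migration data set, which is what allows a per-feature distribution comparison in step 11.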
CN202111033119.5A 2021-09-03 2021-09-03 Concept drift-oriented interpretable Android malicious software detection method Active CN113901463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111033119.5A CN113901463B (en) 2021-09-03 2021-09-03 Concept drift-oriented interpretable Android malicious software detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111033119.5A CN113901463B (en) 2021-09-03 2021-09-03 Concept drift-oriented interpretable Android malicious software detection method

Publications (2)

Publication Number Publication Date
CN113901463A CN113901463A (en) 2022-01-07
CN113901463B true CN113901463B (en) 2023-06-30

Family

ID=79188640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111033119.5A Active CN113901463B (en) 2021-09-03 2021-09-03 Concept drift-oriented interpretable Android malicious software detection method

Country Status (1)

Country Link
CN (1) CN113901463B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI822388B (en) * 2022-10-12 2023-11-11 財團法人資訊工業策進會 Labeling method for information security protection detection rules and tactic, technique and procedure labeling device for the same
CN115795466B (en) * 2023-02-06 2023-06-20 广东省科技基础条件平台中心 Malicious software organization identification method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684840A (en) * 2018-12-20 2019-04-26 西安电子科技大学 Android malware detection method based on sensitive call paths
CN110519228A (en) * 2019-07-22 2019-11-29 中国科学院信息工程研究所 Method and system for identifying malicious cloud bots in underground-industry scenarios
CN111259393A (en) * 2020-01-14 2020-06-09 河南信息安全研究院有限公司 Anti-concept-drift method for malware detectors based on generative adversarial networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11210396B2 (en) * 2017-08-25 2021-12-28 Drexel University Light-weight behavioral malware detection for windows platforms
US20200034692A1 (en) * 2018-07-30 2020-01-30 National Chengchi University Machine learning system and method for coping with potential outliers and perfect learning in concept-drifting environment

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
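A Unified Approach to Interpreting Model Predictions; Scott M. Lundberg et al.; 31st Conference on Neural Information Processing Systems (NIPS 2017); pp. 1-10 *
InterDroid: An Interpretable Android Malware Detection Method for Concept Drift; 张炳 et al.; 《计算机研究与发展》 (Journal of Computer Research and Development); Vol. 58, No. 11; pp. 2456-2474 *
MAMADROID: Detecting Android Malware by Building Markov Chains of Behavioral Models; Enrico Mariconti et al.; Proceedings of the 24th Network and Distributed System Security Symposium (NDSS 2017); pp. 1-16 *
Automatic Classification Algorithm for Computer Vulnerabilities Based on S-C Feature Extraction; 任家东 et al.; 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology); Vol. 14, No. 07; pp. 1173-1182 *
Evaluation Model for Dual-Granularity Lightweight Vulnerability Code Slicing Methods; 张炳 et al.; 《通信学报》 (Journal on Communications); Vol. 42, No. 11; pp. 233-241 *
Credibility-Based Multi-Model Collaborative Detection Method for Android Malicious Code; 张永生 et al.; 广西师范大学学报(自然科学版) (Journal of Guangxi Normal University, Natural Science Edition); Vol. 38, No. 02; pp. 19-28 *
A Survey of Android Malware Detection Methods; 范铭 et al.; 中国科学:信息科学 (Scientia Sinica Informationis); Vol. 50, No. 08; pp. 1148-1177 *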

Also Published As

Publication number Publication date
CN113901463A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN106446691B (en) The method and apparatus for the open source projects loophole for integrating or customizing in inspection software
CN108268777B (en) Similarity detection method for carrying out unknown vulnerability discovery by using patch information
CN113901463B (en) Concept drift-oriented interpretable Android malicious software detection method
CN109933984B (en) Optimal clustering result screening method and device and electronic equipment
CN109271788B (en) Android malicious software detection method based on deep learning
CN109117164B (en) Micro-service updating method and system based on difference analysis of key elements
CN104123493A (en) Method and device for detecting safety performance of application program
Lomio et al. Just-in-time software vulnerability detection: Are we there yet?
CN105068921A (en) App comparative analysis based Android application store credibility evaluation method
CN111783016B (en) Website classification method, device and equipment
Kang et al. A secure-coding and vulnerability check system based on smart-fuzzing and exploit
Villanes et al. What are software engineers asking about android testing on stack overflow?
CN112035359A (en) Program testing method, program testing device, electronic equipment and storage medium
CN114491566B (en) Fuzzy test method and device based on code similarity and storage medium
CN109325353A (en) A kind of cluster leak analysis method for home router
CN113158251A (en) Application privacy disclosure detection method, system, terminal and medium
Tian et al. Enhancing vulnerability detection via AST decomposition and neural sub-tree encoding
CN113935041A (en) Vulnerability detection system and method for real-time operating system equipment
CN111625448B (en) Protocol packet generation method, device, equipment and storage medium
US7647581B2 (en) Evaluating java objects across different virtual machine vendors
US20060004810A1 (en) Method, system and product for determining standard java objects
CN114285587A (en) Domain name identification method and device and domain name classification model acquisition method and device
Kang A review on javascript engine vulnerability mining
CN113626823A (en) Reachability analysis-based inter-component interaction threat detection method and device
Guo et al. An investigation of quality issues in vulnerability detection datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant