CN114996701A - Android privacy disclosure detection method and system based on machine learning - Google Patents

Publication number
CN114996701A
Authority
CN
China
Prior art keywords
application software
android
android application
sensitive
privacy disclosure
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210482782.1A
Other languages
Chinese (zh)
Inventor
赵春蕾
步志亮
宫良一
王嬉
杨艺
李梅彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Application filed by Tianjin University of Technology
Priority to CN202210482782.1A
Publication of CN114996701A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities

Abstract

An Android privacy disclosure detection method based on machine learning performs static and dynamic analysis on Android application software, extracts the sensitive permission features and sensitive API features that are highly correlated with privacy disclosure, vectorizes the extracted features, and inputs them into a Stacking model for training. The two-layer structure of the Stacking model is optimized to obtain better parameters, so that Android application software with privacy disclosure risk can be detected effectively. The system implementing the method consists of a data set acquisition module, a key feature selection module, an Android application software feature extraction and preprocessing module, and a Stacking ensemble learning training module. Compared with traditional manual detection, the method improves detection efficiency, addresses the problem that static detection cannot obtain the source code once Android application software has been packed, and compensates for the low efficiency of dynamic detection.

Description

Android privacy disclosure detection method and system based on machine learning
[ technical field ]
The invention relates to the technical field of network security, and in particular to an Android privacy disclosure detection method and system based on machine learning.
[ background of the invention ]
Android is the most widely used operating system on mobile devices, and the popularity of Android phones has driven the development of Android application software. However, because the amount of software is huge and application markets lack a strict management system, a large amount of application software does not undergo sufficient security testing before release, so a great deal of malicious software enters the market. Users now store large amounts of sensitive information in the application software on their phones, and malicious software makes this information easy to leak, which threatens user privacy.
Detection methods for Android privacy disclosure are mainly divided into static analysis and dynamic analysis. Static analysis examines the source code and bytecode of the application software, but developers of malicious software can now apply packing and obfuscation, which lowers the detection rate of static analysis. Dynamic analysis generally adopts dynamic taint analysis: the software source code is not needed, and the software only has to be run so that dangerous behavior can be monitored in real time. This approach can effectively detect the leakage of sensitive information, but it has obvious drawbacks: the source code of the underlying Android system must be modified and then adjusted for each Android version, so it is not easy to deploy. In addition, traditional Android privacy disclosure detection is mostly performed on small samples, and a detection method for the large batches of software found in application markets is lacking. Android privacy disclosure detection therefore requires a more accurate feature extraction method and a better detection model.
[ summary of the invention ]
The invention aims to provide an Android privacy disclosure detection method and system based on machine learning. The method is simple and easy to implement, overcomes the defects of the prior art, and combines static analysis, dynamic execution and machine learning to detect whether application software has a risk of privacy disclosure, so that the privacy of Android users is better protected. The system has a simple structure and is easy to implement.
The technical scheme of the invention is as follows: an android privacy disclosure detection method based on machine learning is characterized by comprising the following steps:
(1) collecting android application software samples, and screening key features related to privacy disclosure;
the Android application software samples in step (1) consist of two parts: one part is Android application software collected from application stores with a crawler, and the other part consists of samples with privacy disclosure risk taken from the list of non-compliant Android applications published by the Ministry of Industry and Information Technology; 70% of the sample set is randomly assigned to the training data set and 30% to the test set.
The screening in step (1) specifically comprises: randomly extracting 30% of the collected Android application software samples, collecting their permission information, ranking the permissions from high to low by how frequently they are requested, comparing them with the 24 dangerous-level permissions declared in the official Android documentation, and determining the sensitive permission feature information; then selecting the Application Programming Interfaces (APIs) related to private data and the APIs related to the sensitive permission feature information to determine the sensitive API feature information; and taking the screened sensitive permission feature information and sensitive API feature information as the key features of the detection method.
(2) Extracting and processing the features screened in the step (1), and performing vectorization representation;
step (2) specifically comprises: extracting the key features of the Android application software; obtaining the permissions requested by the Android application software with AAPT (Android Asset Packaging Tool); dynamically installing and running the Android application software; monitoring it in real time with the Xposed framework and capturing and recording its sensitive API feature information; and vectorizing the captured sensitive permission feature information and sensitive API feature information with one-hot encoding.
Obtaining the permission information with the AAPT tool in step (2) specifically means: the permission list of the Android application software is obtained with the AAPT command "aapt d permissions".
Monitoring in real time with the Xposed framework and capturing and recording the sensitive API feature information of the Android application software in step (2) specifically means: a Hook module is written in Android Studio and a Log statement is added to it; whenever the Android application software calls a sensitive API, the module outputs the Log statement, which is recorded.
Vectorizing the captured sensitive permission feature information and sensitive API feature information with one-hot encoding in step (2) specifically means: the extracted sensitive permission feature information and sensitive API feature information of the Android application software are converted by one-hot vectorization, namely: if a sensitive permission is requested by the Android application software, it is represented as 1; if it is not requested, it is represented as 0. The converted feature information is arranged with the sensitive permission features first and the sensitive API features after them, forming a vector of size n × 1.
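The encoding described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name, the example feature lists and the example application are hypothetical, and encoding each sensitive API by whether it was observed at run time is an assumption made by analogy with the permission rule.

```python
def vectorize(requested_permissions, observed_apis,
              sensitive_permissions, sensitive_apis):
    """Return a 0/1 list: sensitive-permission flags first, then API flags."""
    requested = set(requested_permissions)
    observed = set(observed_apis)
    permission_bits = [1 if p in requested else 0 for p in sensitive_permissions]
    api_bits = [1 if a in observed else 0 for a in sensitive_apis]
    return permission_bits + api_bits


# Hypothetical feature lists; the real lists come from the screening in step (1).
SENSITIVE_PERMISSIONS = [
    "android.permission.READ_SMS",
    "android.permission.READ_PHONE_STATE",
    "android.permission.BLUETOOTH",
]
SENSITIVE_APIS = [
    "android.telephony.TelephonyManager.getDeviceId",
    "android.telephony.SmsManager.sendTextMessage",
]

if __name__ == "__main__":
    vector = vectorize(
        ["android.permission.INTERNET", "android.permission.READ_SMS"],
        ["android.telephony.TelephonyManager.getDeviceId"],
        SENSITIVE_PERMISSIONS,
        SENSITIVE_APIS,
    )
    print(vector)
```

Keeping the feature order fixed across all applications is what makes the vectors comparable, so the two feature lists should be frozen before any sample is encoded.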
The sensitive permission feature information comprises the permissions requested most frequently in benign Android application software and in Android application software with privacy disclosure risk, the dangerous permissions declared in the official Android documentation, and the permissions that overlap between the benign and the privacy-disclosure-risk application software.
The sensitive API feature information comprises the APIs corresponding to sensitive permissions, the APIs that can perform sensitive operations, and the APIs related to privacy disclosure.
Android application software with privacy disclosure risk is either Android application software with dangerous-level permission features or Android application software with normal-level permissions that can cause privacy disclosure. Android application software with dangerous-level permission features can leak the user's personal privacy data by acquiring mobile phone device information: it requests dangerous-level permissions and uses various concealment strategies to trick the user into granting permissions beyond those required by the software's functions, such as the device information permission and SMS-related permissions. Android application software with normal-level permissions that can cause privacy disclosure requests permissions that are normal-level but are frequently requested by malicious software and carry a privacy disclosure risk, such as the BLUETOOTH permission.
(3) Inputting the characteristic information vectorized according to the step (2) into a Stacking model for training, reducing the probability of overfitting by using a five-fold cross validation method, optimizing an android privacy disclosure detection model, and outputting the optimized model;
the Stacking model in the step (3) is composed of two layers of structures, the first layer is a base learner and is composed of three primary classifiers of Logistic Regression (LR), Naive Bayes (Naive Bayes, NB) and K-Nearest Neighbor (KNN), and the second layer is a combined learner and is composed of a Support Vector Machine (SVM).
The five-fold cross-validation method in step (3) specifically comprises the following steps:
(3-1) the initial training set train is trained with the three primary classifiers: train is divided into five subsets of similar size $(t_1, t_2, t_3, t_4, t_5)$, four of which are used as the training set $train_i$ $(0 < i \le 5)$ and the remaining one as the test set $test_y$ $(0 < y \le 5)$;
(3-2) the three primary classifiers of the first layer of the Stacking model are trained on the training set $train_i$ $(0 < i \le 5)$ to obtain the prediction results $(p_1, p_2, p_3, p_4, p_5)$, which also serve as the training set $train_m$ $(0 < m \le 5)$ of the secondary classifier; likewise, the three primary classifiers predict the test set $test_n$ $(0 < n \le 3)$ to obtain the results $T_1$, $T_2$, $T_3$; the prediction results for the training set and the test set are combined to obtain a new training set $train_3$ and a new test set $test_2$, as shown in formula (1);
[Formula (1): image BDA0003628239310000041, not reproduced in this text]
(3-3) step (3-2) is performed five times, with each of the five subsets divided in step (3-1) used in turn as the test set, finally generating five training sets and five test sets;
(3-4) the training sets and the test sets generated in step (3-2) are merged respectively and used as the input of the second-layer combined learner of the Stacking model;
(3-5) the second-layer combined learner is a support vector machine model; it is trained and optimized with the training set generated by the first layer and tested with the test set generated by the first layer, as shown in formula (2).
[Formula (2): image BDA0003628239310000042, not reproduced in this text]
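The five-fold procedure of steps (3-1) to (3-5) can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: `OneNN` and `Majority` are toy stand-ins for the LR, NB and KNN base learners, all function and variable names are hypothetical, and averaging the five test-set predictions of each learner is the common stacking convention, assumed here because formulas (1) and (2) are not available in this text.

```python
import random
from statistics import mean


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of similar size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


class OneNN:
    """Toy base learner: 1-nearest neighbour with Hamming distance."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def nearest(x):
            dists = [sum(a != b for a, b in zip(x, xi)) for xi in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest(x) for x in X]


class Majority:
    """Toy base learner: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = int(sum(y) * 2 >= len(y))
        return self

    def predict(self, X):
        return [self.label for _ in X]


def out_of_fold_features(learner_factories, X_train, y_train, X_test, k=5):
    """Build second-layer inputs: one column of predictions per base learner."""
    folds = k_fold_indices(len(X_train), k)
    meta_train = [[0] * len(learner_factories) for _ in X_train]
    meta_test = [[0.0] * len(learner_factories) for _ in X_test]
    for j, make in enumerate(learner_factories):
        test_preds = []
        for held_out in folds:
            held = set(held_out)
            fit_idx = [i for i in range(len(X_train)) if i not in held]
            model = make().fit([X_train[i] for i in fit_idx],
                               [y_train[i] for i in fit_idx])
            # Each training row is predicted by a model that never saw it.
            preds = model.predict([X_train[i] for i in held_out])
            for i, p in zip(held_out, preds):
                meta_train[i][j] = p
            test_preds.append(model.predict(X_test))
        # Assumption: average the k test-set predictions of this learner.
        for r in range(len(X_test)):
            meta_test[r][j] = mean([fold[r] for fold in test_preds])
    return meta_train, meta_test
```

In this sketch the second-layer learner (an SVM in the method described above) would be fitted on `meta_train` with the original training labels and evaluated on `meta_test`. Because every row of `meta_train` comes from a model that did not train on that row, the second layer does not see leaked labels, which is how the five-fold scheme reduces overfitting.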
(4) The sensitive permission feature information and sensitive API feature information of the Android application software are extracted according to step (2), and Android privacy disclosure detection is performed with the optimized model obtained in step (3) to judge whether the application software has a risk of privacy disclosure: a model output of "1" indicates that the Android application software has a risk of privacy disclosure, and a model output of "0" indicates that it does not.
An Android privacy disclosure detection system for implementing the method is characterized by comprising a data set acquisition module, a key feature selection module, an Android application software feature extraction and preprocessing module, and a Stacking ensemble learning training module. The data set acquisition module collects the required Android application software samples; the key feature selection module screens the permission and API features; the Android application software feature extraction and preprocessing module extracts and processes the information of the Android application software; the Stacking ensemble learning training module inputs the feature information processed by the feature extraction and preprocessing module into the model for training, optimizes the Android privacy disclosure detection model, outputs the optimized model, uses the optimized model to detect Android privacy disclosure, and judges whether the application software has a risk of privacy disclosure in order to verify the final effect.
The working principle of the invention is as follows: the Android privacy disclosure detection method based on machine learning first decompiles the Android application software and statically extracts the sensitive permission feature information. It then dynamically detects the sensitive API feature information while the Android application software is running, processes and vectorizes the resulting data, trains a Stacking ensemble learning model, and optimizes it to obtain a better model structure, so that Android application software with privacy disclosure risk is detected effectively.
The invention has the following advantages: compared with traditional manual detection, the Android privacy disclosure detection method based on machine learning improves detection efficiency, addresses the problem that static detection cannot obtain the source code once Android application software has been packed, and compensates for the low efficiency of dynamic detection.
[ description of the drawings ]
Fig. 1 is a flowchart of an android privacy disclosure detection method based on machine learning according to the present invention.
Fig. 2 is a flow chart of feature extraction in the android privacy disclosure detection method based on machine learning according to the present invention.
Fig. 3 is a schematic diagram of a Stacking algorithm in the android privacy disclosure detection method based on machine learning according to the present invention.
[ detailed description ]
Embodiment: an Android privacy disclosure detection method based on machine learning, characterized by comprising the following steps:
(1) collecting android application software samples, and screening key features related to privacy disclosure;
the Android application software samples consist of two parts: one part is Android application software collected from application stores with a crawler, and the other part consists of samples with privacy disclosure risk taken from the list of non-compliant Android applications published by the Ministry of Industry and Information Technology; 70% of the sample set is randomly assigned to the training data set and 30% to the test set.
30% of the collected Android application software samples are randomly extracted, their permission information is collected, the permissions are ranked from high to low by how frequently they are requested and compared with the 24 dangerous-level permissions declared in the official Android documentation, and the sensitive permission feature information is determined; the APIs related to private data and the APIs related to the sensitive permission feature information are then selected to determine the sensitive API feature information; the screened sensitive permission feature information and sensitive API feature information are taken as the key features of the detection method.
(2) Extracting and processing the features screened in step (1), and vectorizing them: the key features of the Android application software are extracted, and the permissions requested by the Android application software are obtained with the AAPT tool, namely with the AAPT command "aapt d permissions"; the Android application software is then dynamically installed and run, monitored in real time with the Xposed framework, and its sensitive API feature information is captured and recorded; finally, the captured sensitive permission feature information and sensitive API feature information are vectorized with one-hot encoding.
Monitoring in real time with the Xposed framework and capturing and recording the sensitive API feature information of the Android application software specifically means: a Hook module is written in Android Studio and a Log statement is added to it; whenever the Android application software calls a sensitive API, the module outputs the Log statement, which is recorded.
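How the recorded Log output might be turned into per-application API sets can be sketched as follows. The log format is entirely an assumption: the source does not specify what the Hook module prints, so the tag `SensitiveAPI`, the "package then API name" layout, and the function name are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line layout: "<level>/SensitiveAPI(<pid>): <package> <class.method>"
LOG_PATTERN = re.compile(r"SensitiveAPI[^:]*:\s*(?P<package>\S+)\s+(?P<api>\S+)")


def collect_api_calls(log_lines):
    """Map each package name to the set of sensitive APIs it was seen calling."""
    calls = defaultdict(set)
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match:  # lines from other tags are ignored
            calls[match.group("package")].add(match.group("api"))
    return dict(calls)


if __name__ == "__main__":
    sample_log = [
        "I/SensitiveAPI( 1234): com.example.app android.telephony.TelephonyManager.getDeviceId",
        "D/OtherTag( 1234): unrelated message",
        "I/SensitiveAPI( 1234): com.example.app android.telephony.SmsManager.sendTextMessage",
    ]
    print(collect_api_calls(sample_log))
```

Collecting the calls as a set per package discards call counts, which matches the 0/1 encoding used later; a frequency-based feature would need a counter instead.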
Vectorizing the captured sensitive permission feature information and sensitive API feature information with one-hot encoding specifically means: the extracted sensitive permission feature information and sensitive API feature information of the Android application software are converted by one-hot vectorization, namely: if a sensitive permission is requested by the Android application software, it is represented as 1; if it is not requested, it is represented as 0. The converted feature information is arranged with the sensitive permission features first and the sensitive API features after them, forming a vector of size n × 1.
The sensitive permission feature information comprises the permissions requested most frequently in benign Android application software and in Android application software with privacy disclosure risk, the dangerous permissions declared in the official Android documentation, and the permissions that overlap between the benign and the privacy-disclosure-risk application software.
The sensitive API feature information comprises the APIs corresponding to sensitive permissions, the APIs that can perform sensitive operations, and the APIs related to privacy disclosure.
Android application software with privacy disclosure risk is either Android application software with dangerous-level permission features or Android application software with normal-level permissions that can cause privacy disclosure. Android application software with dangerous-level permission features can leak the user's personal privacy data by acquiring mobile phone device information: it requests dangerous-level permissions and uses various concealment strategies to trick the user into granting permissions beyond those required by the software's functions, such as the device information permission and SMS-related permissions; because device information involves the user's personal privacy data, these permissions are dangerous-level and can cause privacy disclosure. Android application software with normal-level permissions that can cause privacy disclosure requests permissions that are normal-level but are frequently requested by malicious software and carry a privacy disclosure risk, such as the BLUETOOTH permission.
(3) Inputting the feature information vectorized in step (2) into a Stacking model for training, reducing the probability of overfitting with five-fold cross-validation, optimizing the Android privacy disclosure detection model, and outputting the optimized model;
the Stacking model has a two-layer structure: the first layer is the base learner, composed of three primary classifiers (logistic regression, naive Bayes and K-nearest neighbor), and the second layer is the combined learner, composed of a support vector machine.
The five-fold cross-validation method specifically comprises the following steps:
(3-1) the initial training set train is trained with the three primary classifiers: train is divided into five subsets of similar size $(t_1, t_2, t_3, t_4, t_5)$, four of which are used as the training set $train_i$ $(0 < i \le 5)$ and the remaining one as the test set $test_y$ $(0 < y \le 5)$;
(3-2) the three primary classifiers of the first layer of the Stacking model are trained on the training set $train_i$ $(0 < i \le 5)$ to obtain the prediction results $(p_1, p_2, p_3, p_4, p_5)$, which also serve as the training set $train_m$ $(0 < m \le 5)$ of the secondary classifier; likewise, the three primary classifiers predict the test set $test_n$ $(0 < n \le 3)$ to obtain the results $T_1$, $T_2$, $T_3$; the prediction results for the training set and the test set are combined to obtain a new training set $train_3$ and a new test set $test_2$, as shown in formula (1);
[Formula (1): image BDA0003628239310000071, not reproduced in this text]
(3-3) step (3-2) is performed five times, with each of the five subsets divided in step (3-1) used in turn as the test set, finally generating five training sets and five test sets;
(3-4) the training sets and the test sets generated in step (3-2) are merged respectively and used as the input of the second-layer combined learner of the Stacking model;
(3-5) the second-layer combined learner is a support vector machine model; it is trained and optimized with the training set generated by the first layer and tested with the test set generated by the first layer, as shown in formula (2).
[Formula (2): image BDA0003628239310000072, not reproduced in this text]
(4) The sensitive permission feature information and sensitive API feature information of the Android application software are extracted according to step (2), and Android privacy disclosure detection is performed with the optimized model obtained in step (3) to judge whether the application software has a risk of privacy disclosure: a model output of "1" indicates that the Android application software has a risk of privacy disclosure, and a model output of "0" indicates that it does not.
An Android privacy disclosure detection system for implementing the method is characterized by comprising a data set acquisition module, a key feature selection module, an Android application software feature extraction and preprocessing module, and a Stacking ensemble learning training module. The data set acquisition module collects the required Android application software samples; the key feature selection module screens the permission and API features; the Android application software feature extraction and preprocessing module extracts and processes the information of the Android application software; the Stacking ensemble learning training module inputs the feature information processed by the feature extraction and preprocessing module into the model for training, optimizes the Android privacy disclosure detection model, outputs the optimized model, uses the optimized model to detect Android privacy disclosure, and judges whether the application software has a risk of privacy disclosure in order to verify the final effect.
In order to make the objects, contents and advantages of the present invention clearer, the following description of the specific embodiments of the present invention will be further explained with reference to the drawings and examples.
The invention provides an Android privacy disclosure detection method based on machine learning, as shown in figures 1-3. After the Android application software samples are obtained, static analysis is performed on the Android application software and the sensitive permission feature information is extracted; the Android application software is then installed in a modified emulator, executed dynamically, and tested so as to trigger various input events. The Xposed framework monitors the Android application software in real time, captures and records the sensitive API feature information, and the feature information is vectorized. The model is trained with the Stacking ensemble algorithm and used to detect privacy disclosure in Android application software, giving the final result.
In the invention, benign samples are collected from public application markets with a crawler, and the samples with privacy disclosure risk come from the list of non-compliant Android applications published by the Ministry of Industry and Information Technology; 70% of the data set is randomly assigned to the training data set and 30% to the test set. The working parts of the Android privacy disclosure detection method based on machine learning are explained using this data set as an example.
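The random 70%/30% division described above can be sketched as follows. This is a minimal illustration: the function name, the fixed seed and the sample file names are hypothetical, and the source does not say whether the split is stratified by class, so a plain shuffle is assumed.

```python
import random


def split_samples(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples and cut them into a training part and a test part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    apks = ["app%02d.apk" % i for i in range(10)]  # hypothetical sample names
    train, test = split_samples(apks)
    print(len(train), len(test))
```

Fixing the seed is a design choice for reproducibility only; any seed yields a valid random split.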
The first part screens the feature information related to privacy disclosure. Android application software generally requests many permissions to ensure its normal operation. If all permission features were input into the training model, the model might train poorly or overfit, so the key features must be screened before being input into the model in order to achieve a better classification result.
Google designed a series of permission labels for the Android system that relate to security operations and private information. If an application needs to access the user's private data or perform other security-related operations, the developer must declare the corresponding permissions in its manifest file, AndroidManifest.xml.
The official Android documentation provides 183 permissions and declares 24 of them to be dangerous-level permissions. Android application software with privacy disclosure risk may use various concealment strategies to trick users into granting permissions beyond those required by the program's functions. Statistical analysis shows that many Android applications with privacy disclosure risk request dangerous-level permissions, such as the device information permission and SMS-related permissions; because device information involves the user's personal privacy data, these dangerous-level permissions can cause privacy disclosure. Normal-level permissions can also cause privacy disclosure: statistical analysis finds that although BLUETOOTH is a normal-level permission, a large amount of malicious software still requests it, and software with privacy disclosure risk can use Bluetooth to transmit sensitive information, exposing the user to privacy disclosure. Therefore, the method selects the permissions requested most frequently in benign applications and in privacy-disclosure-risk applications as one part of the key features, compares them with the 24 dangerous permissions declared in the official Android documentation, and takes the union of the two parts as the sensitive permission feature information of the method.
Next, the sensitive API feature information is determined. The Android system provides tens of thousands of APIs; compared with permissions, APIs reflect the behaviors of Android application software at a finer granularity and with higher accuracy. Because all behaviors of Android application software are implemented through APIs, sensitive API feature information can serve as a basis for judging whether application software is malicious. Monitoring every existing Android API would make the feature dimension huge and greatly reduce detection efficiency, so the method screens the APIs. The selected APIs fall into three categories:
(1) APIs associated with sensitive permissions. To protect the user's sensitive data, the Android system includes a permission mechanism: application software that wants to obtain sensitive information or perform sensitive operations must request the corresponding permission. The APIs associated with the sensitive permissions are extracted.
(2) APIs related to privacy disclosure. The scope of private data is defined according to the personal information security specifications issued by the Ministry of Industry and Information Technology, and the APIs most relevant to private data are selected.
(3) APIs associated with sensitive operations. In addition to the two categories above, the method summarizes the following three situations that can cause sensitive information to leak:
Intent privacy leakage: Android application software with privacy disclosure risk can use Intent components to communicate and capture sensitive information, causing private data to leak.
Reading system files through I/O: when executing a system event, Android application software generates a log file in which a lot of sensitive data is stored. Malicious software can access log files through an API, eventually causing privacy disclosure.
Dynamic code loading: some Android application software loads malicious code dynamically at run time, and privacy disclosure behavior can be hidden in the dynamically loaded external code, creating a risk of privacy disclosure.
And a second part, extracting characteristic information of the android application software and preprocessing.
The key feature information of the Android application software is extracted; the overall process is shown in FIG. 2. First, the sensitive permission feature information is extracted: the AAPT tool is used to obtain the list of permissions requested by the application. The package name is obtained with the aapt dump badging command, and the permission information is obtained with the aapt d permissions command.
The Android application software is installed automatically from the cmd command line and tested automatically with the Monkey tool. The method uses the Xposed framework to hook the API functions that the application calls at runtime. The framework can replace function interfaces non-invasively and can insert code before or after a function call, which makes dynamic analysis and detection possible.
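As a rough illustration of the permission extraction step, the following Python sketch wraps the two aapt invocations and parses their text output. The helper names are hypothetical, and the assumed output shapes (a `package: name='...'` line for badging, and `uses-permission:` lines with or without a `name='...'` wrapper for permissions) may vary between aapt versions.

```python
import re
import subprocess


def run_aapt(args):
    """Run aapt with the given arguments and return its standard output."""
    result = subprocess.run(["aapt", *args], capture_output=True, text=True, check=True)
    return result.stdout


def parse_package_name(badging_output):
    """Extract the package name from `aapt dump badging` output."""
    match = re.search(r"^package: name='([^']+)'", badging_output, re.MULTILINE)
    return match.group(1) if match else None


def parse_permissions(permissions_output):
    """Extract permission names from `aapt d permissions` output.

    Accepts both `uses-permission: name='X'` and `uses-permission: X` lines.
    """
    pattern = re.compile(
        r"^uses-permission(?:-sdk-\d+)?: (?:name=')?([\w.]+)", re.MULTILINE
    )
    return sorted(set(pattern.findall(permissions_output)))


def extract_apk_info(apk_path):
    """Return (package name, permission list) for one APK; requires aapt on PATH."""
    badging = run_aapt(["dump", "badging", apk_path])
    permissions = run_aapt(["d", "permissions", apk_path])
    return parse_package_name(badging), parse_permissions(permissions)
```

Keeping the parsing separate from the subprocess call makes it possible to check the parsers against saved aapt output without an Android SDK installed.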
First, Android Studio is used to write a Hook module for the selected sensitive APIs, and Log statements are added to the module. The applications in the sample library are then tested: whenever an application calls a sensitive API, the module outputs a Log statement, which records the behaviors produced while the application runs. Finally, the application is uninstalled.
Some malicious software checks its running environment and exits immediately if it detects an emulator. At present, such software mainly checks information about the device it runs on, for example whether the device has an IMEI and whether its architecture is ARM. The method uses the Hook technique of the Xposed framework to fill in the relevant device information, so that the software's check of the running environment is evaded. Because some Android applications also check whether the Xposed framework is present in the running environment, the method obfuscates the Xposed-related libraries and modifies the return values of certain PackageManager interfaces in order to hide Xposed.
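The install, test, record, and uninstall cycle might be driven from a script along the following lines. This is only a sketch: the log tag `SensitiveApiHook`, the event count, and the assumed log line shape (`... TAG: <fully qualified API name>`) are hypothetical choices for the example, since the text does not specify what the Hook module writes.

```python
import re

HOOK_TAG = "SensitiveApiHook"  # hypothetical tag used by the Hook module's Log statements


def build_test_commands(apk_path, package, events=500):
    """Return adb command lines for install, Monkey testing, and uninstall."""
    return [
        ["adb", "install", "-r", apk_path],
        ["adb", "shell", "monkey", "-p", package, "-v", str(events)],
        ["adb", "uninstall", package],
    ]


def parse_hook_log(log_text, tag=HOOK_TAG):
    """Collect API names from log lines of the assumed form '... TAG: <api name>'.

    Each API is reported once, in order of first appearance.
    """
    pattern = re.compile(re.escape(tag) + r": (\S+)")
    called = []
    for line in log_text.splitlines():
        match = pattern.search(line)
        if match and match.group(1) not in called:
            called.append(match.group(1))
    return called
```

Each command list can be passed to `subprocess.run`, and the text captured from the device log is then reduced by `parse_hook_log` to the set of sensitive APIs the application actually called.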
Part three: building and implementing the Stacking ensemble learning detection model. The method adopts the Stacking algorithm, a form of heterogeneous ensemble learning. The algorithm combines several types of learners to improve the generalization performance of the classifier. Compared with a single classifier or homogeneous ensemble learning, the Stacking classification method achieves higher accuracy.
In Stacking training, the secondary training set is generated by the primary learners. To reduce the probability of overfitting, five-fold cross-validation is used. The initial training set is divided into five subsets of similar size; one subset is held out as the test set and the other four form the training set. The three types of base learners are trained on the training set of the fold, producing one part of the secondary training set, and are used to predict the test set of the fold, producing one part of the secondary test set. These steps are repeated until five training sets and five test sets have been generated. The parts are then merged to form the training set and the test set of the second-layer combined learner. A support vector machine model is built in the second layer; it is trained and optimized on the training set generated by the first layer and makes predictions on the test set generated by the first layer.
For the Stacking method, the classification performance depends on the choice of classifiers: both the accuracy and the diversity of the primary classifiers must be ensured. Based on these considerations, the method selects the Naive Bayes (NB), Logistic Regression (LR), and K-nearest neighbor (KNN) algorithms as primary classifiers and the Support Vector Machine (SVM) algorithm as the secondary classifier. The overall model is shown in FIG. 3.
The specific implementation process of the model is as follows:
(1) If the Stacking model directly predicted the label column of the training set it was trained on, overfitting would result, so five-fold cross-validation is used to solve this problem. The training set train is trained using the three primary classifiers (NB, LR, KNN). The overall training set train is divided into five parts $(t_1, t_2, t_3, t_4, t_5)$; four of them are used as the training set $train_i$ ($0 < i \le 5$), and the remaining part is used as the test set $test_y$ ($0 < y \le 5$). The three primary classifiers (NB, LR, KNN) are applied to $train_i$ ($0 < i \le 5$) to obtain the prediction results $(p_1, p_2, p_3, p_4, p_5)$, which also serve as the training set $train_m$ ($0 < m \le 5$) of the secondary classifier. The test set $test_n$ ($0 < n \le 3$) is predicted to obtain the results $T_1$, $T_2$, $T_3$. The results are combined to obtain a new training set $train_3$ and test set $test_2$:
[Formula (1): image not reproduced]
(2) The secondary classifier, a support vector machine, is then trained on $train_3$ and used to predict the test set $test_2$ to obtain the final result:
[Formula (2): image not reproduced]
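The construction of the secondary data sets can be illustrated with a small pure-Python sketch. It is not the patented model: the two toy learners below (a majority-class predictor and a 1-nearest-neighbor predictor under Hamming distance) merely stand in for NB, LR, and KNN, the second-layer SVM is omitted, and averaging the fold models' test predictions is an assumed convention, because the combining formula is not reproduced in the text. All function names are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of similar size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_features(learners, X_train, y_train, X_test, k=5):
    """Build secondary training and test features from the primary learners.

    Each learner is a function fit(X, y) that returns predict(x) -> 0 or 1.
    """
    folds = k_fold_indices(len(X_train), k)
    meta_train = [[0] * len(learners) for _ in X_train]
    meta_test = [[0.0] * len(learners) for _ in X_test]
    for j, fit in enumerate(learners):
        for held_out in folds:
            held = set(held_out)
            fit_X = [x for i, x in enumerate(X_train) if i not in held]
            fit_y = [y for i, y in enumerate(y_train) if i not in held]
            predict = fit(fit_X, fit_y)
            for i in held_out:
                # Each training sample is predicted by a model that never saw it.
                meta_train[i][j] = predict(X_train[i])
            for t, x in enumerate(X_test):
                # Assumption: average the k fold models' test predictions.
                meta_test[t][j] += predict(x) / k
    return meta_train, meta_test


def fit_majority(X, y):
    """Toy stand-in learner: always predicts the most common training label."""
    label = 1 if sum(y) * 2 >= len(y) else 0
    return lambda x: label


def fit_1nn(X, y):
    """Toy stand-in learner: 1-nearest neighbor under Hamming distance."""
    def predict(x):
        dists = [sum(a != b for a, b in zip(x, row)) for row in X]
        return y[dists.index(min(dists))]
    return predict


# Hypothetical one-hot feature vectors; the label equals the first feature.
X_train = [[1, 0], [0, 0], [1, 1], [0, 1], [1, 0],
           [0, 0], [1, 1], [0, 1], [1, 0], [0, 1]]
y_train = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
X_test = [[1, 1], [0, 0]]
meta_train, meta_test = out_of_fold_features(
    [fit_majority, fit_1nn], X_train, y_train, X_test, k=5
)
```

Here `meta_train` has one column per primary learner and one row per training sample, so a second-layer classifier can be fitted on it together with `y_train` and then applied to `meta_test`.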
Part four: detecting Android privacy disclosure with the optimized model. First, the permission feature information and API feature information of the Android application software are extracted and compared with the sensitive permission feature information and sensitive API feature information. The result is represented as a vector and input into the optimized model, which judges whether the application has a privacy disclosure risk. A detection result of '1' indicates that the Android application software has a privacy disclosure risk; a detection result of '0' indicates that it does not.
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make modifications and variations without departing from the technical principle of the present invention, and such modifications and variations shall also fall within the protection scope of the present invention.
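The comparison and vectorization step can be sketched as follows. The feature lists are short hypothetical examples, and placing permission features before API features is an assumption about the ordering; the resulting 0/1 vector is what would be passed to the trained model.

```python
# Hypothetical screened feature lists; the real lists come from feature screening.
SENSITIVE_PERMISSIONS = [
    "android.permission.READ_SMS",
    "android.permission.BLUETOOTH",
]
SENSITIVE_APIS = [
    "android.telephony.TelephonyManager.getDeviceId",
    "dalvik.system.DexClassLoader.loadClass",
]


def vectorize(app_permissions, app_apis, sensitive_permissions, sensitive_apis):
    """Return a 0/1 list: sensitive permission features first, then sensitive APIs."""
    perms, apis = set(app_permissions), set(app_apis)
    return ([1 if p in perms else 0 for p in sensitive_permissions]
            + [1 if a in apis else 0 for a in sensitive_apis])


vector = vectorize(
    ["android.permission.BLUETOOTH", "android.permission.INTERNET"],
    ["android.telephony.TelephonyManager.getDeviceId"],
    SENSITIVE_PERMISSIONS,
    SENSITIVE_APIS,
)
```

Permissions or APIs that are not in the screened lists, such as INTERNET in this example, do not contribute to the vector, which keeps the feature dimension fixed across applications.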

Claims (9)

1. An android privacy disclosure detection method based on machine learning is characterized by comprising the following steps:
(1) collecting android application software samples, and screening key features related to privacy disclosure;
(2) extracting and processing the features screened in the step (1), and performing vectorization representation;
(3) inputting the characteristic information vectorized according to the step (2) into a Stacking model for training, reducing the probability of overfitting by using a five-fold cross validation method, optimizing an android privacy disclosure detection model, and outputting the optimized model;
(4) extracting, according to step (2), the sensitive permission feature information and sensitive API feature information of the Android application software, performing Android privacy disclosure detection with the optimized model obtained in step (3), and judging whether the application software has a privacy disclosure risk: a model detection result of '1' indicates that the Android application software has a privacy disclosure risk, and a model detection result of '0' indicates that it does not.
2. The machine learning-based Android privacy disclosure detection method according to claim 1, characterized in that in step (1) the Android application software samples consist of two parts: one part is Android application software obtained from application stores with a crawler, and the other part is samples with a privacy disclosure risk taken from the lists of non-compliant Android application software published by the Ministry of Industry and Information Technology; the sample set is randomly divided into a training set (70%) and a test set (30%).
3. The machine learning-based Android privacy disclosure detection method according to claim 1, wherein the screening in step (1) specifically refers to: randomly extracting 30% of the collected Android application software samples, collecting their permission information, sorting the permissions from high to low by frequency, comparing them with the 24 dangerous-level permissions officially declared by Android, and determining the sensitive permission feature information; selecting the APIs related to private data and the APIs related to the sensitive permission feature information, and determining the sensitive API feature information; and taking the screened sensitive permission feature information and sensitive API feature information as the key features of the detection method.
4. The machine learning-based Android privacy disclosure detection method according to claim 1, wherein step (2) specifically refers to: extracting the key features of the Android application software; obtaining the permission information requested by the Android application software with the AAPT tool; dynamically installing and running the Android application software; monitoring it in real time with the Xposed framework, intercepting and recording its sensitive API feature information; and vectorizing the obtained sensitive permission feature information and sensitive API feature information with the One-Hot encoding method.
5. The machine learning-based Android privacy disclosure detection method according to claim 4, wherein obtaining the permission information with the AAPT tool specifically means: obtaining the permission list of the Android application software with the aapt d permissions command of the AAPT tool;
monitoring in real time with the Xposed framework and intercepting and recording the sensitive API feature information of the Android application software specifically means: writing a Hook module with Android Studio and adding Log statements to the module, so that the module outputs a Log statement whenever the Android application software calls a sensitive API, and the output is recorded;
vectorizing the obtained sensitive permission feature information and sensitive API feature information with the One-Hot encoding method specifically means: performing One-Hot conversion on the extracted sensitive permission feature information and sensitive API feature information of the Android application software, namely: if a sensitive permission is requested by the Android application software, it is represented as 1, and if it is not requested, it is represented as 0; the converted feature information is arranged with the sensitive permission features first and the sensitive API features after them, forming a sequence of size n × 1.
6. The machine learning-based Android privacy disclosure detection method according to claim 5, wherein the sensitive permission feature information comprises the permission features requested frequently by benign Android application software and by privacy-disclosure-risk Android application software, the dangerous permission features officially declared by Android, and the permission features common to benign and privacy-disclosure-risk Android application software;
the sensitive API feature information comprises the APIs corresponding to sensitive permissions, the APIs that can perform sensitive operations, and the APIs related to privacy disclosure;
the privacy-disclosure-risk Android application software is Android application software that has dangerous-level permission features or Android application software whose normal-level permissions can cause privacy disclosure; the Android application software with dangerous-level permission features is software that requests dangerous-level permissions and can leak the user's personal private data by obtaining mobile device information; the Android application software whose normal-level permissions can cause privacy disclosure is malicious software that requests normal-level permissions and thereby poses a privacy disclosure risk.
7. The machine learning-based android privacy disclosure detection method as claimed in claim 1, characterized in that the Stacking model in step (3) is composed of two layers of structures, the first layer is a base learner and is composed of three primary classifiers of logistic regression, naive bayes and K-nearest neighbor algorithm, and the second layer is a combined learner and is composed of a support vector machine algorithm.
8. The machine learning-based Android privacy disclosure detection method according to claim 1, wherein the five-fold cross-validation method in step (3) specifically refers to:
(3-1) training on the initial training set train using three primary classifiers, the initial training set train being divided into five sets of similar size $(t_1, t_2, t_3, t_4, t_5)$, four of which are used as the training set $train_i$ ($0 < i \le 5$) and the remaining one as the test set $test_y$ ($0 < y \le 5$);
(3-2) applying the three primary classifiers of the first layer of the Stacking model to the training set $train_i$ ($0 < i \le 5$) to obtain the prediction results $(p_1, p_2, p_3, p_4, p_5)$, which also serve as the training set $train_m$ ($0 < m \le 5$) of the secondary classifier; likewise, predicting the test set $test_n$ ($0 < n \le 3$) with the three primary classifiers to obtain the results $T_1$, $T_2$, $T_3$; and combining the results predicted for the training set and the test set to obtain a new training set $train_3$ and test set $test_2$, the combination being shown in formula (1);
[Formula (1): image not reproduced]
(3-3) performing step (3-2) five times, using the five sets divided in step (3-1) one by one as the test set, and finally generating five training sets and five test sets;
(3-4) merging the training sets and the test sets generated in step (3-2) and using them respectively as the training set and the test set of the second-layer combined learner of the Stacking model;
(3-5) the second-layer combined learner is a support vector machine model, which is trained and optimized using the training set generated by the first layer and makes predictions using the test set generated by the first layer, as shown in formula (2).
[Formula (2): image not reproduced]
9. An Android privacy disclosure detection system for implementing the method, characterized by comprising a data set acquisition module, a key feature selection module, an Android application software feature extraction and preprocessing module, and a Stacking ensemble learning training module; the data set acquisition module is used for acquiring the required Android application software samples; the key feature selection module is used for screening the permission and API features; the Android application software feature extraction and preprocessing module is used for extracting and processing the feature information of the Android application software; the Stacking ensemble learning training module inputs the feature information processed by the feature extraction and preprocessing module into the model for training, optimizes the Android privacy disclosure detection model, outputs the optimized model, performs Android privacy disclosure detection with the optimized model, and judges whether the application software has a privacy disclosure risk so as to verify the final effect.
CN202210482782.1A 2022-05-05 2022-05-05 Android privacy disclosure detection method and system based on machine learning Pending CN114996701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210482782.1A CN114996701A (en) 2022-05-05 2022-05-05 Android privacy disclosure detection method and system based on machine learning

Publications (1)

Publication Number Publication Date
CN114996701A true CN114996701A (en) 2022-09-02

Family

ID=83025935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210482782.1A Pending CN114996701A (en) 2022-05-05 2022-05-05 Android privacy disclosure detection method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN114996701A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168887A (en) * 2022-09-06 2022-10-11 南京熊猫电子股份有限公司 Mobile terminal stealth processing method and device based on differential authority privacy protection
CN115408702A (en) * 2022-11-01 2022-11-29 浙江城云数字科技有限公司 Stacking interface operation risk level evaluation method and application thereof
CN115408702B (en) * 2022-11-01 2023-02-14 浙江城云数字科技有限公司 Stacking interface operation risk grade evaluation method and application thereof
CN117421730A (en) * 2023-09-11 2024-01-19 暨南大学 Code segment sensitive information detection method based on ensemble learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination