CN113076538B - Method for extracting embedded privacy policy of mobile application APK file - Google Patents

Publication number
CN113076538B
Authority
CN
China
Prior art keywords
privacy policy
page
apk file
extracting
mobile application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110359392.0A
Other languages
Chinese (zh)
Other versions
CN113076538A (en
Inventor
郭燕慧
徐国爱
徐国胜
张淼
王皓月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202110359392.0A priority Critical patent/CN113076538B/en
Publication of CN113076538A publication Critical patent/CN113076538A/en
Application granted granted Critical
Publication of CN113076538B publication Critical patent/CN113076538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Virology (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for extracting the privacy policy embedded in a mobile application APK file, which belongs to the field of analysis and detection of application software for Android mobile terminals. The method specifically comprises the following steps: first, the APK file to be tested is decompiled and rule matching is applied to obtain all URL links; the webpage content of each link is crawled, and feature words are extracted from the page text. In parallel, feature words are collected from a plurality of webpages and a binary classification model is trained in advance. The feature words of each page of the APK file to be tested are input into the trained binary classification model one by one, and the output is used to judge whether a privacy policy page exists; if so, the privacy policy is output and the method ends. Otherwise, an automated dynamic test is carried out: the corresponding URL links are extracted by monitoring the request addresses in the traffic, the content of each page is crawled to extract feature words, and the feature words are input into the binary classification model for judgment, until a privacy policy page is found or the set traversal depth is exceeded. By combining dynamic and static testing, the invention improves both the efficiency and the success rate of privacy policy extraction.

Description

Method for extracting embedded privacy policy of mobile application APK file
Technical Field
The invention belongs to the field of analysis and detection of application software for Android mobile terminals, and relates to a method for extracting the privacy policy embedded in the APK (Android package) file of a mobile application.
Background
Static analysis is a technique in which a program file is scanned, without being run, by means such as lexical analysis or syntactic analysis to generate decompiled code, which is then read to understand the program's functions. It is essentially static text analysis and therefore has high analysis efficiency.
Common static decompilation tools include apktool, baksmali, dex2jar and the like. apktool is the most commonly used decompilation tool for static analysis; it is written in Java, can decompile and recompile an APK file, and also provides functions such as installing a specific framework-res framework and cleaning up the previous decompilation folder.
After a decompilation tool has been used to decompile the sample to be tested, the code must be statically analyzed. The static analysis used is data flow analysis, which constructs a data flow graph among Android system components through semantic analysis of the code and analyzes the data flows among the components. This is currently the most mature and widely applied Android static analysis technique at home and abroad. For example, the well-known FlowDroid is a typical data flow analysis tool developed on top of the Java code analysis tool Soot; it builds a data flow graph by generating a dummy main function in the program so that researchers can analyze the data flow of a sample.
Automated testing of Android applications at home and abroad is developing toward tool-based and framework-based approaches, and can be divided into three development modes: modular testing, separation of test data from test cases, and keyword-driven testing. The current mainstream Android automated testing tools are as follows:
The Android Debug Bridge, ADB for short, is a set of debugging tools provided by Google for developers. It can read various parameters of the Android system and manage the details of an emulator or a physical device, and it is powerful and highly compatible. The ADB tool adopts a client/server (C/S) model and can be roughly divided into three modules: (1) the ADB client, which receives and executes the ADB commands sent by the developer, can also be called by other debugging tools such as DDMS, and can manage several test devices at the same time; (2) the ADB daemon, which runs in the background of the test device, such as a mobile phone or an emulator; (3) the ADB server, which runs in the background of the developer's computer and manages the other two parts of the ADB tool, namely the ADB client and the ADB daemon.
When Android 4.1 (API 16) was released, the Android development team also launched an excellent UI automation testing tool, UI Automator. The tool aims to help developers debug mobile applications more efficiently and to help testers obtain the control structure of application pages. UI Automator actually contains two automated testing tools. The first is uiautomatorviewer, which obtains the UI control structure of an application's interactive pages; used together with ADB shell commands, it can save the UI control structure to a directory on the phone as an XML file, from which information such as the control type (class), the unique identifier (resource-id) and the control text (text) can be obtained. The second is the UI Automator library, an automated testing library implemented in Java, which provides testers with a rich set of APIs and an engine for executing automated tests, and can simulate user operations such as clicking, swiping and keyboard input.
However, there is little existing research on extracting the privacy policy embedded in a mobile application. The main existing method analyzes the file structure of the application through static analysis, preprocesses the input sample, extracts the required information and generates an Activity tree graph. An Activity traversal script, written according to a traversal strategy based on the Activity tree graph and its tree hierarchy, is then run; its main task is to find the page where the privacy agreement is located and to obtain the privacy policy file of the mobile application by matching keywords against the text of the controls on that page.
Because the accuracy of the Activity tree graph decreases as the hierarchy deepens, and because privacy policy links are identified by matching control text, some links may be missed, so there is still room to improve the success rate of privacy policy extraction. In addition, the method requires both static analysis and automated testing for every input sample; the extraction speed is low because of the complexity of these steps and the time consumed by automated testing.
As the functions of mobile applications become increasingly rich, mobile applications play an increasingly important role in people's work, study and life and can acquire more and more user information, which brings new privacy infringement problems. To meet users' privacy protection needs, application developers must continually improve the behavior of their applications, and platforms must continually strengthen supervision. To protect users' private information, regulations require applications to provide a corresponding privacy policy document. In November 2019, the Ministry of Industry and Information Technology and other departments jointly issued the 'Method for Identifying Illegal Collection and Use of Personal Information by Apps'. Its first item, 'failure to publish collection and use rules', covers the case in which an App has no privacy policy or the privacy policy contains no rules on the collection and use of personal information. To detect violations of this item in mobile applications, research is needed on discovering and extracting the embedded privacy policy links in mobile application APK installation files.
Disclosure of Invention
In view of the above problems, the invention provides a method for extracting the privacy policy embedded in a mobile application APK file. For Android mobile applications, the method improves the success rate of extracting the embedded privacy policy by combining several extraction techniques, helps application developers and the relevant regulatory departments supervise mobile application privacy policies, and makes it easier for researchers of privacy policy analysis techniques to collect data.
The method for extracting the embedded privacy policy of the mobile application APK file comprises the following specific steps:
Step one, selecting a mobile application APK file to be tested, and decompiling it with the decompilation tool apktool;
decompiling yields the smali code, the AndroidManifest.xml file, pictures, XML configuration, language resources and other files of the APK file;
Step two, using regular expressions to perform rule matching on the smali code, and obtaining the set of all URL links in the APK file;
Step three, crawling the webpage content of each obtained URL link with a selenium crawler;
Step four, preprocessing each webpage's content separately, and extracting the feature words of the privacy policy text with a chi-square test algorithm.
The specific process is as follows:
first, deleting from the crawled webpage the tags that are irrelevant to the page text, together with their content, as well as phrases related to page navigation;
then, converting the remaining text document into markdown format, normalizing the Unicode characters, stripping the markdown formatting and outputting a plain text document;
next, performing word segmentation on the plain text document, and removing words that are irrelevant to, or only weakly associated with, the characteristic information of the text type;
finally, calculating the weight of each remaining segmented word with the chi-square test algorithm, sorting the words in descending order of weight, and selecting l of them, where l is a fixed length, as feature words to form a word vector.
The fixed length l is set according to the input length of the binary classification model.
Step five, collecting privacy policy webpages and non-privacy-policy webpages, and preprocessing each of them to obtain the feature words used to train the binary classification model;
specifically:
first, crawling the content of each webpage, obtaining the preprocessed feature words of each webpage, and dividing them into a training set and a test set;
then, for each page in the training set, selecting l_0 feature words to form a word vector of initial length l_0, and inputting the word vectors into the binary classification model for training;
according to the results of the binary classification model, continuously adjusting the initial length l_0 until a fixed length l that meets the required precision of the binary classification model is obtained;
finally, testing the trained binary classification model with the test set.
Step six, forming the feature words extracted from each webpage of the APK file to be tested into word vectors of fixed length l, inputting them into the trained binary classification model one by one, and judging from the output whether a privacy policy page exists; if so, outputting the privacy policy and ending; otherwise, proceeding to step seven;
Step seven, automatically installing the APK file to be tested on a test device and carrying out automated dynamic testing;
Step eight, using ADB shell commands to perform simulated clicks, level by level in depth, on the key controls of each page on the test device one by one, monitoring the request address triggered by each click, and extracting the corresponding URL links from the request addresses;
the key controls are determined as follows: the UI structure tree of each page is obtained with the existing tool UI Automator, every control in the tree is extracted, and, by traversing the controls, the clickable controls whose text elements contain keywords such as 'privacy', 'service' or 'user' are marked as key controls.
Step nine, crawling the page content of each URL link with a crawler, returning to step four for preprocessing, and inputting the result into the binary classification model for judgment, until a privacy policy page is found or the set traversal depth is exceeded, whereupon the method ends.
The traversal depth is specified manually.
The invention has the advantages that:
1) The method of the invention combines static testing and dynamic testing of the mobile application to automatically extract the privacy policy embedded in an Android application.
2) The method of the invention is based on a machine learning algorithm. Based on content analysis and feature extraction of existing privacy policy links from application stores, a binary classification model is trained to judge whether a page is a privacy policy page; the application's APK file is statically analyzed, all URLs are extracted and each is judged in turn, so that the privacy policy link is extracted. Compared with the prior art, in which both static and dynamic tests must be run on every program before the privacy policy can be obtained, the method effectively improves the efficiency of privacy policy extraction and achieves a higher extraction success rate.
3) The method of the invention provides automatic extraction of the privacy policy in an application through traffic detection, which increases the privacy policy extraction rate.
4) The method of the invention is suitable for testing Android mobile applications and automatically extracting their embedded privacy policies.
5) The method of the invention is suitable for security regulators, application developers and individual users; it performs automatic privacy policy extraction on Android mobile applications, supports the supervision of mobile applications, and facilitates data collection by researchers working on techniques for analyzing Android mobile application privacy policies.
Drawings
FIG. 1 is a schematic diagram of a method for extracting an embedded privacy policy of a mobile application APK file according to the present invention;
FIG. 2 is a schematic diagram of an embedded privacy policy extraction module of a mobile application APK file according to the present invention;
FIG. 3 is a flowchart of a method for extracting an embedded privacy policy of an APK file of a mobile application according to the present invention.
Detailed Description
The present invention will be described in further detail and with reference to the accompanying drawings so that those skilled in the art can understand and practice the invention.
Based on Android applications, the method uses dynamic and static detection to detect the privacy policy link embedded in a mobile application APK file, and automatically discovers and extracts the privacy policy link contained in the APK file. The overall workflow is shown in FIG. 1. First, static analysis is performed on the mobile application APK file to be tested, and the set of all URL links contained in the APK file is obtained through decompilation and rule matching. A crawler then crawls the webpage content corresponding to each URL link, and page features are extracted. In parallel, a privacy policy crawler collects pages whose features are extracted and input into a classification model to train a privacy policy page judgment model. The features extracted from the obtained pages are then judged by the privacy policy page judgment model, which is based on a machine learning algorithm, so as to find the privacy policy page of the mobile application. Automated testing is performed on mobile application samples for which static analysis fails to find a privacy policy page: the post-installation pop-up and the registration page, where a privacy policy page is most likely to appear, are tested automatically, and interface control recognition and application runtime traffic information are used to identify whether a privacy policy link exists, until the limited traversal depth is reached.
The above process consists of two main modules, as shown in FIG. 2: a privacy policy static discovery module and a privacy policy dynamic extraction module. The privacy policy static discovery module performs static analysis on the mobile application APK file, including reverse engineering, rule matching and page downloading; it obtains the corresponding page content by discovering and downloading the URL links in the file, and finally performs page judgment.
The privacy policy dynamic extraction module supplements the static discovery module. When the static discovery module does not find a privacy policy link, it dynamically tests the application through automated testing, and extracts the privacy policy link of the mobile application in two ways: recognizing the controls of key interfaces and detecting traffic.
The invention relates to a method for extracting an embedded privacy policy of an APK file of a mobile application, which comprises the following specific steps as shown in FIG. 3:
Step one, a user issues a privacy policy detection task, selects a mobile application APK file to be tested from the task, and decompiles it with the decompilation tool apktool;
the privacy policy detection task covers the detection of a plurality of mobile application APK files;
decompiling yields the smali code, the AndroidManifest.xml file, pictures, XML configuration, language resources and other files of the APK file;
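As an illustration of step one, the following is a minimal Python sketch that invokes apktool from a script. It assumes apktool is installed and available on the PATH; the function names and the paths passed in are hypothetical and are not part of the original method.

```python
import subprocess
from pathlib import Path


def build_apktool_command(apk_path: str, out_dir: str) -> list[str]:
    """Return the apktool command line that decodes an APK into out_dir."""
    # "d" decodes the APK, "-f" overwrites an existing output folder,
    # and "-o" names the output folder.
    return ["apktool", "d", "-f", "-o", str(out_dir), str(apk_path)]


def decompile_apk(apk_path: str, out_dir: str) -> Path:
    """Run apktool (assumed to be on PATH) and return the output directory."""
    subprocess.run(build_apktool_command(apk_path, out_dir), check=True)
    return Path(out_dir)
```

After a successful run, the output directory is expected to contain the smali folders, AndroidManifest.xml and the res folder used by the later steps.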
Step two, using regular expressions to perform rule matching on the smali code, and obtaining the set of all URL links in the APK file;
rule matching is performed on the mobile application package files obtained by decompilation: a regular expression for the URL link format is matched against them to obtain all URL links contained in the APK file, and the obtained URL links are deduplicated.
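The rule matching and deduplication of step two can be sketched as follows. The regular expression is an assumption chosen for illustration (the original text does not give the exact pattern), and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern: an http(s) URL running up to the first quote,
# whitespace, angle bracket, or backslash.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>\\]+")


def extract_urls_from_text(text: str) -> list[str]:
    """Return the URLs found in text, deduplicated in first-seen order."""
    matches = (m.rstrip(".,;)") for m in URL_PATTERN.findall(text))
    return list(dict.fromkeys(matches))


def extract_urls_from_dir(decompiled_dir: str) -> list[str]:
    """Scan every .smali file under decompiled_dir and merge the URL sets."""
    urls: dict[str, None] = {}
    for path in Path(decompiled_dir).rglob("*.smali"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        urls.update(dict.fromkeys(extract_urls_from_text(text)))
    return list(urls)
```

A dictionary is used instead of a set so that the deduplicated links keep the order in which they were first seen, which makes later crawling reproducible.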
Step three, crawling the webpage content of each obtained URL link with a selenium crawler;
the content of each obtained URL page is automatically fetched and stored using a web crawler and the Beautiful Soup tool, providing data for subsequent analysis and verification.
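A minimal sketch of step three is given below, assuming selenium with a locally installed Chrome and a matching driver. The file naming scheme, the timeout and the function names are illustrative assumptions; Beautiful Soup is not used here and would come into play when the stored HTML is parsed.

```python
import hashlib
from pathlib import Path


def page_filename(url: str) -> str:
    """Derive a stable file name for a URL (assumed naming scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"


def fetch_page_source(url: str, timeout_s: int = 30) -> str:
    """Load url in headless Chrome and return the rendered HTML."""
    # Imported lazily so the helper above can be used without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(timeout_s)
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def crawl_and_store(urls: list[str], out_dir: str) -> dict[str, Path]:
    """Fetch each URL and save its HTML; unreachable links are skipped."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved: dict[str, Path] = {}
    for url in urls:
        try:
            html = fetch_page_source(url)
        except Exception:
            continue  # dead or slow links are common in decompiled code
        path = out / page_filename(url)
        path.write_text(html, encoding="utf-8")
        saved[url] = path
    return saved
```

Starting one browser per URL keeps the sketch simple; a real crawler would more likely reuse a single driver across links.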
Step four, preprocessing each webpage's content separately, and extracting the feature words of the privacy policy text with a chi-square test algorithm.
This part uses a web crawler to collect existing privacy policy links from domestic application store platforms, analyzes the existing samples and extracts features to build a classification model, classifies the pages obtained in the previous step, and judges whether each obtained page is a privacy policy page.
The preprocessing includes both preprocessing of privacy policy pages and preprocessing of privacy documents.
The specific process is as follows:
first, deleting from the crawled content the tags that are irrelevant to the webpage text or the privacy policy text, together with the content they display, as well as phrases related to page navigation,
for example html, comment, style or script tags;
then, converting the remaining text document into markdown format with html2text, normalizing the Unicode characters, stripping the markdown formatting, and outputting a plain text document of the privacy policy page content;
the characters handled in this way include title markers, bullets, list item numbers and other formatting characters.
next, performing word segmentation on the plain text document with the Jieba word segmentation tool, and using the Harbin Institute of Technology stop-word list to remove words that are irrelevant to, or only weakly associated with, the characteristic information of the text type;
finally, calculating the weight of each remaining segmented word with the chi-square test algorithm, sorting the words in descending order of weight, and selecting l of them, where l is a fixed length, as feature words to form a word vector.
The fixed length l is set according to the input length of the binary classification model.
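The preprocessing and feature selection of step four can be sketched as follows. This is a simplified stand-in that uses only the Python standard library: html.parser replaces html2text, the input to the chi-square step is assumed to be already segmented (for example by Jieba), and the set of skipped tags and all function names are assumptions. The chi-square statistic is the usual 2x2 document-frequency form, where a and b count privacy policy and other pages containing the word, and c and d count those that do not.

```python
import re
import unicodedata
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "nav"}  # assumed tags unrelated to page text


class TextExtractor(HTMLParser):
    """Collect text outside the skipped tags; HTML comments are ignored."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_plain_text(html: str) -> str:
    """Strip markup, normalize Unicode (NFKC) and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFKC", " ".join(parser.parts))
    return re.sub(r"\s+", " ", text).strip()


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a 2x2 word/class contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_feature_words(docs, labels, l):
    """Rank words by chi-square weight (descending) and keep the top l.

    docs is a list of token lists; labels uses 1 for privacy policy pages.
    """
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    pos_df, neg_df = {}, {}
    for tokens, label in zip(docs, labels):
        target = pos_df if label == 1 else neg_df
        for tok in set(tokens):
            target[tok] = target.get(tok, 0) + 1
    scores = {}
    for tok in set(pos_df) | set(neg_df):
        a, b = pos_df.get(tok, 0), neg_df.get(tok, 0)
        scores[tok] = chi_square(a, b, pos_total - a, neg_total - b)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:l]
```

Note that the statistic is symmetric: a word that appears only on non-privacy-policy pages also receives a high weight, because it discriminates between the two classes just as well.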
Step five, collecting privacy policy webpages and non-privacy-policy webpages, and preprocessing each of them to obtain the feature words used to train the binary classification model;
specifically:
first, crawling the privacy policy links provided in an application store, downloading the page content and preprocessing it; at the same time, crawling a plurality of non-privacy-policy links, downloading the page content and preprocessing it; the feature words obtained from preprocessing each webpage are divided into a training set and a test set;
the crawl targets the Huawei application market and uses the Scrapy framework together with a selenium crawler; each app may have its own privacy policy link, but apps from the same company may have duplicate links.
Then, for each page in the training set, selecting l_0 feature words to form a word vector of initial length l_0, and inputting the word vectors into a machine learning algorithm to train the binary classification model;
the binary classification model is trained with naive Bayes and a support vector machine, based on the scikit-learn toolkit;
according to the results of the binary classification model, continuously adjusting the initial length l_0 until a fixed length l that meets the required precision of the binary classification model is obtained;
finally, testing the trained binary classification model with the test set.
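The training loop of step five, in which the initial length l_0 is adjusted until the model is accurate enough, can be sketched as follows. To stay self-contained, the sketch swaps in a small hand-written multinomial naive Bayes instead of the scikit-learn models named above. The validation split, the target accuracy, the step size and all names are assumptions, since the text does not say on which data the precision is measured.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over a fixed vocabulary."""

    def fit(self, docs, labels, vocab):
        self.vocab = set(vocab)
        self.classes = sorted(set(labels))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            counts = Counter(t for d in class_docs for t in d if t in self.vocab)
            total = sum(counts.values()) + len(self.vocab)
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            self.log_like[c] = {t: math.log((counts[t] + 1) / total)
                                for t in self.vocab}
        return self

    def predict(self, doc):
        def score(c):
            return self.log_prior[c] + sum(
                self.log_like[c][t] for t in doc if t in self.vocab)
        return max(self.classes, key=score)


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(model.predict(d) == y for d, y in zip(docs, labels)) / len(docs)


def tune_length(train, valid, ranked_words, l0, target, step=1):
    """Grow l from l0 until validation accuracy reaches target.

    ranked_words are the feature words sorted by descending weight; the
    search also stops when every ranked word is already in use.
    """
    (train_docs, train_y), (valid_docs, valid_y) = train, valid
    l = l0
    while True:
        model = TinyNaiveBayes().fit(train_docs, train_y, ranked_words[:l])
        acc = accuracy(model, valid_docs, valid_y)
        if acc >= target or l >= len(ranked_words):
            return l, model, acc
        l += step
```

The returned model would then be evaluated once on the held-out test set, as in the last sub-step above.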
Step six, forming the feature words extracted from each webpage of the APK file to be tested into word vectors of fixed length l, inputting them into the trained binary classification model one by one, and judging from the output whether a privacy policy page exists; if so, outputting the privacy policy and ending; otherwise, proceeding to step seven;
Step seven, automatically installing the APK file to be tested on a test device and carrying out automated dynamic testing;
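Installing and launching the application in step seven can be scripted through adb, as in the minimal sketch below. It assumes adb is on the PATH and a device or emulator is connected; the package name and device serial are hypothetical inputs, and launching through the monkey tool is one common option rather than the method prescribed by the text.

```python
import subprocess
from typing import Optional


def adb_install_command(apk_path: str, serial: Optional[str] = None) -> list[str]:
    """Build 'adb install -r', optionally targeting one device by serial."""
    cmd = ["adb"]
    if serial:
        cmd += ["-s", serial]
    return cmd + ["install", "-r", apk_path]  # -r replaces an existing install


def launch_command(package: str) -> list[str]:
    """Build a command that starts the app's launcher activity via monkey."""
    return ["adb", "shell", "monkey", "-p", package,
            "-c", "android.intent.category.LAUNCHER", "1"]


def install_and_launch(apk_path: str, package: str) -> None:
    """Install the APK on the connected device, then start the app."""
    subprocess.run(adb_install_command(apk_path), check=True)
    subprocess.run(launch_command(package), check=True)
```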
Step eight, using ADB shell commands to perform simulated clicks, level by level in depth, on the key controls of each page on the test device one by one, monitoring in the traffic information the request address triggered by each click, and extracting the corresponding URL links from the request addresses;
the key controls are determined as follows: when the APP reaches a new page, the UI structure tree of the current page is obtained with the existing tool UI Automator, and every control in the tree is extracted; each control is instantiated as the corresponding system class and pushed onto a stack; the control elements in the stack are then traversed, and the clickable controls whose text elements contain keywords such as 'privacy', 'service' or 'user' are marked as key controls;
the key controls obtained are then clicked, the traffic information of the application is monitored with a tool, and the privacy policy link is extracted from the traffic information.
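The identification of key controls in step eight can be sketched as follows, assuming the UI structure tree has been saved as XML in the format produced by 'adb shell uiautomator dump'. The English keywords are taken from the translated text (a real run would use keywords in the language of the app's interface), and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

KEYWORDS = ("privacy", "service", "user")
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def find_key_controls(ui_xml: str, keywords=KEYWORDS):
    """Return (text, x, y) for each clickable node whose text has a keyword.

    x and y are the center of the node's bounds, i.e. the point to tap.
    """
    stack = [ET.fromstring(ui_xml)]
    hits = []
    while stack:
        node = stack.pop()
        stack.extend(reversed(list(node)))  # reversed keeps document order
        text = node.get("text", "")
        if node.get("clickable") == "true" and any(
                k in text.lower() for k in keywords):
            m = BOUNDS_RE.fullmatch(node.get("bounds", ""))
            if m:
                x1, y1, x2, y2 = map(int, m.groups())
                hits.append((text, (x1 + x2) // 2, (y1 + y2) // 2))
    return hits


def tap_command(x: int, y: int) -> list[str]:
    """Build the ADB shell command that simulates a tap at (x, y)."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]
```

The explicit stack mirrors the stack-based traversal described above; the traffic monitoring that follows each tap depends on the capture tool used and is not shown.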
Step nine, crawling the page content of each URL link with a crawler, returning to step four for preprocessing, and inputting the result into the binary classification model for judgment, until a privacy policy page is found or the set traversal depth is exceeded, and returning the detection result to the user.
The traversal depth is specified manually.
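The stopping rule of steps eight and nine, which ends the search when a privacy policy page is found or the traversal depth is exceeded, can be sketched as a depth-limited search. The breadth-first order and the two callbacks are assumptions: expand stands for clicking the key controls of a page and collecting the pages or URLs that result, and is_policy_page stands for crawling, preprocessing and classifying a page.

```python
from collections import deque


def find_policy_page(start_pages, expand, is_policy_page, max_depth):
    """Breadth-first search that stops at the first privacy policy page.

    Pages at depth max_depth are still checked but not expanded further.
    Returns the matching page, or None if the depth limit is exhausted.
    """
    queue = deque((page, 0) for page in start_pages)
    seen = set(start_pages)
    while queue:
        page, depth = queue.popleft()
        if is_policy_page(page):
            return page
        if depth >= max_depth:
            continue
        for nxt in expand(page):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None
```

Tracking visited pages prevents the traversal from looping when two pages link to each other.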

Claims (6)

1. A method for extracting the embedded privacy policy of a mobile application APK file, characterized by comprising the following specific steps:
first, decompiling the mobile application APK file to be tested and applying rule matching to obtain the set of all URL links in the APK file;
second, crawling the webpage content corresponding to each URL link with a crawler, and extracting the feature words of the privacy policy text;
meanwhile, collecting a number of privacy policy webpages and non-privacy-policy webpages, and extracting feature words from them in the same way to train a binary classification model;
finally, inputting the feature words extracted for the APK file to be tested into the trained binary classification model one by one, and judging from the output whether a privacy policy page exists; if so, outputting the privacy policy and ending; otherwise, performing an automated dynamic test;
the automated dynamic test proceeds as follows:
first, automatically installing the APK file to be tested on a test device, using ADB shell commands to simulate clicks on the key controls of each page one by one, level by level in depth, monitoring the request addresses returned by each click, and extracting the corresponding URL links from them;
then, crawling the page content behind each URL link with a crawler, preprocessing it again, and inputting the resulting feature words into the binary classification model for judgment, until a privacy policy page is found or the set traversal depth is exceeded;
the specific process of crawling each webpage's content and extracting the feature words of the privacy policy text is as follows:
first, deleting from the crawled webpage the tags unrelated to the page's body text, together with their content, and the phrases related to page navigation;
then, converting the remaining text into markdown format, normalizing the Unicode characters, stripping the markdown formatting, and outputting a plain-text document;
then, segmenting the plain-text document into words, and removing words that are unrelated or only weakly related to the features of the text type;
finally, computing a weight for each remaining word with the chi-square test, sorting the words in descending order of weight, and selecting a fixed number l of words as the feature words to form a word vector;
the fixed length l is set according to the input length of the binary classification model.
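Claim 1 ranks the segmented words by a chi-square weight but does not state a formula. The sketch below assumes the common 2x2 document-frequency form of the statistic, chi2 = N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)), where A and B count the policy and non-policy pages containing the word and C and D count those that do not. The function names and the toy corpus are illustrative only.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each word by a 2x2 chi-square statistic (assumed form).

    docs   : list of token lists, one per page
    labels : list of bools, True for a privacy policy page
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, is_policy in zip(docs, labels):
        # Count each word once per page (document frequency).
        (pos_df if is_policy else neg_df).update(set(tokens))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # policy pages containing the word
        b = neg_df[term]   # non-policy pages containing the word
        c = n_pos - a      # policy pages without the word
        d = n_neg - b      # non-policy pages without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_features(scores, length):
    """Return the `length` highest-weighted words, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:length]


docs = [
    ["privacy", "data", "we"],
    ["privacy", "collect", "we"],
    ["news", "sports", "we"],
    ["shop", "cart", "we"],
]
labels = [True, True, False, False]
scores = chi_square_scores(docs, labels)
features = top_features(scores, 1)
```

In this toy corpus "privacy" appears in every policy page and in no other page, so it receives the highest weight, while "we" appears everywhere and scores zero.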
2. The method for extracting the embedded privacy policy of the mobile application APK file according to claim 1, wherein the decompiling yields the smali code, images, XML configuration, and language resources of the APK file.
3. The method for extracting the embedded privacy policy of the mobile application APK file according to claim 1, wherein the rule matching adopts a regular expression.
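The claims do not give the regular expression used for rule matching. As a minimal sketch, the pattern below is an assumption chosen for illustration: it collects http and https links from decompiled text such as smali string constants or XML resources, and removes duplicates while keeping first-seen order.

```python
import re

# Assumed pattern: scheme, then any run of characters that cannot
# appear unescaped inside a quoted string or an XML text node.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>\\]+")


def extract_urls(text):
    """Return the distinct URLs found in `text`, in first-seen order."""
    seen = {}
    for match in URL_PATTERN.findall(text):
        # Trim trailing punctuation that often follows a URL in prose.
        seen.setdefault(match.rstrip(".,;)"), None)
    return list(seen)


sample = (
    'const-string v0, "https://example.com/privacy.html"\n'
    '<string name="tos">http://example.com/terms</string>\n'
    'const-string v1, "https://example.com/privacy.html"\n'
)
links = extract_urls(sample)
```

A production rule set would likely also need to handle URLs assembled at run time from string fragments, which static matching of this kind cannot see; that gap is what the dynamic test in claim 1 addresses.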
4. The method for extracting the embedded privacy policy of the mobile application APK file according to claim 1, wherein the specific process of training the binary classification model comprises:
first, crawling the content of each webpage, obtaining the preprocessed feature words of each webpage, and dividing them into a training set and a test set;
then, for each page in the training set, selecting l₀ feature words to form a word vector of initial length l₀, and inputting the word vectors into the binary classification model for training;
adjusting the initial length l₀ repeatedly according to the results of the binary classification model until a fixed length l that satisfies the accuracy requirement of the binary classification model is obtained;
and finally, testing the trained binary classification model with the test set.
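The length adjustment in claim 4 can be sketched as a simple search loop. The claim does not say how the length is adjusted, so growing it by a fixed step is an assumption here, as are the function name and the `train_and_score` callable, which stands in for training the classifier at a given vector length and returning its accuracy.

```python
def tune_vector_length(train_and_score, initial_length, step, max_length, target_accuracy):
    """Grow the word-vector length until the accuracy target is met.

    train_and_score(length) is a hypothetical callable that trains the
    classifier with vectors of that length and returns its accuracy.
    Returns (length, accuracy); if the target is never reached, the
    best length seen is returned instead.
    """
    best = (initial_length, 0.0)
    length = initial_length
    while length <= max_length:
        accuracy = train_and_score(length)
        if accuracy > best[1]:
            best = (length, accuracy)
        if accuracy >= target_accuracy:
            return length, accuracy
        length += step
    return best


# Toy scorer standing in for real training: accuracy rises with length.
chosen = tune_vector_length(lambda n: min(1.0, n / 200), 50, 50, 300, 0.9)
```

With the toy scorer the loop tries lengths 50, 100, 150, and 200 and stops at 200, the first length whose accuracy meets the 0.9 target.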
5. The method for extracting the embedded privacy policy of the mobile application APK file according to claim 1, wherein the key controls are determined as follows:
obtaining the UI structure tree of each page through simulated clicks on each page, extracting every control in the tree, and traversing the controls to mark as key controls those clickable controls whose text elements contain the keywords "privacy", "service", or "user".
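The key-control selection in claim 5 can be illustrated on a UI hierarchy in the XML form produced by `adb shell uiautomator dump`, where each `node` element carries `text`, `clickable`, and `bounds` attributes. This is a sketch under assumptions: the patent does not name the dump tool, the keyword list is the English rendering of the claim's keywords, and the function name is hypothetical. The returned centre coordinates are the kind of values one would pass to `adb shell input tap`.

```python
import re
import xml.etree.ElementTree as ET

KEYWORDS = ("privacy", "service", "user")  # English rendering, assumed
BOUNDS = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def find_key_controls(ui_xml, keywords=KEYWORDS):
    """Return (text, x, y) for clickable nodes whose text has a keyword."""
    controls = []
    for node in ET.fromstring(ui_xml).iter("node"):
        text = node.get("text", "")
        if node.get("clickable") != "true":
            continue  # only clickable controls qualify
        if not any(word in text.lower() for word in keywords):
            continue
        match = BOUNDS.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # no usable coordinates for this node
        left, top, right, bottom = map(int, match.groups())
        # The centre of the bounding box is a reasonable tap target.
        controls.append((text, (left + right) // 2, (top + bottom) // 2))
    return controls


sample_xml = """
<hierarchy rotation="0">
  <node text="" clickable="false" bounds="[0,0][1080,1920]">
    <node text="Privacy Policy" clickable="true" bounds="[100,200][300,260]"/>
    <node text="Login" clickable="true" bounds="[100,300][300,360]"/>
    <node text="User Agreement" clickable="false" bounds="[100,400][300,460]"/>
  </node>
</hierarchy>
"""
key_controls = find_key_controls(sample_xml)
```

Only "Privacy Policy" is selected in the sample: "Login" has no keyword, and "User Agreement" matches a keyword but is not clickable.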
6. The method for extracting the embedded privacy policy of the mobile application APK file according to claim 1, wherein the traversal depth is specified manually.
CN202110359392.0A 2021-04-02 2021-04-02 Method for extracting embedded privacy policy of mobile application APK file Active CN113076538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110359392.0A CN113076538B (en) 2021-04-02 2021-04-02 Method for extracting embedded privacy policy of mobile application APK file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110359392.0A CN113076538B (en) 2021-04-02 2021-04-02 Method for extracting embedded privacy policy of mobile application APK file

Publications (2)

Publication Number Publication Date
CN113076538A (en) 2021-07-06
CN113076538B (en) 2021-12-14

Family

ID=76614868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110359392.0A Active CN113076538B (en) 2021-04-02 2021-04-02 Method for extracting embedded privacy policy of mobile application APK file

Country Status (1)

Country Link
CN (1) CN113076538B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742773A (en) * 2021-08-31 2021-12-03 平安普惠企业管理有限公司 Privacy bullet frame detection method, device, equipment and storage medium
CN114297700B (en) * 2021-11-11 2022-09-23 北京邮电大学 Dynamic and static combined mobile application privacy protocol extraction method and related equipment
CN114417396B (en) * 2021-12-13 2023-03-24 奇安盘古(上海)信息技术有限公司 Privacy policy text data extraction method and device, electronic equipment and storage medium
CN115630357B (en) * 2022-10-26 2023-09-22 四川大学 Method for judging behavior of collecting personal information by application program crossing boundary

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504335A (en) * 2014-12-24 2015-04-08 中国科学院深圳先进技术研究院 Fishing APP detection method and system based on page feature and URL feature
CN106022127A (en) * 2016-05-10 2016-10-12 江苏通付盾科技有限公司 APK file security detection method and apparatus
CN112199506A (en) * 2020-11-10 2021-01-08 支付宝(杭州)信息技术有限公司 Information detection method, device and equipment for application program
CN112214418A (en) * 2020-12-04 2021-01-12 支付宝(杭州)信息技术有限公司 Application compliance detection method and device and electronic equipment
CN112364165A (en) * 2020-11-12 2021-02-12 上海犇众信息技术有限公司 Automatic classification method based on Chinese privacy policy terms

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133519B (en) * 2017-05-15 2019-07-05 华中科技大学 Privacy compromise detection method and system in a kind of communication of Android application network

Also Published As

Publication number Publication date
CN113076538A (en) 2021-07-06

Similar Documents

Publication Publication Date Title
CN113076538B (en) Method for extracting embedded privacy policy of mobile application APK file
Lin et al. Cross-project transfer representation learning for vulnerable function discovery
CN111459799B (en) Software defect detection model establishing and detecting method and system based on Github
CN108459874B (en) Code automatic summarization method integrating deep learning and natural language processing
CN100485703C (en) Method and system for processing computer malicious code
CN112749284B (en) Knowledge graph construction method, device, equipment and storage medium
CN113822067A (en) Key information extraction method and device, computer equipment and storage medium
CN106557695A (en) A kind of malicious application detection method and system
US20050246353A1 (en) Automated transformation of unstructured data
CN112307473A (en) Malicious JavaScript code detection model based on Bi-LSTM network and attention mechanism
US11263062B2 (en) API mashup exploration and recommendation
CN108229170B (en) Software analysis method and apparatus using big data and neural network
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN114386422B (en) Intelligent auxiliary decision-making method and device based on enterprise pollution public opinion extraction
Abebe et al. Towards the extraction of domain concepts from the identifiers
CN112000929A (en) Cross-platform data analysis method, system, equipment and readable storage medium
CN111813443B (en) Method and tool for automatically filling code sample by using Java FX
CN114817924B (en) AST (AST) and cross-layer analysis based android malicious software detection method and system
CN112035723A (en) Resource library determination method and device, storage medium and electronic device
CN110334180B (en) Mobile application security evaluation method based on comment data
CN113869789A (en) Risk monitoring method and device, computer equipment and storage medium
EP3553696A1 (en) Generating a structured document based on a machine readable document and artificial intelligence-generated annotations
CN114238735B (en) Intelligent internet data acquisition method
CN114491530A (en) Android application program classification method based on abstract flow graph and graph neural network
CN113987496A (en) Malicious attack detection method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant