CN117896732B - APP privacy data use purpose consistency analysis method based on large language model - Google Patents
- Publication number
- CN117896732B (application number CN202410291322.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- triplet
- language model
- usage
- large language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W12/00—Security arrangements; Authentication; Protecting privacy or anonymity
- H04W12/12—Detection or prevention of fraud
- H04W12/128—Anti-malware arrangements, e.g. protection against SMS fraud or mobile malware
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W12/00—Security arrangements; Authentication; Protecting privacy or anonymity
- H04W12/12—Detection or prevention of fraud
- H04W12/121—Wireless intrusion detection systems [WIDS]; Wireless intrusion prevention systems [WIPS]
Abstract
The invention discloses a large language model-based method for analyzing whether an app's use of private data is consistent with its stated purposes. The method comprises: using a large language model to analyze the privacy policy text sentence by sentence, generating data collection triplets and data usage triplets, and checking the triplets for conflicts to determine whether the data processing rules in the privacy policy text are internally consistent; using the large language model to generate specific tasks capable of triggering data processing behaviors, combining the large language model with a test input generator to complete the tasks automatically, capturing the network traffic generated during operation with a network analysis tool, analyzing the purposes of data use, and extracting data flow triplets; and comparing the data collection triplets and data usage triplets with the data flow triplets to determine whether the mobile app's actual use of private data is consistent with its privacy policy text.
Description
Technical Field
The invention belongs to the technical field of private data security, and in particular relates to a large language model-based method for analyzing the consistency of the purposes for which an app uses private data.
Background
Mobile devices have become part of everyday life, and a wide variety of mobile applications are now integral to people's daily life, work, and travel. However, as mobile applications take on more functions, privacy leakage has become more serious and personal information leaks occur frequently, so the problem of privacy protection needs to be solved.
To protect the data security of mobile users, prior work has focused mainly on the types of private data that mobile applications acquire; little research has examined the purposes for which that data is used. Analysis of mobile applications' privacy policy texts and of their actual behavior faces the following main problems:
1) Most privacy policy texts are written by hand, and the policies of different mobile applications differ in writing style and phrasing. Traditional natural language processing techniques are complex and struggle to analyze privacy policy texts automatically;
2) Privacy policy texts can contradict themselves. Within a single policy, an earlier part may declare that a certain type of private data is not collected while another part declares that the same type of data must be collected for certain functions, leaving it unclear whether the application is entitled to collect that data;
3) Existing analysis of actual application behavior generally uses a test input generator that clicks randomly on options in the mobile application to trigger private data collection; this approach cannot cover all private data collection behaviors;
4) Although the privacy policy text discloses the purposes of private data collection to users, the way data is used in an application's actual behavior does not always match the declared purposes, and little related work focuses on the purposes of private data use.
Disclosure of Invention
The invention aims to solve the above technical problems by providing a large language model-based method for analyzing the consistency of an app's private data use purposes, improving analysis efficiency and accuracy.
To achieve the above purpose, the invention provides a large language model-based method for analyzing the consistency of app private data use purposes, comprising the following steps:
Step S101: for the software under test S, obtain its privacy policy text, preprocess the text, and obtain the privacy policy sentences W related to data behaviors;
Step S102: define the extraction rules for data collection triplets and data usage triplets. A data collection triplet dc = (r, c, d) represents the collection of data object d by data receiver r, where c indicates whether d is collected; a data usage triplet du = (d, k, p) represents the use of data object d for use purpose p, where k indicates whether d is used. Use a large language model to generate the data collection triplets dc and data usage triplets du from the privacy policy sentences W related to data behaviors;
Step S103: use the large language model to detect whether conflicts exist among the data collection triplets dc or among the data usage triplets du; if so, determine that the data processing rules in the privacy policy text of the software under test S are inconsistent;
Step S104: for each privacy policy sentence in W, use the large language model to generate a task capable of triggering the corresponding data processing behavior, and record the generated tasks as task list L;
Step S105: use a test input generator to simulate a user clicking through the mobile app interface. Input the tasks in task list L to the large language model one at a time; the test input generator parses each operation instruction output by the large language model and executes the corresponding action, repeating in a loop until the task is completed in the software under test S. Capture the network traffic generated during operation with a network analysis tool;
Step S106: extract data flow triplets df from the network traffic, where df = (r, d, p) represents the actual behavior in which data receiver r collects data object d and uses it for use purpose p;
Step S107: compare the data collection triplets dc and data usage triplets du obtained in step S102 with the data flow triplets df obtained in step S106. If a data flow triplet df shows data receiver r collecting data object d and that behavior does not appear in the data collection triplets dc, determine that the private data collection behavior of the software under test S is inconsistent with its privacy policy text; if a data flow triplet df shows data object d being used for use purpose p and that behavior does not appear in the data usage triplets du, determine that the private data use purposes of the software under test S are inconsistent with its privacy policy text.
Further, step S101 is carried out as follows:
Step S201: for the software under test S, obtain its privacy policy text;
Step S202: split the privacy policy text into sentences at punctuation marks, and store the resulting independent sentences in file A;
Step S203: build a verb list based on the frequency of words that describe data collection or use actions in privacy policy texts, match the sentences in file A against the verb list, and select the privacy policy sentences W related to data behaviors. The verbs include, for example, "collect" and "use".
Further, step S102 is carried out as follows:
The extraction rules for data collection and data usage triplets are sent to the large language model, together with an example template for the model to learn from. The large language model then generates the data collection triplets dc and data usage triplets du from the privacy policy sentences W related to data behaviors. When a sentence involves several data objects, the result is split into multiple triplets that each contain exactly one data object. The permitted values are defined as follows: data receiver r is application provider or external partner; c is collected or not collected; k is used or not used; and use purpose p is providing basic services, providing personalized services, security protection, providing advertisements, or personalized advertisements.
An example: for the sentence "If you use the real-time weather update function, we may collect your location information and device information while your device is in a silent state, in order to update the weather for your location in a timely manner", the corresponding data collection triplets are dc1 = (application provider, collected, location information) and dc2 = (application provider, collected, device information), and the data usage triplets are du1 = (location information, used, providing basic services) and du2 = (device information, used, providing basic services).
Further, step S103 is carried out as follows:
Step S401: send the data collection triplets dc to the large language model and detect whether data collection conflicts exist. If one data collection triplet states that data receiver r1 collects data object d1 and another states that r1 does not collect d1, the two form a first conflict. If one data collection triplet states that data receiver r2 collects data object d2 and another states that r2 does not collect data object d3, where d3 includes d2, the two form a second conflict. If at least one first or second conflict exists, determine that the data collection rules within the privacy policy text of the software under test S are inconsistent.
For example, given dc1 = (application provider, collected, AndroidID) and dc2 = (application provider, not collected, device information), where device information includes AndroidID among other items, a second conflict exists between dc1 and dc2.
Step S402: send the data usage triplets du to the large language model and detect whether data usage conflicts exist. If one data usage triplet states that data object d4 is used for use purpose p1 and another states that d4 is not used for p1, the two form a third conflict. If one data usage triplet states that data object d5 is used for use purpose p2 and another states that data object d6 is not used for p2, where d6 includes d5, the two form a fourth conflict. If at least one third or fourth conflict exists, determine that the data usage rules within the privacy policy text of the software under test S are inconsistent.
For example, given du1 = (AndroidID, used, providing personalized services) and du2 = (device information, not used, providing personalized services), where device information includes AndroidID among other items, a fourth conflict exists between du1 and du2.
Step S403: compare all collected data objects in the data collection triplets dc with all used data objects in the data usage triplets du. If the data usage triplets use a data object that does not appear in the data collection triplets, an overuse data type conflict exists, and the privacy policy text of the software under test S is determined to be inconsistent with respect to overused data types.
For example, if none of the collected data objects in the data collection triplets dc is "AndroidID" while the used data objects in the data usage triplets du include "device information" or "AndroidID", then the data usage triplets use a data object not declared in the data collection triplets, and an overuse data type conflict exists.
Further, step S105 is carried out as follows:
Step S501: the test input generator simulates user clicks on buttons in the screen interface of the software under test S by clicking at random, records the result of each click, and constructs a UI transition graph (UTG) that includes the clicked buttons, the interface elements, and the operations performed;
Step S502: the test input generator traverses all UI elements in the UI transition graph and records the option information;
Step S503: select a task from task list L, convert the UI state and operations into an HTML format carrying structured information, and send the task, the current UI state description, and the task-related option information to the large language model, which returns the next operation instruction based on this input;
Step S504: the test input generator parses the operation instruction and executes the corresponding action. After execution, the task, the current UI state description, the history of actions taken for the task, and the task-related option information are sent to the large language model, and step S503 is repeated until the large language model returns a task-complete instruction;
Step S505: capture the network traffic packets generated during operation using the network analysis tool.
Further, step S106 is carried out as follows:
Step S601: use the network analysis tool to analyze the network traffic packets captured in step S505, identify and extract the structured data in the packets, parse the identified structured data formats, and extract the data in key-value form to generate key-value pairs;
For example: in a user identity information request whose URL is https://api.example.com/user/profile?user_id=123456&email=user@example.com, the generated key-value pairs are user_id: 123456 and email: user@example.com. In a POST request that registers a new device, with Endpoint: /device/register and Request Body: {"device_id": "abcdef123456", "os_version": "Android 11", "device_model": "Samsung Galaxy S21"}, the generated key-value pairs are device_id: abcdef123456, os_version: Android 11, and device_model: Samsung Galaxy S21;
Step S602: match the key-value pairs obtained in step S601 against preset information strings, extract the pairs that match, and record the key value of each matched pair as a data object d. The preset information strings cover the user's personal identity information, device identifiers, geographic location information, payment information, and the like, for example "user_id", "IMEI", and "ip_address";
Step S603: obtain the data receiver r and the use purpose p from the destination URL, the transmitted data, and the application package name in the network traffic, and generate the data flow triplets df.
Beneficial effects: compared with the prior art, the technical solution of the invention has the following beneficial technical effects:
1) The invention uses a large language model to automatically analyze the privacy policy texts of mobile applications in different fields, improving efficiency and accuracy compared with manual review or with analysis by conventional natural language processing techniques.
2) The invention combines a large language model with a test input generator to trigger the software's data collection behaviors, improving trigger coverage compared with random triggering by the test input generator alone.
3) The invention provides an approach for detecting whether the purposes for which a mobile application uses private data are consistent with its privacy policy text. Applying a large language model in both the privacy policy analysis stage and the dynamic software analysis stage is an application of an emerging natural language processing technology to the field of software security.
Drawings
FIG. 1 is a general flow chart of a method for analyzing consistency of APP privacy data usage purposes based on a large language model.
FIG. 2 is a flow chart of a method for determining the consistency of data processing rules within the text of a privacy policy of a mobile APP in accordance with the present invention.
Detailed Description
As shown in FIG. 1, the present invention provides a large language model-based method for analyzing the consistency of app private data use purposes, comprising the following steps:
Step S101: for the software under test S, obtain its privacy policy text, preprocess the text, and obtain the privacy policy sentences W related to data behaviors;
Step S102: define the extraction rules for data collection triplets and data usage triplets. A data collection triplet dc = (r, c, d) represents the collection of data object d by data receiver r, where c indicates whether d is collected; a data usage triplet du = (d, k, p) represents the use of data object d for use purpose p, where k indicates whether d is used. Use a large language model to generate the data collection triplets dc and data usage triplets du from the privacy policy sentences W related to data behaviors;
Step S103: use the large language model to detect whether conflicts exist among the data collection triplets dc or among the data usage triplets du; if so, determine that the data processing rules in the privacy policy text of the software under test S are inconsistent;
Step S104: for each privacy policy sentence in W, use the large language model to generate a task capable of triggering the corresponding data processing behavior, and record the generated tasks as task list L;
Step S105: use a test input generator to simulate a user clicking through the mobile app interface. Input the tasks in task list L to the large language model one at a time; the test input generator parses each operation instruction output by the large language model and executes the corresponding action, repeating in a loop until the task is completed in the software under test S. Capture the network traffic generated during operation with a network analysis tool;
Step S106: extract data flow triplets df from the network traffic, where df = (r, d, p) represents the actual behavior in which data receiver r collects data object d and uses it for use purpose p;
Step S107: compare the data collection triplets dc and data usage triplets du obtained in step S102 with the data flow triplets df obtained in step S106. If a data flow triplet df shows data receiver r collecting data object d and that behavior does not appear in the data collection triplets dc, determine that the private data collection behavior of the software under test S is inconsistent with its privacy policy text; if a data flow triplet df shows data object d being used for use purpose p and that behavior does not appear in the data usage triplets du, determine that the private data use purposes of the software under test S are inconsistent with its privacy policy text.
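The comparison in step S107 can be illustrated with a minimal Python sketch. The tuple layouts, the function name `compare_policy_with_traffic`, and the exact-string matching are assumptions made for illustration only; the patent does not specify how triplets are encoded or how equivalent data objects are normalized.

```python
# Assumed tuple layouts (hypothetical encoding of the triplets):
#   dc: (receiver, collected: bool, data_object)
#   du: (data_object, used: bool, purpose)
#   df: (receiver, data_object, purpose)

def compare_policy_with_traffic(dc, du, df):
    """Return findings for observed flows that the policy does not declare."""
    declared_collection = {(r, d) for (r, c, d) in dc if c}
    declared_use = {(d, p) for (d, k, p) in du if k}
    findings = []
    for (r, d, p) in df:
        if (r, d) not in declared_collection:
            findings.append(("undeclared collection", r, d, p))
        if (d, p) not in declared_use:
            findings.append(("undeclared purpose", r, d, p))
    return findings


dc = [("application provider", True, "location information")]
du = [("location information", True, "providing basic services")]
df = [
    ("application provider", "location information", "providing basic services"),
    ("external partner", "AndroidID", "providing advertisements"),
]
for finding in compare_policy_with_traffic(dc, du, df):
    print(finding)
```

Exact matching is the simplest possible choice here; a practical comparison would probably also need to treat a broad category such as device information as covering narrower objects such as AndroidID.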
Further, step S101 is carried out as follows:
Step S201: for the software under test S, obtain its privacy policy text;
Step S202: split the privacy policy text into sentences at punctuation marks, and store the resulting independent sentences in file A;
Step S203: build a verb list based on the frequency of words that describe data collection or use actions in privacy policy texts, match the sentences in file A against the verb list, and select the privacy policy sentences W related to data behaviors. The verbs include, for example, "collect" and "use".
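Steps S202 and S203 amount to sentence splitting followed by a verb filter. The sketch below shows one way this could look in Python; the verb list `DATA_VERBS`, the function names, and the suffix rule are illustrative assumptions, since the patent derives its verb list from word frequencies and names only "collect" and "use".

```python
import re

# Hypothetical verb list; only "collect" and "use" come from the text.
DATA_VERBS = ("collect", "use", "using")

def split_sentences(policy_text):
    """Step S202: split after sentence-ending punctuation (Latin or CJK)."""
    parts = re.split(r"(?<=[.!?;。！？；])\s*", policy_text)
    return [p.strip() for p in parts if p.strip()]

def filter_data_sentences(sentences, verbs=DATA_VERBS):
    """Step S203: keep sentences that contain a listed verb or a simple inflection."""
    stems = "|".join(re.escape(v) for v in verbs)
    pattern = re.compile(rf"\b(?:{stems})(?:s|d|ed|ing)?\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


text = (
    "We value your privacy. We may collect your location information! "
    "Contact us by email. Your device ID is used for advertising."
)
for sentence in filter_data_sentences(split_sentences(text)):
    print(sentence)
```

The trailing word boundary keeps "user" from matching the verb "use". Splitting on every period is crude (it also splits version numbers and abbreviations), so a real pipeline would likely need a proper sentence tokenizer.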
Further, step S102 is carried out as follows:
The extraction rules for data collection and data usage triplets are sent to the large language model, together with an example template for the model to learn from. The large language model then generates the data collection triplets dc and data usage triplets du from the privacy policy sentences W related to data behaviors. When a sentence involves several data objects, the result is split into multiple triplets that each contain exactly one data object.
The extraction rules for the data collection triplets dc and data usage triplets du are formulated as follows:
The permitted values are defined as follows: data receiver r is application provider or external partner; c is collected or not collected; k is used or not used; and use purpose p is providing basic services, providing personalized services, security protection, providing advertisements, or personalized advertisements;
An example: for the sentence "If you use the real-time weather update function, we may collect your location information and device information while your device is in a silent state, in order to update the weather for your location in a timely manner", the corresponding data collection triplets are dc1 = (application provider, collected, location information) and dc2 = (application provider, collected, device information), and the data usage triplets are du1 = (location information, used, providing basic services) and du2 = (device information, used, providing basic services).
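The triplet definitions and the one-object-per-triplet rule can be made concrete with a short Python sketch. Everything about the exchange format is an assumption: the patent does not say how the rules are worded or what the large language model returns, so the JSON reply schema, `RULES`, `build_prompt`, and `parse_reply` are hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class CollectionTriplet:      # dc = (r, c, d)
    receiver: str
    collected: bool
    data_object: str

@dataclass(frozen=True)
class UsageTriplet:           # du = (d, k, p)
    data_object: str
    used: bool
    purpose: str

# Hypothetical wording of the extraction rules sent to the model.
RULES = (
    "Extract data collection and data usage statements as JSON. "
    "receiver: application provider | external partner. "
    "purpose: providing basic services | providing personalized services | "
    "security protection | providing advertisements | personalized advertisements."
)

def build_prompt(sentence):
    """Combine the rules with one privacy policy sentence."""
    return f"{RULES}\nSentence: {sentence}\nJSON:"

def parse_reply(reply):
    """Split entries that list several data objects into one triplet per object."""
    data = json.loads(reply)
    dc = [CollectionTriplet(t["receiver"], t["collected"], obj)
          for t in data["collection"] for obj in t["data_objects"]]
    du = [UsageTriplet(obj, t["used"], t["purpose"])
          for t in data["usage"] for obj in t["data_objects"]]
    return dc, du


# A canned reply standing in for the model output on the weather example.
reply = json.dumps({
    "collection": [{"receiver": "application provider", "collected": True,
                    "data_objects": ["location information", "device information"]}],
    "usage": [{"used": True, "purpose": "providing basic services",
               "data_objects": ["location information", "device information"]}],
})
dc, du = parse_reply(reply)
print(dc)
print(du)
```

Frozen dataclasses make the triplets hashable, which is convenient when later steps put them into sets for comparison.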
Further, FIG. 2 is a flowchart of the method for determining the consistency of the data processing rules within the privacy policy text of a mobile app. Step S103 is carried out as follows:
Step S401: send the data collection triplets dc to the large language model and detect whether data collection conflicts exist. If one data collection triplet states that data receiver r1 collects data object d1 and another states that r1 does not collect d1, the two form a first conflict. If one data collection triplet states that data receiver r2 collects data object d2 and another states that r2 does not collect data object d3, where d3 includes d2, the two form a second conflict. If at least one first or second conflict exists, determine that the data collection rules within the privacy policy text of the software under test S are inconsistent.
For example, given dc1 = (application provider, collected, AndroidID) and dc2 = (application provider, not collected, device information), where device information includes AndroidID among other items, a second conflict exists between dc1 and dc2;
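The patent delegates conflict detection to the large language model. As a rule-based stand-in that makes the first and second conflict definitions explicit, the following Python sketch checks pairs of collection triplets against a small containment table. The table `CONTAINS`, the tuple layout, and the function names are assumptions for illustration, not the patented mechanism.

```python
from itertools import combinations

# Hypothetical containment table: broad category -> narrower data objects.
CONTAINS = {"device information": {"AndroidID", "IMEI"}}

def includes(broad, narrow):
    """True if the broad data object covers the narrow one."""
    return narrow in CONTAINS.get(broad, set())

def collection_conflicts(triplets):
    """triplets: (receiver, collected: bool, data_object) tuples."""
    found = []
    for a, b in combinations(triplets, 2):
        if a[0] != b[0] or a[1] == b[1]:
            continue  # different receivers, or both collect / both do not
        pos, neg = (a, b) if a[1] else (b, a)
        if pos[2] == neg[2]:
            found.append(("first", pos, neg))    # same object, opposite claims
        elif includes(neg[2], pos[2]):
            found.append(("second", pos, neg))   # denied category covers object
    return found


dc1 = ("application provider", True, "AndroidID")
dc2 = ("application provider", False, "device information")
print(collection_conflicts([dc1, dc2]))
```

The same pairwise check applies to data usage triplets (third and fourth conflicts) if the purpose takes the place of the receiver as the field that must match.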
Step S402: send the data usage triplets du to the large language model and detect whether data usage conflicts exist. If one data usage triplet states that data object d4 is used for use purpose p1 and another states that d4 is not used for p1, the two form a third conflict. If one data usage triplet states that data object d5 is used for use purpose p2 and another states that data object d6 is not used for p2, where d6 includes d5, the two form a fourth conflict. If at least one third or fourth conflict exists, determine that the data usage rules within the privacy policy text of the software under test S are inconsistent.
For example, given du1 = (AndroidID, used, providing personalized services) and du2 = (device information, not used, providing personalized services), where device information includes AndroidID among other items, a fourth conflict exists between du1 and du2.
Step S403: compare all collected data objects in the data collection triplets dc with all used data objects in the data usage triplets du. If the data usage triplets use a data object that does not appear in the data collection triplets, an overuse data type conflict exists, and the privacy policy text of the software under test S is determined to be inconsistent with respect to overused data types.
For example, if none of the collected data objects in the data collection triplets dc is "AndroidID" while the used data objects in the data usage triplets du include "device information" or "AndroidID", then the data usage triplets use a data object not declared in the data collection triplets, and an overuse data type conflict exists.
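The overuse check in step S403 reduces to a set difference between used and collected data objects. The Python sketch below is one possible reading; the tuple layouts and the optional `contains` table (which lets a collected broad category cover narrower objects) are assumptions that go beyond what the text states.

```python
def overuse_conflicts(dc, du, contains=None):
    """Return used data objects that no collection triplet declares.

    dc: (receiver, collected: bool, data_object) tuples
    du: (data_object, used: bool, purpose) tuples
    contains: optional map of broad category -> narrower objects (assumed)
    """
    contains = contains or {}
    collected = {d for (_r, c, d) in dc if c}
    covered = set(collected)
    for broad in collected:
        covered |= contains.get(broad, set())
    return sorted({d for (d, k, _p) in du if k and d not in covered})


dc = [("application provider", True, "location information")]
du = [
    ("location information", True, "providing basic services"),
    ("AndroidID", True, "personalized advertisements"),
]
print(overuse_conflicts(dc, du))
```

An empty result means every used data object was also declared as collected; any returned object corresponds to an overuse data type conflict.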
Further, step S105 is carried out as follows:
Step S501: the test input generator simulates user clicks on buttons in the screen interface of the software under test S by clicking at random, records the result of each click, and constructs a UI transition graph (UTG) that includes the clicked buttons, the interface elements, and the operations performed;
Step S502: the test input generator traverses all UI elements in the UI transition graph and records the option information;
Step S503: select a task from task list L, convert the UI state and operations into an HTML format carrying structured information, and send the task, the current UI state description, and the task-related option information to the large language model, which returns the next operation instruction based on this input;
Step S504: the test input generator parses the operation instruction and executes the corresponding action. After execution, the task, the current UI state description, the history of actions taken for the task, and the task-related option information are sent to the large language model, and step S503 is repeated until the large language model returns a task-complete instruction;
Step S505: capture the network traffic packets generated during operation using the network analysis tool.
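The interaction between steps S503 and S504 is a loop in which the model proposes an action, the test input generator executes it, and the updated state and history are sent back. A minimal Python sketch follows, with the model and the UI driver passed in as plain callables. The function `run_task`, the prompt wording, the "DONE" sentinel, and the step budget are all assumptions; the patent does not name a specific model API or test input generator.

```python
def run_task(task, get_ui_html, ask_llm, perform, max_steps=20):
    """Drive one task to completion; returns (completed, action_history)."""
    history = []
    for _ in range(max_steps):
        # S503: send task, current UI state, and action history to the model.
        prompt = (
            f"Task: {task}\n"
            f"Current UI (HTML):\n{get_ui_html()}\n"
            f"Actions so far: {history}\n"
            "Reply with the next action, or DONE if the task is complete."
        )
        instruction = ask_llm(prompt).strip()
        if instruction == "DONE":
            return True, history
        perform(instruction)          # S504: execute the parsed action
        history.append(instruction)
    return False, history             # step budget exhausted


# Scripted stand-ins so the sketch runs without a device or a model.
scripted = iter(["tap:Weather", "tap:Enable live updates", "DONE"])
performed = []
completed, actions = run_task(
    "Enable real-time weather updates",
    get_ui_html=lambda: "<button id='1'>Weather</button>",
    ask_llm=lambda prompt: next(scripted),
    perform=performed.append,
)
print(completed, actions)
```

The step budget is a safeguard added in this sketch so that a model that never signals completion cannot loop forever; traffic capture (step S505) would run alongside this loop in a separate tool.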
Further, step S106 is carried out as follows:
Step S601: use the network analysis tool to analyze the network traffic packets captured in step S505, identify and extract the structured data in the packets, parse the identified structured data formats, and extract the data in key-value form to generate key-value pairs;
For example: in a user identity information request whose URL is https://api.example.com/user/profile?user_id=123456&email=user@example.com, the generated key-value pairs are user_id: 123456 and email: user@example.com. In a POST request that registers a new device, with Endpoint: /device/register and Request Body: {"device_id": "abcdef123456", "os_version": "Android 11", "device_model": "Samsung Galaxy S21"}, the generated key-value pairs are device_id: abcdef123456, os_version: Android 11, and device_model: Samsung Galaxy S21;
Step S602: match the key-value pairs obtained in step S601 against preset information strings, extract the pairs that match, and record the key value of each matched pair as a data object d. The preset information strings cover the user's personal identity information, device identifiers, geographic location information, payment information, and the like, for example "user_id", "IMEI", and "ip_address";
Step S603: obtain the data receiver r and the use purpose p from the destination URL, the transmitted data, and the application package name in the network traffic, and generate the data flow triplets df.
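Steps S601 to S603 can be sketched in Python with the standard library alone. The sketch parses query strings and JSON bodies into key-value pairs, filters them against a preset list, and classifies the receiver by destination host. The preset list `SENSITIVE_KEYS` (beyond the three strings named in the text), the first-party domain rule, and all function names are assumptions; how the purpose p is inferred is not specified in the text and is not sketched here.

```python
import json
from urllib.parse import urlsplit, parse_qsl

# "user_id", "IMEI", "ip_address" come from the text; "device_id" is assumed.
SENSITIVE_KEYS = {"user_id", "imei", "ip_address", "device_id"}

def extract_pairs(url, body=None):
    """Step S601: collect key-value pairs from the query string and JSON body."""
    pairs = dict(parse_qsl(urlsplit(url).query))
    if body:
        try:
            parsed = json.loads(body)
        except json.JSONDecodeError:
            parsed = None  # not JSON; other formats are out of scope here
        if isinstance(parsed, dict):
            pairs.update({k: str(v) for k, v in parsed.items()})
    return pairs

def match_sensitive(pairs):
    """Step S602: keep pairs whose key matches a preset information string."""
    return {k: v for k, v in pairs.items() if k.lower() in SENSITIVE_KEYS}

def receiver_for(url, first_party_domains):
    """Step S603 (receiver only): classify the destination host."""
    host = urlsplit(url).hostname or ""
    first = any(host == d or host.endswith("." + d) for d in first_party_domains)
    return "application provider" if first else "external partner"


url = "https://api.example.com/user/profile?user_id=123456&email=user@example.com"
body = '{"device_id": "abcdef123456", "os_version": "Android 11"}'
print(match_sensitive(extract_pairs(url, body)))
print(receiver_for(url, {"example.com"}))
```

Matching on key names alone will miss identifiers sent under obfuscated keys or inside encoded payloads, so this should be read as the simplest case of the matching step.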
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the present invention.
Claims (6)
1. A method for analyzing the consistency of APP privacy data use purposes based on a large language model, characterized by comprising the following steps:
Step S101, for the software S to be tested, acquiring its privacy policy text, preprocessing the privacy policy text, and obtaining the privacy policy sentences W related to data behaviors;
Step S102, defining the data collection and data usage triplet extraction rules, in which the data collection triplet $dc = (r, c, d)$ represents the collection of data object d by data receiver r, with c indicating whether the data is collected, and the data usage triplet $du = (d, k, p)$ represents whether data object d is used for usage purpose p, with k indicating whether it is used; and using a large language model to generate the data collection triplets dc and the data usage triplets du from the privacy policy sentences W related to data behaviors;
Step S103, using the large language model to detect whether conflicts exist among the data collection triplets dc or among the data usage triplets du, and if so, determining that the data processing rules within the privacy policy text of the software S to be tested are inconsistent;
Step S104, for each privacy policy sentence W related to data behaviors, using the large language model to generate tasks capable of triggering the corresponding data processing behavior, and recording the generated task list as L;
Step S105, using a test input generator to simulate user clicks on the mobile APP interface: the tasks in the task list L are input to the large language model one by one, the test input generator parses the operation instruction output by the large language model and executes the corresponding action, and this is repeated until the corresponding task is completed in the software S to be tested, while a network analysis tool captures the network data traffic generated during the operation;
Step S106, extracting data flow triplets df from the network data traffic, in which the data flow triplet $df = (r, d, p)$ represents the actual behavior in which data receiver r collects data object d and uses it for usage purpose p;
Step S107, comparing the data collection triplets dc and the data usage triplets du obtained in step S102 with the data flow triplets df obtained in step S106: if the collection of data object d by data receiver r in a data flow triplet df does not appear in the data collection triplets dc, determining that the privacy data collection behavior of the software S to be tested is inconsistent with its privacy policy text; if the use of data object d for usage purpose p in a data flow triplet df does not appear in the data usage triplets du, determining that the privacy data usage purpose of the software S to be tested is inconsistent with its privacy policy text.
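The comparison in step S107 can be illustrated with a short sketch. The tuple layouts (r, c, d), (d, k, p) and (r, d, p), the function name `find_inconsistencies`, and the use of exact string equality are assumptions made for illustration; a practical system would likely need semantic matching of receivers, data objects and purposes.

```python
def find_inconsistencies(dc, du, df):
    """Compare declared triplets (dc, du) with observed flow triplets (df).

    dc: iterable of (r, c, d), c is True when the policy says r collects d.
    du: iterable of (d, k, p), k is True when the policy says d is used for p.
    df: iterable of (r, d, p) observed in network traffic.
    """
    # Only positive statements count as declared behavior (assumption).
    declared_collection = {(r, d) for r, c, d in dc if c}
    declared_usage = {(d, p) for d, k, p in du if k}

    findings = []
    for r, d, p in df:
        if (r, d) not in declared_collection:
            findings.append(("collection", r, d, p))  # undeclared collection
        if (d, p) not in declared_usage:
            findings.append(("purpose", r, d, p))  # undeclared usage purpose
    return findings


dc = [("api.example.com", True, "email")]
du = [("email", True, "account management")]
df = [("api.example.com", "email", "account management"),
      ("ads.example.net", "device_id", "advertising")]
print(find_inconsistencies(dc, du, df))
```

In this example the first flow matches the declared triplets, while the second is reported twice: once for undeclared collection and once for an undeclared usage purpose.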
2. The method for analyzing the consistency of APP privacy data use purposes based on a large language model according to claim 1, wherein step S101 specifically comprises:
Step S201, for the software S to be tested, acquiring its privacy policy text;
Step S202, splitting the privacy policy text into sentences at punctuation marks, and storing the resulting independent sentences in a file A;
Step S203, creating a verb vocabulary list according to the frequency of words in the privacy policy text that denote data collection or usage actions, matching the sentences in file A against the verb vocabulary list, and screening out the privacy policy sentences W related to data behaviors.
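Steps S202 and S203 can be sketched as follows. The verb list `DATA_VERBS` is a hand-picked assumption (the claim builds it from word frequencies, which is not reproduced here), and the prefix match is deliberately crude: it would also match nouns such as "user".

```python
import re

# Hypothetical verb list standing in for the frequency-derived vocabulary.
DATA_VERBS = {"collect", "use", "share", "store", "process", "obtain"}


def split_sentences(text: str) -> list:
    """S202: split at sentence-ending punctuation (ASCII or full-width)."""
    parts = re.split(r"(?<=[.!?;])\s+|(?<=[。！？；])", text)
    return [part.strip() for part in parts if part and part.strip()]


def filter_data_sentences(sentences, verbs=DATA_VERBS) -> list:
    """S203: keep sentences containing a word that starts with a listed verb."""
    alternatives = "|".join(re.escape(verb) for verb in sorted(verbs))
    pattern = re.compile(rf"\b(?:{alternatives})\w*\b", re.IGNORECASE)
    return [sentence for sentence in sentences if pattern.search(sentence)]


policy = ("We collect your email address. We do not sell data! "
          "Contact us anytime.")
sentences = split_sentences(policy)
print(filter_data_sentences(sentences))
```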
3. The method for analyzing the consistency of APP privacy data use purposes based on a large language model according to claim 1, wherein step S102 specifically comprises: sending the data collection and data usage triplet extraction rules to the large language model together with an example template for the large language model to learn from; the large language model then generates the data collection triplets dc and the data usage triplets du from the privacy policy sentences W related to data behaviors, and when a sentence involves multiple data objects, the data collection triplet dc and the data usage triplet du are split into multiple data processing tuples, each containing only one data object.
4. The method for analyzing the consistency of APP privacy data use purposes based on a large language model according to claim 1, wherein step S103 specifically comprises:
Step S401, sending the data collection triplets dc to the large language model and detecting whether data collection behavior conflicts exist: if one data collection triplet dc states that data receiver r1 collects data object d1 and another states that data receiver r1 does not collect data object d1, the two constitute a first conflict; if one data collection triplet dc states that data receiver r2 collects data object d2 and another states that data receiver r2 does not collect data object d3, and d3 includes d2, the two constitute a second conflict; if at least one of the first conflict and the second conflict exists, determining that the data collection rules within the privacy policy text of the software S to be tested are inconsistent;
Step S402, sending the data usage triplets du to the large language model and detecting whether data usage behavior conflicts exist: if one data usage triplet du states that data object d4 is used for usage purpose p1 and another states that data object d4 is not used for usage purpose p1, the two constitute a third conflict; if one data usage triplet du states that data object d5 is used for usage purpose p2 and another states that data object d6 is not used for usage purpose p2, and d6 includes d5, the two constitute a fourth conflict; if at least one of the third conflict and the fourth conflict exists, determining that the data usage rules within the privacy policy text W of the software S to be tested are inconsistent;
Step S403, comparing all collected data objects in the data collection triplets dc with all used data objects in the data usage triplets du: if a data usage triplet du uses a data object that does not appear in the data collection triplets dc, an out-of-scope data type usage conflict is deemed to exist, and in that case it is determined that the privacy policy text W of the software S to be tested is inconsistent with respect to out-of-scope use of data types.
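The conflict definitions in steps S401 and S403 can be made concrete with a rule-based sketch. Note that the claim delegates conflict detection to the large language model; the code below swaps in explicit rules purely for illustration. The `hierarchy` mapping (which broader data object includes which narrower ones), the tuple layouts, and all function names are assumptions. The third and fourth conflicts of step S402 follow the same pattern with (d, k, p) triplets.

```python
def includes(broad: str, narrow: str, hierarchy: dict) -> bool:
    """True if `broad` equals `narrow` or lists it as an included data object."""
    return broad == narrow or narrow in hierarchy.get(broad, set())


def collection_conflicts(dc, hierarchy):
    """S401: find first and second conflicts among (r, c, d) triplets."""
    positives = [(r, d) for r, c, d in dc if c]
    negatives = [(r, d) for r, c, d in dc if not c]
    conflicts = []
    for r_pos, d_pos in positives:
        for r_neg, d_neg in negatives:
            if r_pos != r_neg:
                continue  # conflicts are defined for the same data receiver
            if d_pos == d_neg:
                conflicts.append(("first", r_pos, d_pos, d_neg))
            elif includes(d_neg, d_pos, hierarchy):
                conflicts.append(("second", r_pos, d_pos, d_neg))
    return conflicts


def out_of_scope_usage(dc, du):
    """S403: data objects that are used but never declared as collected."""
    collected = {d for _, c, d in dc if c}
    return sorted({d for d, k, _ in du if k and d not in collected})


hierarchy = {"personal information": {"email"}}  # hypothetical ontology
dc = [("we", True, "email"), ("we", False, "email"),
      ("we", False, "personal information")]
du = [("location", True, "advertising"), ("email", True, "login")]
print(collection_conflicts(dc, hierarchy))
print(out_of_scope_usage(dc, du))
```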
5. The method for analyzing the consistency of APP privacy data use purposes based on a large language model according to claim 1, wherein step S105 specifically comprises:
Step S501, the test input generator simulates a user's clicks on the buttons of the screen interface of the software S to be tested by clicking at random, records the result of each click, and constructs a UI transition graph (UTG) that includes the clicked buttons, the interface elements, and the executed operations;
Step S502, the test input generator traverses all UI elements in the UI transition graph and records the option information;
Step S503, selecting a task from the task list L, converting the UI state and operations into an HTML format carrying structured information, sending the task, the current UI state description, and the option information related to the task to the large language model, and having the large language model give the next operation instruction according to this input;
Step S504, the test input generator parses the operation instruction and executes the corresponding action; after execution, the task, the current UI state description, the history of actions performed for the task, and the option information related to the task are sent to the large language model, and step S503 is repeated until the large language model returns a task completion instruction;
Step S505, the network data traffic packets generated during the operation are captured using the network analysis tool.
6. The method for analyzing the consistency of APP privacy data use purposes based on a large language model according to claim 5, wherein step S106 specifically comprises:
Step S601, using a network analysis tool, parsing the network data traffic packets captured in step S505, identifying and extracting the structured data in the packets, parsing the format of the identified structured data, and extracting the data in key-value form to generate key-value pairs;
Step S602: matching the key-value pairs obtained in step S601 against the preset information strings, extracting the successfully matched key-value pairs, and recording the key value in each matched pair as a data object d;
Step S603: obtaining the data receiver r and the usage purpose p from the destination URL, the transmitted data, and the application package name in the network data traffic, so as to generate a data flow triplet df.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410291322.XA CN117896732B (en) | 2024-03-14 | 2024-03-14 | APP privacy data use purpose consistency analysis method based on large language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117896732A (en) | 2024-04-16 |
CN117896732B (en) | 2024-05-28 |
Family
ID=90643082
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202410291322.XA | Active | CN117896732B (en) | 2024-03-14 | 2024-03-14 | APP privacy data use purpose consistency analysis method based on large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117896732B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||