CN117896732B

CN117896732B - A method for analyzing the consistency of APP privacy data usage purposes based on a large language model

Info

Publication number: CN117896732B
Application number: CN202410291322.XA
Authority: CN
Inventors: 张伟; 徐天辰; 陈云芳
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2024-03-14
Filing date: 2024-03-14
Publication date: 2024-05-28
Anticipated expiration: 2044-03-14
Also published as: CN117896732A

Abstract

The invention discloses a method, based on a large language model, for analyzing whether an APP's use of privacy data is consistent with its stated purposes. The method comprises the following steps: performing sentence-level analysis of the privacy policy text with a large language model to generate data collection triples and data usage triples, and checking whether the triples conflict with one another, so as to detect whether the data processing rules in the privacy policy text are internally consistent; using the large language model to generate specific tasks capable of triggering data processing behaviors, combining the large language model with a test input generator to complete the tasks automatically, capturing the network traffic generated during operation with a network analysis tool, analyzing the purposes for which the data is used, and extracting data flow triples; and comparing the data collection triples and data usage triples with the data flow triples to determine whether the mobile APP's actual use of privacy data is consistent with its privacy policy text.

Description

A method for analyzing the consistency of APP privacy data usage purposes based on a large language model

Technical Field

The present invention belongs to the technical field of privacy data security, and in particular relates to a method for analyzing the consistency of APP privacy data usage purposes based on a large language model.

Background

Mobile devices have entered every aspect of life, and mobile applications of all kinds have become an inseparable part of people's daily life, work, and travel. However, as mobile applications take on more and more functions, privacy leakage has become more serious; personal information leaks are commonplace, and privacy protection is a problem that urgently needs to be solved.

To protect the data security of mobile users, previous work has focused mainly on the types of privacy data that mobile applications obtain, and there has been little research on the purposes for which privacy data is used. Analysis of mobile applications' privacy policy texts and actual behavior mainly faces the following problems:

1) Most privacy policy texts are written by hand, and their writing styles and modes of expression differ from one mobile application to another, so automating their analysis with traditional natural language processing techniques is complex and difficult;

2) Privacy policy texts can contradict themselves. One part of a privacy policy text may state that a certain type of privacy data will not be collected, while another part states that the same type of data needs to be collected for a certain function, leaving it unclear whether the application has the right to collect that data;

3) Existing analysis of actual application behavior generally uses a test input generator that randomly clicks options in the mobile application to trigger its collection of privacy data, but this approach may not cover all privacy data collection behaviors;

4) Although the privacy policy text discloses to users the purposes for which privacy data is collected, the application's actual use of data does not always conform to those purposes, and little related work focuses on the purposes of privacy data use.

Summary of the Invention

Purpose of the invention: to solve the above technical problems, the present invention proposes a method for analyzing the consistency of APP privacy data usage purposes based on a large language model, which improves the efficiency and accuracy of the analysis.

Technical solution: to achieve the above purpose, the present invention proposes a method for analyzing the consistency of APP privacy data usage purposes based on a large language model, the method comprising the following steps:

Step S101, for the software under test S, obtain its privacy policy text and preprocess the text to obtain the privacy policy sentences W that relate to data behavior;

Step S102, define extraction rules for data collection triples and data usage triples: a data collection triple dc = (r, c, d) represents the collection of data object d by data receiver r, where c indicates whether d is collected; a data usage triple du = (d, k, p) represents whether data object d is used for usage purpose p, where k indicates whether it is used; use a large language model to generate data collection triples dc and data usage triples du from the data-behavior-related privacy policy sentences W;

Step S103, use the large language model to detect whether the data collection triples dc or the data usage triples du conflict; if a conflict exists, it is determined that the data processing rules within the privacy policy text of the software under test S are inconsistent;

Step S104, for each data-behavior-related privacy policy sentence W, use the large language model to generate tasks capable of triggering the data processing behavior, and record the generated task list as L;

Step S105, use a test input generator to simulate a user clicking on the mobile APP interface: input the tasks in the task list L to the large language model one by one; the test input generator parses each operation instruction output by the large language model and executes the corresponding action, repeating this loop until the task is completed in the software under test S; use a network analysis tool to capture the network traffic generated during operation;

Step S106, extract data flow triples df = (r, d, p) from the network traffic, each indicating that, in actual behavior, data receiver r collects data object d and uses it for usage purpose p;

Step S107, compare the data collection triples dc and data usage triples du obtained in step S102 with the data flow triples df obtained in step S106. If a data flow triple df shows data receiver r collecting data object d and this behavior does not appear in the data collection triples dc, it is determined that the privacy data collection behavior of the software under test S is inconsistent with its privacy policy text; if a data flow triple df shows data object d being used for usage purpose p and this behavior does not appear in the data usage triples du, it is determined that the privacy data usage purpose of the software under test S is inconsistent with its privacy policy text.
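The comparison in step S107 can be illustrated with a minimal Python sketch. It is not the patented implementation: the field order of each triple, the class and function names, and the use of exact string equality to decide whether a behavior "appears" in the policy are all assumptions made for illustration.

```python
from typing import NamedTuple


class Collection(NamedTuple):  # dc = (r, c, d), field order is an assumption
    receiver: str
    collected: bool
    data_object: str


class Usage(NamedTuple):  # du = (d, k, p), field order is an assumption
    data_object: str
    used: bool
    purpose: str


class Flow(NamedTuple):  # df = (r, d, p), field order is an assumption
    receiver: str
    data_object: str
    purpose: str


def compare_with_policy(dc_list, du_list, df_list):
    """Return (kind, flow) pairs for observed flows the policy does not declare."""
    # Only positive statements ("collect", "used for") count as declarations.
    declared_collection = {(t.receiver, t.data_object) for t in dc_list if t.collected}
    declared_usage = {(t.data_object, t.purpose) for t in du_list if t.used}
    findings = []
    for flow in df_list:
        if (flow.receiver, flow.data_object) not in declared_collection:
            findings.append(("collection inconsistent", flow))
        if (flow.data_object, flow.purpose) not in declared_usage:
            findings.append(("usage purpose inconsistent", flow))
    return findings


# Hypothetical policy triples and observed flows.
dc = [Collection("application provider", True, "location information")]
du = [Usage("location information", True, "provide basic services")]
df = [
    Flow("application provider", "location information", "provide basic services"),
    Flow("external partner", "location information", "provide advertising"),
]
findings = compare_with_policy(dc, du, df)
```

In this hypothetical run, the first flow is covered by the policy, while the second is reported twice: once because the external partner is not declared as a receiver of location information, and once because advertising is not a declared purpose for it.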

Furthermore, the specific method of step S101 is as follows:

Step S201, for the software under test S, obtain its privacy policy text;

Step S202, split the privacy policy text into sentences according to punctuation marks, and save the resulting independent sentences to file A;

Step S203, create a verb list based on the frequency of words that denote data collection or use actions in privacy policy texts, match the sentences in file A against the verb list, and select the data-behavior-related privacy policy sentences W. The verbs include, for example, "collect" and "use".
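Steps S202 and S203 can be sketched in a few lines of Python. This is an illustrative sketch only: the verb list below is a hand-written English stand-in (the method derives its list from word frequencies), an in-memory list stands in for file A, and the function names are hypothetical.

```python
import re

# Hypothetical verb list; step S203 derives the real one from word frequencies.
DATA_VERBS = {"collect", "collects", "collected", "collecting",
              "use", "uses", "used", "using"}


def split_sentences(policy_text):
    """Step S202: split after sentence-ending punctuation (Western or Chinese)."""
    parts = re.split(r"(?<=[.!?;。！？；])\s*", policy_text)
    return [p.strip() for p in parts if p.strip()]


def select_data_sentences(sentences, verbs=DATA_VERBS):
    """Step S203: keep sentences containing at least one listed verb as a whole word."""
    selected = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        if any(word in verbs for word in words):
            selected.append(sentence)
    return selected


policy = (
    "We value your privacy. We collect your location information to update "
    "the weather; your device information is used for security. "
    "Contact us by email!"
)
sentences = split_sentences(policy)   # stands in for file A
W = select_data_sentences(sentences)  # data-behavior-related sentences
```

Whole-word matching is used so that "user" or "us" does not count as the verb "use"; a list built for Chinese policy text would instead match verbs such as those named in step S203 as substrings.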

Furthermore, the specific method of step S102 is as follows:

Send the extraction rules for data collection triples and data usage triples to the large language model, together with example templates for the model to learn from. The large language model then generates data collection triples dc and data usage triples du from the data-behavior-related privacy policy sentences W. When a sentence involves multiple data objects, it is split into multiple data processing triples that each contain only one data object. The permitted values are defined as follows: data receiver r is application provider / external partner; whether to collect c is collect / not collect; whether to use k is used for / not used for; usage purpose p is provide basic services / provide personalized services / security protection / provide advertising / personalized advertising.

An example is as follows: "If you use the real-time weather update function, in order to update the weather at your location in a timely manner, we will collect your location information and device information when your device is in silent mode". The corresponding data collection triples are dc1 = (first-party application provider, collect, location information) and dc2 = (application provider, collect, device information), and the corresponding data usage triples are du1 = (location information, used for, provide basic services) and du2 = (device information, used for, provide basic services).
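One way to picture the prompting in step S102 is the following Python sketch, which builds a few-shot prompt from the rules and the worked example and parses a line-oriented reply. The prompt wording, the one-triple-per-line reply format, and the names `build_prompt` and `parse_triples` are assumptions for illustration; no specific model or API is implied, and the model call itself is omitted.

```python
import re

# Rules paraphrased from step S102; the exact prompt wording is an assumption.
RULES = (
    "Extract data collection triples dc=(r, c, d) and data usage triples du=(d, k, p).\n"
    "r: application provider | external partner\n"
    "c: collect | not collect\n"
    "k: used for | not used for\n"
    "p: provide basic services | provide personalized services | security protection | "
    "provide advertising | personalized advertising\n"
    "Emit one triple per line, with exactly one data object per triple."
)
EXAMPLE_SENTENCE = (
    "If you use the real-time weather update function, we will collect your "
    "location information and device information."
)
EXAMPLE_REPLY = (
    "dc=(application provider, collect, location information)\n"
    "dc=(application provider, collect, device information)\n"
    "du=(location information, used for, provide basic services)\n"
    "du=(device information, used for, provide basic services)"
)

TRIPLE_LINE = re.compile(r"^(dc|du)\s*=\s*\(([^,()]+),([^,()]+),([^,()]+)\)$")


def build_prompt(sentence):
    """Combine rules, one worked example, and the sentence to analyze."""
    return (
        f"{RULES}\n\nExample sentence: {EXAMPLE_SENTENCE}\n"
        f"Example output:\n{EXAMPLE_REPLY}\n\nSentence: {sentence}\nOutput:"
    )


def parse_triples(reply):
    """Parse 'dc=(...)' and 'du=(...)' lines; any other text is ignored."""
    dc, du = [], []
    for line in reply.splitlines():
        match = TRIPLE_LINE.match(line.strip())
        if match is None:
            continue
        kind, a, b, c = (part.strip() for part in match.groups())
        (dc if kind == "dc" else du).append((a, b, c))
    return dc, du


# Parsing the example reply stands in for parsing a real model response.
dc, du = parse_triples(EXAMPLE_REPLY)
```

Constraining the reply to one triple per line keeps parsing simple and makes the "one data object per triple" rule easy to check.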

Furthermore, the specific method of step S103 is as follows:

Step S401, send the data collection triples dc to the large language model and detect whether there is a data collection behavior conflict. If one data collection triple states that data receiver r1 collects data object d1 and another states that data receiver r1 does not collect data object d1, the two constitute a first conflict; if one data collection triple states that data receiver r2 collects data object d2 and another states that data receiver r2 does not collect data object d3, and d3 includes d2, the two constitute a second conflict; if at least one of the first conflict and the second conflict exists, it is determined that the data collection rules within the privacy policy text of the software under test S are inconsistent.

For example, given dc1 = (application provider, collect, AndroidID) and dc2 = (application provider, not collect, device information), where device information includes AndroidID among other information, there is a second conflict between dc1 and dc2.

Step S402, send the data usage triples du to the large language model and detect whether there is a data usage behavior conflict. If one data usage triple states that data object d4 is used for usage purpose p1 and another states that data object d4 is not used for usage purpose p1, the two constitute a third conflict; if one data usage triple states that data object d5 is used for usage purpose p2 and another states that data object d6 is not used for usage purpose p2, and d6 includes d5, the two constitute a fourth conflict; if at least one of the third conflict and the fourth conflict exists, it is determined that the data usage rules within the privacy policy text W of the software under test S are inconsistent.

For example, given du1 = (AndroidID, used for, provide personalized services) and du2 = (device information, not used for, provide personalized services), where device information includes AndroidID among other information, there is a fourth conflict between du1 and du2.

Step S403, compare all collected data objects in the data collection triples dc with all used data objects in the data usage triples du. If the data usage triples du use a data object that does not appear in the data collection triples dc, an out-of-bounds data type use conflict is considered to exist; if such a conflict exists, it is determined that the privacy policy text W of the software under test S is inconsistent with respect to out-of-bounds use of data types.

For example, if none of the collected data objects in the data collection triples dc is "AndroidID" while the used data objects in the data usage triples du include "device information" or "AndroidID", then the data usage triples du use a data object that is not declared in the data collection triples dc, and an out-of-bounds data type use conflict exists.
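The conflict definitions in steps S401 to S403 can be made concrete with a small rule-based Python sketch. Note that the method delegates this detection to the large language model; the deterministic checks below are a swapped-in illustration only. The `INCLUDES` hierarchy, the tuple layout, and the function names are hypothetical, and the out-of-bounds check uses exact name matching, following step S403 literally.

```python
# Hypothetical "includes" hierarchy: broad data object -> narrower data objects.
INCLUDES = {"device information": {"AndroidID"}}


def covers(broad, narrow):
    """True if `broad` equals `narrow` or is known to include it."""
    return broad == narrow or narrow in INCLUDES.get(broad, set())


def collection_conflicts(dc_list):
    """Step S401: first conflict (same object) or second conflict (inclusion)."""
    conflicts = []
    for r1, c1, d1 in dc_list:
        if c1 != "collect":
            continue
        for r2, c2, d2 in dc_list:
            if c2 == "not collect" and r1 == r2 and covers(d2, d1):
                kind = "first" if d1 == d2 else "second"
                conflicts.append((kind, (r1, c1, d1), (r2, c2, d2)))
    return conflicts


def usage_conflicts(du_list):
    """Step S402: third conflict (same object) or fourth conflict (inclusion)."""
    conflicts = []
    for d1, k1, p1 in du_list:
        if k1 != "used for":
            continue
        for d2, k2, p2 in du_list:
            if k2 == "not used for" and p1 == p2 and covers(d2, d1):
                kind = "third" if d1 == d2 else "fourth"
                conflicts.append((kind, (d1, k1, p1), (d2, k2, p2)))
    return conflicts


def out_of_bounds(dc_list, du_list):
    """Step S403: used data objects that no triple declares as collected."""
    collected = {d for _, c, d in dc_list if c == "collect"}
    return [d for d, k, _ in du_list if k == "used for" and d not in collected]


dc = [
    ("application provider", "collect", "AndroidID"),
    ("application provider", "not collect", "device information"),
]
du = [
    ("AndroidID", "used for", "provide personalized services"),
    ("device information", "not used for", "provide personalized services"),
]
dc_conflicts = collection_conflicts(dc)
du_conflicts = usage_conflicts(du)
unexpected = out_of_bounds(
    [("application provider", "collect", "location information")], du
)
```

A fixed hierarchy like `INCLUDES` only captures inclusion relations that someone has written down, which is one reason a language model may be better suited to judging whether one data object includes another.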

Furthermore, the specific method of step S105 is as follows:

Step S501, the test input generator simulates, through random clicks, a user clicking the buttons on the screen interface of the software under test S, records the result of each click, including the clicked button, the interface elements, and the operation performed, and constructs a UI transition graph (UTG);

Step S502, the test input generator traverses all UI elements in the UI transition graph and records the option information;

Step S503, select a task from the task list L, convert the UI state and operations into HTML format with structured information, and send the task, the description of the current UI state, and the task-related option information to the large language model, which returns the next operation instruction based on this input;

Step S504, the test input generator parses the operation instruction and executes the corresponding action. After execution, the task, the description of the current UI state, the history of actions taken for the task, and the task-related option information are sent to the large language model, and step S503 is repeated until the large language model returns a task completion instruction;

Step S505, use a network analysis tool to capture the network traffic packets generated during operation.
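The interaction loop of steps S503 and S504 might look roughly like the following Python sketch. Everything here is hypothetical scaffolding: `ask_llm`, `get_ui_html`, and `perform` are injected callables standing in for the large language model, the UI-to-HTML conversion, and the test input generator, and the `FINISH` token and `max_steps` safety cap are assumptions that the source does not specify. Traffic capture (step S505) is not shown.

```python
def run_task(task, ask_llm, get_ui_html, perform, options, max_steps=20):
    """Drive one task: ask for the next action, perform it, repeat until FINISH.

    Returns (history, finished). `max_steps` is an added safety cap so that a
    model which never reports completion cannot loop forever.
    """
    history = []
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\n"
            f"Current UI (HTML):\n{get_ui_html()}\n"
            f"Options: {options}\n"
            f"Actions so far: {history}\n"
            "Reply with the next action, or FINISH when the task is complete."
        )
        instruction = ask_llm(prompt).strip()
        if instruction == "FINISH":
            return history, True
        perform(instruction)  # the test input generator executes the action
        history.append(instruction)
    return history, False


# Simulated run with a scripted stand-in for the model and a recording "device".
scripted_replies = iter(["tap weather_tab", "tap refresh_button", "FINISH"])
performed = []
history, finished = run_task(
    task="Refresh the weather for the current location",
    ask_llm=lambda prompt: next(scripted_replies),
    get_ui_html=lambda: "<button id='weather_tab'>Weather</button>",
    perform=performed.append,
    options=["weather_tab", "refresh_button"],
)
```

Passing the action history back on every turn mirrors step S504, where the model sees what has already been done before choosing the next operation.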

Furthermore, the method of step S106 is as follows:

Step S601, use a network analysis tool to analyze the network traffic packets captured in step S505, identify and extract the structured data in the packets, parse the identified structured data formats, and extract the data in key-value form to generate key-value pairs;

For example, in a request for user identity information, the URL is https://api.example.com/user/profile?user_id=123456&email=user@example.com, and the generated key-value pairs are user_id: 123456 and email: user@example.com. In a POST request that registers a new device, with Endpoint: /device/register and Request Body: {"device_id": "abcdef123456", "os_version": "Android 11", "device_model": "Samsung Galaxy S21"}, the generated key-value pairs are device_id: abcdef123456, os_version: Android 11, and device_model: Samsung Galaxy S21;

Step S602: match preset information strings against the key-value pairs obtained in step S601, extract the key-value pairs that match, and record the key of each matched pair as data object d. The preset information strings cover the user's personal identity information, device identifiers, geographic location information, payment information, and so on, for example "user_id", "IMEI", and "ip_address";

Step S603: obtain data receiver r and usage purpose p from the destination URL, the sent data, and the application package name in the network traffic, so as to generate data flow triples df.
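Steps S601 and S602 can be illustrated with Python's standard library, using the two requests from the example above. This is a sketch under assumptions: the preset key list is a small hypothetical sample, only URL query strings and flat JSON bodies are handled, matching is case-insensitive on key names, and the derivation of receiver r and purpose p in step S603 is not shown.

```python
import json
from urllib.parse import parse_qsl, urlsplit

# Hypothetical sample of preset information strings (step S602).
PRESET_KEYS = {"user_id", "email", "device_id", "IMEI", "ip_address"}


def extract_pairs(url, body=None):
    """Step S601: collect key-value pairs from the query string and a JSON body."""
    pairs = dict(parse_qsl(urlsplit(url).query))
    if body:
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            payload = None  # not JSON; other formats are out of scope here
        if isinstance(payload, dict):
            pairs.update({key: str(value) for key, value in payload.items()})
    return pairs


def match_data_objects(pairs, preset=PRESET_KEYS):
    """Step S602: keys that match a preset string become data objects d."""
    lowered = {name.lower() for name in preset}
    return [key for key in pairs if key.lower() in lowered]


get_pairs = extract_pairs(
    "https://api.example.com/user/profile?user_id=123456&email=user@example.com"
)
post_pairs = extract_pairs(
    "https://api.example.com/device/register",
    body='{"device_id": "abcdef123456", "os_version": "Android 11", '
         '"device_model": "Samsung Galaxy S21"}',
)
data_objects = match_data_objects(post_pairs)
```

Here only `device_id` is reported for the POST request, because `os_version` and `device_model` are not in the sample preset list; a fuller list would decide which keys count as privacy data.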

Beneficial effects: compared with the prior art, the technical solution of the present invention has the following beneficial technical effects:

1) The present invention uses a large language model to analyze the privacy policy texts of mobile applications in different fields automatically, which improves efficiency and accuracy compared with manual review of privacy policy texts or analysis with earlier natural language processing techniques.

2) The present invention combines a large language model with a test input generator to trigger the software's data collection behaviors, which triggers them more completely than a test input generator clicking at random.

3) The present invention provides an approach for detecting whether the purposes for which a mobile application uses privacy data are consistent with those stated in its privacy policy text. It applies the large language model both to privacy policy text analysis and to dynamic software analysis, and is an application of emerging natural language processing technology in the field of software security.

Brief Description of the Drawings

FIG. 1 is an overall flow chart of the method of the present invention for analyzing the consistency of APP privacy data usage purposes based on a large language model.

FIG. 2 is a flow chart of the method of the present invention for determining the consistency of the data processing rules within a mobile APP's privacy policy text.

Detailed Description

As shown in FIG. 1, the present invention proposes a method for analyzing the consistency of APP privacy data usage purposes based on a large language model, the method comprising the following steps:

The extraction rules for data collection triples and data usage triples are sent to the large language model, together with example templates for the model to learn from. The large language model generates data collection triples dc and data usage triples du from the data-behavior-related privacy policy sentences W. When a sentence involves multiple data objects, it is split into multiple data processing triples that each contain only one data object.

The extraction rules for data collection triples dc and data usage triples du are formulated as follows:

The permitted values are defined as follows: data receiver r is application provider / external partner; whether to collect c is collect / not collect; whether to use k is used for / not used for; usage purpose p is provide basic services / provide personalized services / security protection / provide advertising / personalized advertising;

Further, FIG. 2 is a flow chart of the method for determining the consistency of the data processing rules within a mobile APP's privacy policy text; the specific method of step S103 is as follows:

For example, given dc1 = (application provider, collect, AndroidID) and dc2 = (application provider, not collect, device information), where device information includes AndroidID among other information, there is a second conflict between dc1 and dc2;

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for analyzing the consistency of APP privacy data usage purposes based on a large language model, characterized by comprising the following steps:

Step S101, for the software under test S, obtaining its privacy policy text and preprocessing the text to obtain the privacy policy sentences W that relate to data behavior;

Step S102, defining extraction rules for data collection triples and data usage triples, wherein a data collection triple dc = (r, c, d) represents the collection of data object d by data receiver r, c indicating whether d is collected, and a data usage triple du = (d, k, p) represents whether data object d is used for usage purpose p, k indicating whether it is used; and generating data collection triples dc and data usage triples du from the data-behavior-related privacy policy sentences W using a large language model;

Step S103, using the large language model to detect whether the data collection triples dc or the data usage triples du conflict, and if so, determining that the data processing rules within the privacy policy text of the software under test S are inconsistent;

Step S104, for each data-behavior-related privacy policy sentence W, using the large language model to generate tasks capable of triggering the data processing behavior, and recording the generated task list as L;

Step S105, using a test input generator to simulate a user clicking on the mobile APP interface, inputting the tasks in the task list L to the large language model one by one, the test input generator parsing each operation instruction output by the large language model and executing the corresponding action, repeating this loop until the task is completed in the software under test S, and capturing the network traffic generated during operation with a network analysis tool;

Step S106, extracting data flow triples df = (r, d, p) from the network traffic, each indicating that, in actual behavior, data receiver r collects data object d and uses it for usage purpose p;

Step S107, comparing the data collection triples dc and data usage triples du obtained in step S102 with the data flow triples df obtained in step S106; if a data flow triple df shows data receiver r collecting data object d and this behavior does not appear in the data collection triples dc, determining that the privacy data collection behavior of the software under test S is inconsistent with its privacy policy text; if a data flow triple df shows data object d being used for usage purpose p and this behavior does not appear in the data usage triples du, determining that the privacy data usage purpose of the software under test S is inconsistent with its privacy policy text.

2. The method for analyzing the consistency of APP privacy data usage purposes based on a large language model according to claim 1, wherein the specific method of step S101 is as follows:

Step S201, for the software under test S, obtaining its privacy policy text;

Step S202, splitting the privacy policy text into sentences according to punctuation marks, and saving the resulting independent sentences to file A;

Step S203, creating a verb list based on the frequency of words that denote data collection or use actions in privacy policy texts, matching the sentences in file A against the verb list, and selecting the data-behavior-related privacy policy sentences W.

3. The method for analyzing the consistency of APP privacy data usage purposes based on a large language model according to claim 1, wherein the specific method of step S102 is as follows: the extraction rules for data collection triples and data usage triples are sent to the large language model, together with example templates for the model to learn from; the large language model generates data collection triples dc and data usage triples du from the data-behavior-related privacy policy sentences W; and when a sentence involves multiple data objects, it is split into multiple data processing triples that each contain only one data object.

4. The method for analyzing the consistency of APP privacy data usage purposes based on a large language model according to claim 1, wherein the specific method of step S103 is as follows:

Step S401, sending the data collection triples dc to the large language model and detecting whether there is a data collection behavior conflict: if one data collection triple states that data receiver r1 collects data object d1 and another states that data receiver r1 does not collect data object d1, the two constitute a first conflict; if one data collection triple states that data receiver r2 collects data object d2 and another states that data receiver r2 does not collect data object d3, and d3 includes d2, the two constitute a second conflict; if at least one of the first conflict and the second conflict exists, determining that the data collection rules within the privacy policy text of the software under test S are inconsistent;

Step S402, sending the data usage triples du to the large language model and detecting whether there is a data usage behavior conflict: if one data usage triple states that data object d4 is used for usage purpose p1 and another states that data object d4 is not used for usage purpose p1, the two constitute a third conflict; if one data usage triple states that data object d5 is used for usage purpose p2 and another states that data object d6 is not used for usage purpose p2, and d6 includes d5, the two constitute a fourth conflict; if at least one of the third conflict and the fourth conflict exists, determining that the data usage rules within the privacy policy text W of the software under test S are inconsistent;

Step S403, comparing all collected data objects in the data collection triples dc with all used data objects in the data usage triples du; if the data usage triples du use a data object that does not appear in the data collection triples dc, an out-of-bounds data type use conflict is considered to exist; and if such a conflict exists, determining that the privacy policy text W of the software under test S is inconsistent with respect to out-of-bounds use of data types.

5. The method for analyzing the consistency of APP privacy data usage purposes based on a large language model according to claim 1, wherein the specific method of step S105 is as follows:

Step S501, the test input generator simulating, through random clicks, a user clicking the buttons on the screen interface of the software under test S, recording the result of each click, including the clicked button, the interface elements, and the operation performed, and constructing a UI transition graph (UTG);

Step S502, the test input generator traversing all UI elements in the UI transition graph and recording the option information;

Step S503, selecting a task from the task list L, converting the UI state and operations into HTML format with structured information, and sending the task, the description of the current UI state, and the task-related option information to the large language model, which returns the next operation instruction based on this input;

Step S504, the test input generator parsing the operation instruction and executing the corresponding action; after execution, sending the task, the description of the current UI state, the history of actions taken for the task, and the task-related option information to the large language model, and repeating step S503 until the large language model returns a task completion instruction;

Step S505, capturing the network traffic packets generated during operation using a network analysis tool.

6. The method for analyzing the consistency of APP privacy data usage purposes based on a large language model according to claim 5, wherein the method of step S106 is as follows:

Step S601, using a network analysis tool to analyze the network traffic packets captured in step S505, identifying and extracting the structured data in the packets, parsing the identified structured data formats, and extracting the data in key-value form to generate key-value pairs;

Step S602, matching preset information strings against the key-value pairs obtained in step S601, extracting the key-value pairs that match, and recording the key of each matched pair as data object d;

Step S603, obtaining data receiver r and usage purpose p from the destination URL, the sent data, and the application package name in the network traffic, so as to generate data flow triples df.