CN113282955B - Method, system, terminal and medium for extracting privacy information in privacy policy - Google Patents

Info

Publication number
CN113282955B
Authority
CN
China
Prior art keywords
privacy
data
privacy policy
policy
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110609050.XA
Other languages
Chinese (zh)
Other versions
CN113282955A (en
Inventor
朱浩瑾
魏程涌潇
陈哲轩
周路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202110609050.XA
Publication of CN113282955A
Application granted
Publication of CN113282955B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The invention provides a method and a system for extracting privacy information from a privacy policy based on natural language processing. First, the original privacy policy is converted into text, and the text is segmented into sentences. A pre-trained natural language processing model then performs part-of-speech tagging and named entity recognition on the sentences to obtain data objects. Finally, a synonym dictionary and fuzzy matching are used to normalize the data objects, which are then mapped to the corresponding privacy information categories, yielding the categories of privacy information that the privacy policy declares as collected. Users, application market platforms, or regulatory authorities can thereby learn what privacy information an application's privacy policy collects, which helps them make their next decision. A corresponding terminal and medium are also provided. The invention extracts the privacy information in a privacy policy using natural language processing technology without manual labeling, enabling more efficient, rapid, and flexible privacy analysis and meeting the needs of related industries.

Description

Method, system, terminal and medium for extracting privacy information in privacy policy
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method, a system, a terminal and a medium for extracting privacy information in a privacy policy based on natural language processing.
Background
With the development of the times, mobile internet applications have penetrated every aspect of people's daily lives. However, while providing convenience, mobile internet applications also collect a large amount of users' private information. To regulate this collection, relevant laws and policies have been enacted at home and abroad, requiring an application to provide a clear privacy policy that informs the user, before use, of which private information will be collected. A privacy policy is therefore often professional and precise in the manner of legal provisions, and consequently also long, complicated, and obscure. Users, application market platforms, and regulatory agencies alike must spend considerable manual effort to find the collected private information in a privacy policy. An efficient automatic privacy information extraction tool can help users learn which private information an application collects, and can help application market platforms and regulatory authorities investigate more efficiently.
The text of a privacy policy tends to be quite complex. On one hand, a privacy policy must conform to relevant regulations, so it is highly normative and professional and consists mainly of long sentences; on the other hand, since relevant regulations do not impose fixed requirements on how privacy information is described, the privacy policies of different applications often vary widely. Traditionally, determining which private information a privacy policy covers requires a professional auditor to review it manually. This approach is both labor intensive and time consuming.
In recent years, natural language processing techniques have evolved rapidly, particularly in large-scale text analysis, syntactic analysis, and named entity recognition. For privacy policy text, the key to extracting the privacy information whose collection is declared is to locate the sentences relevant to collection. Generally, when a privacy policy declares information collection, a pair of the form (collection-class or sharing-class behavior verb, data object) appears. Techniques such as named entity recognition and part-of-speech tagging therefore have great potential. However, the following technical problems must still be solved before natural language processing can be applied directly to extracting the privacy information collected under a privacy policy:
First, which words state the collection of private information?
Second, which terms denote private information?
Third, is the description of private information uniform? If not, how should it be normalized?
At present, no description or report of technology similar to the invention has been found, and no similar material has been found at home or abroad.
Disclosure of Invention
In view of the defects in the prior art, the invention provides a method, a device and a terminal for extracting privacy information in a privacy policy based on natural language processing, which apply natural language processing technology to automated privacy policy analysis.
According to one aspect of the invention, a method for extracting privacy information in a privacy policy is provided, which comprises the following steps:
acquiring the raw privacy policy data of applications, processing the raw privacy policy data in different formats to obtain privacy policy data in a general text format, and segmenting the obtained privacy policy data in the general text format into a plurality of individual sentences;
performing extended training on an existing natural language processing model using pre-extracted sentences that describe privacy information to obtain a language processing model for the privacy policy domain, performing part-of-speech tagging and named entity recognition on each sentence using the obtained model, and then screening out the sentences containing behavior verbs and data objects to obtain a set of (behavior verb, data object) pairs;
normalizing all data objects in the obtained set of pairs, establishing a general classification of privacy data, and mapping the normalized data objects to the corresponding classes to obtain the categories of privacy information that the analyzed privacy policy declares as collected, thereby extracting the privacy information in the privacy policy.
Preferably, acquiring the raw privacy policy data of applications comprises crawling privacy policy links from an application market with a crawler to obtain the raw privacy policy web page data of each application.
Preferably, the raw privacy policy web page data is in HTML format or PDF format.
Preferably, segmenting the obtained privacy policy data in the general text format into a plurality of individual sentences comprises: splitting the privacy policy data in the general text format into individual sentences at sentence-ending punctuation, using a general sentence segmentation method based on natural language processing technology.
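Splitting at sentence-ending punctuation can be sketched as follows. This is a minimal illustration, not the segmentation method of the invention: the punctuation set, the function name `split_sentences`, and the regular expression are assumptions, and a production segmenter would also need to handle abbreviations such as "e.g.".

```python
import re
from typing import List

# Assumed set of sentence-ending punctuation: Western marks must be followed
# by whitespace, while Chinese marks split immediately after the mark.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(text: str) -> List[str]:
    """Split plain text into sentences at end punctuation (hypothetical helper)."""
    parts = _SENTENCE_END.split(text.strip())
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("We collect your name. We share it with partners!"):
        print(sentence)
```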
Preferably, screening out the sentences containing behavior verbs and data objects to obtain a set of (behavior verb, data object) pairs comprises:
first performing syntactic analysis and named entity recognition on each sentence; if no collection-class or sharing-class behavior verb appears in the part-of-speech tagging result of the syntactic analysis, discarding the sentence; otherwise, checking whether the named entity recognition result contains a data object, and if not, discarding the sentence; each remaining sentence then has a corresponding (behavior verb, data object) pair, and the set of (behavior verb, data object) pairs is thereby obtained.
Preferably, normalizing all data objects in the obtained set of pairs comprises:
normalizing the set of data objects using a synonym dictionary and fuzzy matching to obtain normalized data objects.
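Normalization with a synonym dictionary followed by fuzzy matching might look like the sketch below. The dictionary entries, the function name `normalize`, and the 0.8 similarity cutoff are illustrative assumptions; the fuzzy step uses `difflib.get_close_matches` from the Python standard library as a stand-in for whichever fuzzy matching method is actually used.

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical synonym dictionary: surface form -> canonical data object.
SYNONYMS: Dict[str, str] = {
    "e-mail": "email address",
    "email": "email address",
    "mobile number": "phone number",
    "telephone number": "phone number",
    "gps location": "location",
}
CANONICAL: List[str] = sorted(set(SYNONYMS.values()))


def normalize(data_object: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the canonical form of a data object, or None if nothing matches."""
    term = " ".join(data_object.lower().split())
    if term in CANONICAL:
        return term
    if term in SYNONYMS:  # exact synonym lookup first
        return SYNONYMS[term]
    # Fall back to fuzzy matching against every known surface form.
    known = list(SYNONYMS) + CANONICAL
    close = difflib.get_close_matches(term, known, n=1, cutoff=cutoff)
    if not close:
        return None
    return SYNONYMS.get(close[0], close[0])
```

With these assumed entries, a misspelling such as "email adress" resolves to "email address", while an unrelated phrase returns `None` and can be queued for manual addition to the dictionary.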
Preferably, the establishing a general classification of the private data includes:
obtaining a general classification of the private data according to the relevant privacy regulations;
wherein:
the relevant privacy regulations comprise: the relevant provisions of the California Consumer Privacy Act and of the privacy protection act.
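Mapping normalized data objects to a general classification can be sketched with a lookup table. The table below is purely illustrative: the category names loosely follow categories of personal information named in the California Consumer Privacy Act, and the specific object-to-category assignments, as well as the names `CATEGORY_OF` and `categories_collected`, are assumptions rather than the classification used by the invention.

```python
from typing import Dict, Iterable, Set

# Illustrative mapping from normalized data objects to privacy categories.
CATEGORY_OF: Dict[str, str] = {
    "name": "identifiers",
    "email address": "identifiers",
    "phone number": "identifiers",
    "location": "geolocation data",
    "fingerprint": "biometric information",
    "browsing history": "internet or network activity",
}


def categories_collected(normalized_objects: Iterable[str]) -> Set[str]:
    """Return the privacy categories covered by the given data objects."""
    return {CATEGORY_OF[obj] for obj in normalized_objects if obj in CATEGORY_OF}
```

Objects missing from the table are silently skipped here; a real system would more likely log them for review.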
Preferably, the method further comprises:
submitting the category of the privacy information collected by the privacy policy statement.
According to another aspect of the present invention, there is provided a system for extracting private information in a privacy policy, including:
the data acquisition and preprocessing module is used for acquiring the raw privacy policy data of applications, processing the raw privacy policy data in different formats to obtain privacy policy data in a general text format, and segmenting the obtained privacy policy data in the general text format into a plurality of individual sentences;
the part-of-speech tagging and named entity recognition module is used for performing extended training on an existing natural language processing model using pre-extracted sentences that describe privacy information to obtain a language processing model for the privacy policy domain, performing part-of-speech tagging and named entity recognition on each sentence using the obtained model, and then screening out the sentences containing behavior verbs and data objects to obtain a set of (behavior verb, data object) pairs;
and the privacy information classification module is used for normalizing all data objects in the obtained set of pairs, establishing a general classification of privacy data, and mapping the normalized data objects to the corresponding classes to obtain the categories of privacy information that the analyzed privacy policy declares as collected, thereby extracting the privacy information in the privacy policy.
Preferably, the system further comprises:
and the privacy information extraction result submitting module is used for submitting the categories of the privacy information obtained by the privacy information classifying module.
According to a third aspect of the present invention, there is provided a terminal comprising a memory, a processor and a computer program stored on the memory and operable on the processor, the processor being operable when executing the computer program to perform the method of any of the above.
According to a fourth aspect of the invention, there is provided a computer-readable storage medium, having stored thereon a computer program, which, when executed by a processor, is operable to perform the method of any of the above.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following beneficial effects:
the method, system, terminal and medium for extracting privacy information in a privacy policy provided by the invention can automatically analyze privacy policies on a large scale based on natural language processing. Using techniques from the natural language processing field such as part-of-speech tagging and named entity recognition, (collection-class or sharing-class behavior verb, data object) pairs are obtained; normalized data objects are then obtained by means of a synonym dictionary and fuzzy matching; finally, the normalized data objects are mapped to privacy information categories according to the mapping between data objects and the privacy information categories derived from relevant laws or regulations, yielding the categories of privacy information declared by the privacy policy. Efficient, rapid, and flexible privacy policy analysis is thereby achieved, meeting the needs of users, application market platforms, and regulatory departments in investigating privacy policies.
Compared with the prior art, the method, system, terminal and medium for extracting privacy information in a privacy policy can automatically, efficiently, and accurately extract, in a shorter time and without manually labeled data, the privacy information that a privacy policy declares as collected.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is a flowchart of a method for extracting privacy information in a privacy policy according to an embodiment of the present invention.
Fig. 2 is a flowchart of a method for extracting privacy information in a privacy policy according to a preferred embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating modules of a system for extracting privacy information in a privacy policy according to an embodiment of the present invention.
Detailed Description
The following examples illustrate the invention in detail: the embodiment is implemented on the premise of the technical scheme of the invention, and gives a detailed implementation mode and a specific operation process. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.
Fig. 1 is a flowchart of a method for extracting privacy information in a privacy policy according to an embodiment of the present invention.
As shown in fig. 1, the method for extracting privacy information in a privacy policy provided in this embodiment may include the following steps:
S100, acquiring the raw privacy policy data of applications, processing the raw privacy policy data in different formats to obtain privacy policy data in a general text format, and segmenting the obtained privacy policy data in the general text format into a plurality of individual sentences;
S200, performing extended training on an existing natural language processing model using pre-extracted sentences that describe privacy information to obtain a language processing model for the privacy policy domain, performing part-of-speech tagging and named entity recognition on each sentence using the obtained model, and then screening out the sentences containing behavior verbs and data objects to obtain a set of (behavior verb, data object) pairs;
S300, normalizing all data objects in the obtained set of (behavior verb, data object) pairs, establishing a general classification of privacy data, and mapping the normalized data objects to the corresponding classes to obtain the categories of privacy information that the analyzed privacy policy declares as collected (collection here means that the app transmits user information or user-generated data to a server, for purposes such as the service provider providing certain services to the user or conducting statistical research), thereby extracting the privacy information in the privacy policy.
In this embodiment, as a preferred embodiment, the method may further include the steps of:
S400, submitting the obtained privacy information categories.
In S100 of this embodiment, as a preferred embodiment, the original privacy policy data is obtained, and a crawler manner may be adopted to crawl the privacy policy link from the application market to obtain the original privacy policy web page data of each application. However, the format of the privacy policy raw web page data is HTML or PDF format, which is not suitable for further research, so in S100 of this embodiment, the raw web page data is preprocessed to obtain the privacy policy raw data in text format.
In S200 of this embodiment, as a preferred embodiment, the method for obtaining a corresponding (collection-class or sharing-class behavior verb, data object) binary group for each selected sentence may include the following steps:
firstly, carrying out syntactic analysis on each screened sentence, and if no collected or shared behavior verb appears in a part-of-speech tagging result of the syntactic analysis, omitting the sentence; otherwise, continuously checking whether the named entity identification analysis result contains the data object, and if not, omitting the sentence; and finally, obtaining a set containing (collection class or sharing class behavior verb and data object) duplets.
In S200 of this embodiment, as a specific application example, the collection-class behavior verbs include: ask, gather, check, knock, use, obtain, access, receive, gather, store, save, require, process, build, request, retain, and the like.
In S200 of this embodiment, as a specific application example, the sharing-class behavior verbs include: share, sell, provide, trade, transfer, give, distribute, disconnect, send, read, exchange, report, transmit, post, etc.
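The two-stage screening in S200 (verb check first, then data-object check) can be sketched on sentences that have already been tagged. In this hedged example the `TaggedSentence` structure, the `"VERB"` tag name, and the choice to pair every matching verb with every data object are assumptions; the verb sets are only a subset of the verbs listed above, and the tagging itself is presumed to come from the domain-adapted model.

```python
from typing import List, NamedTuple, Tuple

# Subset of the collection-class and sharing-class verbs listed in the text.
COLLECT_VERBS = {"ask", "gather", "check", "use", "obtain", "access", "receive",
                 "store", "save", "require", "process", "request", "retain"}
SHARE_VERBS = {"share", "sell", "provide", "trade", "transfer", "give",
               "distribute", "send", "exchange", "report", "transmit", "post"}


class TaggedSentence(NamedTuple):
    tokens: List[Tuple[str, str]]  # (lemma, part-of-speech tag) per token
    data_objects: List[str]        # spans the NER step labelled as data objects


def extract_pairs(sentences: List[TaggedSentence]) -> List[Tuple[str, str]]:
    """Keep sentences with a behavior verb and a data object; emit the pairs."""
    pairs: List[Tuple[str, str]] = []
    for sentence in sentences:
        verbs = [lemma for lemma, pos in sentence.tokens
                 if pos == "VERB" and lemma in COLLECT_VERBS | SHARE_VERBS]
        if not verbs:                  # no collection/sharing verb: discard
            continue
        if not sentence.data_objects:  # verb present but no data object: discard
            continue
        # Assumption: pair each matching verb with each data object.
        pairs.extend((verb, obj) for verb in verbs for obj in sentence.data_objects)
    return pairs
```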
In S300 of this embodiment, as a preferred embodiment, each (collection-class or sharing-class behavior verb, data object) pair in the set is converted into the corresponding privacy information category by means of a synonym dictionary, fuzzy matching, and a privacy information classification derived from laws or regulations.
According to the method for extracting the privacy information in the privacy policy, provided by the embodiment, aiming at the gap of privacy policy analysis, a natural language processing technology is applied to text analysis of the privacy policy, a privacy policy text is disassembled, a (collection-type or sharing-type behavior verb, data object) binary group is obtained by carrying out named entity recognition and part-of-speech tagging on a sentence, and the binary group is converted into a corresponding privacy information category through mapping, so that a user, an application market platform and a related supervision mechanism can be helped to quickly know which categories of information are collected by the privacy policy, and further, the next-step decision making is helped.
In some embodiments of the invention:
in S100, the raw privacy policy data comes from the privacy policy link provided in the application introduction of an application market (e.g., the Huawei application market, the Google application market, etc.).
Wherein:
first, seed applications are discovered among the popular applications in the application market, and a breadth-first crawling strategy is then applied to each seed application: its similar or recommended applications are appended to the tail of the crawl queue. The crawl queue is updated continuously until the number of applications no longer increases or reaches a given target value.
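The breadth-first discovery of applications can be sketched with a queue. The function `discover_apps` and the callback `get_related`, which stands in for fetching an application's "similar" or "recommended" list from the market page, are hypothetical names; no real application market API is implied.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set


def discover_apps(seeds: Iterable[str],
                  get_related: Callable[[str], Iterable[str]],
                  target: int) -> List[str]:
    """Breadth-first discovery starting from seed apps.

    Stops when the queue is exhausted (no new applications appear)
    or when `target` applications have been discovered.
    """
    discovered: List[str] = []
    seen: Set[str] = set()
    queue: Deque[str] = deque()
    for app in seeds:
        if app not in seen:
            seen.add(app)
            discovered.append(app)
            queue.append(app)
    while queue and len(discovered) < target:
        app = queue.popleft()
        for related in get_related(app):
            if related not in seen:
                seen.add(related)
                discovered.append(related)
                queue.append(related)  # appended to the tail of the crawl queue
                if len(discovered) >= target:
                    break
    return discovered[:target]
```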
In S100 of this embodiment, as a preferred embodiment, two tools, HTML2text and PDF2text, are used to preprocess the original privacy policies in HTML format and PDF format, yielding the privacy policy in text format. The text must also be split into sentences at this preprocessing step. Formatted lists in the text can cause the NLP parser to detect sentence breaks incorrectly or to assign wrong part-of-speech tags, and these errors would degrade the final result. Therefore, in S100 each item of a formatted list is merged with the leading clause that precedes the list to form a new sentence. For example:
We will collect your:
1.phone number
2.email address
3.name
will be recombined into three sentences: "We will collect your phone number", "We will collect your email address" and "We will collect your name".
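The list recombination shown in this example can be sketched as follows. The detection rules are assumptions made for illustration: a leading clause is taken to be any line ending in a colon, and a list item is any following line that starts with a number or a bullet character. The function name `merge_list_items` is hypothetical.

```python
import re
from typing import List, Optional

# Assumed list-item pattern: "1.", "2)", "-", "*" or "•" followed by the item text.
_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s*(.+)$")


def merge_list_items(lines: List[str]) -> List[str]:
    """Merge each list item with the leading clause that precedes the list."""
    merged: List[str] = []
    lead: Optional[str] = None  # leading clause waiting for list items
    lead_used = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        item = _ITEM.match(line)
        if item and lead is not None:
            merged.append(f"{lead} {item.group(1).strip()}")
            lead_used = True
            continue
        # Any other line ends the current list; keep an unused leading clause.
        if lead is not None and not lead_used:
            merged.append(lead + ":")
        lead, lead_used = None, False
        if line.endswith(":"):
            lead = line[:-1].rstrip()
        else:
            merged.append(line)
    if lead is not None and not lead_used:
        merged.append(lead + ":")
    return merged
```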
In S200, an existing spaCy model is used. In the existing spaCy model, named entity recognition is implemented with a deep learning method and part-of-speech tagging with a statistical method. However, since the existing spaCy model is not specific to the privacy policy domain, S200 selects a corpus from the privacy policy domain to perform extended training on it. To adapt the existing model to the privacy policy domain, 500 sentences are selected as training data in S200, and the existing model is also run on these training sentences so that the model does not forget its original annotation knowledge. Once the privacy policy domain model is obtained, it is used to analyze the sentences obtained in S100: if no collection-class or sharing-class behavior verb appears in the part-of-speech tagging result, the sentence is discarded; otherwise, the named entity recognition result is checked for a data object, and if there is none, the sentence is discarded. The result is a set of (collection-class or sharing-class behavior verb, data object) pairs.
It should be noted that the 500 sentences mentioned above are sentences related to the collection of privacy information, randomly selected from the privacy policy data set.
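One plausible reading of "running the existing model on the training sentences" is that the model's own predictions are kept alongside the new manual annotations, so that the extended training rehearses the old labels while learning the new one. The sketch below only builds such a combined training example and does not perform any training; the span format, the label names, the function `build_rehearsal_example`, and the `predict_existing` callback (a stand-in for the existing model) are all assumptions.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def build_rehearsal_example(
    text: str,
    new_spans: List[Span],
    predict_existing: Callable[[str], List[Span]],
) -> Tuple[str, List[Span]]:
    """Combine manual data-object spans with the existing model's predictions.

    Predicted spans that overlap a manual span are dropped, so the manual
    annotation always takes precedence.
    """
    merged = list(new_spans)
    for start, end, label in predict_existing(text):
        overlaps = any(start < e and s < end for s, e, _ in new_spans)
        if not overlaps:
            merged.append((start, end, label))
    return text, sorted(merged)
```

The resulting (text, spans) examples could then be passed to whatever update routine the chosen NLP library provides.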
In S300, the set of (collection-class or sharing-class behavior verb, data object) pairs obtained in S200 is used. Privacy policies sometimes use negative sentences to state that certain data will not be collected or shared, so this step also includes negation detection. If a negation word such as "not" or "no" is detected, the data object in the pair is considered not collected. From a practical point of view, however, sharing presupposes collection: only data that has been collected can be shared or not shared. Therefore, when only a sharing action is negated, the data object is still considered collected. Next, the privacy information text is normalized through a synonym dictionary and fuzzy matching. The synonym dictionary is based on an existing synonym dictionary, to which new words appearing in privacy policy texts can be added manually. Finally, the data objects are mapped to the corresponding privacy classes, according to the privacy classification obtained from relevant laws or regulations, to obtain the categories of privacy information collected by the privacy policy.
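The negation rule described here (a negated collection means "not collected", while a negated sharing still implies collection) can be sketched as below. The `Statement` structure and the function `collected_objects` are hypothetical; "not" and "no" come from the text, whereas "never" and "n't" are added as assumptions, and a simple token lookup stands in for whatever negation detection is actually used.

```python
from typing import Iterable, List, NamedTuple, Set

# "not" and "no" are named in the text; "never" and "n't" are assumptions.
NEGATION_WORDS = {"not", "no", "never", "n't"}


class Statement(NamedTuple):
    verb_type: str     # "collect" or "share"
    data_object: str
    tokens: List[str]  # tokens of the sentence the pair came from


def collected_objects(statements: Iterable[Statement]) -> Set[str]:
    """Return data objects considered collected after negation handling."""
    collected: Set[str] = set()
    for st in statements:
        negated = any(token.lower() in NEGATION_WORDS for token in st.tokens)
        if st.verb_type == "collect" and negated:
            continue  # negated collection: the object is not collected
        # Sharing, negated or not, presupposes that the data was collected.
        collected.add(st.data_object)
    return collected
```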
Fig. 2 is a flowchart of a method for extracting privacy information in a privacy policy according to a preferred embodiment of the present invention.
As shown in fig. 2, the preferred embodiment provides a method for extracting privacy information in a privacy policy based on natural language processing, which applies natural language processing technology to privacy policy analysis and may include the following steps:
step 1, acquiring the raw privacy policy data, preprocessing the raw privacy policy data in HTML or PDF format to obtain privacy policy data in text format, and splitting the privacy policy data in text format into a plurality of individual sentences at sentence-ending punctuation;
step 2, obtaining by pre-training a natural language processing model for the privacy policy domain, using the model to perform part-of-speech tagging and named entity recognition on the individual sentences, screening out the sentences containing collection-class or sharing-class behavior verbs, and obtaining a corresponding (collection-class or sharing-class behavior verb, data object) pair for each screened sentence;
step 3, based on the (collection-class or sharing-class behavior verb, data object) pairs, first obtaining normalized data objects by means of a synonym dictionary and fuzzy matching, and then converting the normalized data objects into privacy information categories according to the mapping between data objects and the privacy information categories derived from relevant laws or regulations.
As a preferred embodiment, the method may further comprise the steps of:
step 4, submitting the privacy information categories extracted in step 3.
The technical solutions provided by the preferred embodiments are further described in detail below with reference to the accompanying drawings.
As shown in fig. 2, the method provided by the preferred embodiment mainly includes three steps, namely data acquisition and preprocessing, natural language processing (including part-of-speech tagging and named entity recognition), and privacy information classification, and may further include a fourth step of submitting the privacy information result.
Specifically, the method comprises the following steps:
step 1, data acquisition and preprocessing: acquiring the raw privacy policy data, preprocessing the raw privacy policy data in HTML or PDF format to obtain privacy policy data in a general text format, and splitting the privacy policy data in text format into individual sentences;
step 2, natural language processing: performing extended training on an existing natural language processing model using manually selected sentences that describe privacy information to obtain a natural language processing model for the privacy policy domain, using the model to perform part-of-speech tagging and named entity recognition on the sentences obtained in step 1, screening out the sentences containing collection-class or sharing-class behavior verbs, and obtaining a corresponding (collection-class or sharing-class behavior verb, data object) pair for each screened sentence;
step 3, privacy information classification: normalizing the privacy information text through a synonym dictionary and fuzzy matching, based on the data objects in the pairs obtained in step 2. The synonym dictionary is based on an existing synonym dictionary, to which new words appearing in privacy policy texts can be added manually. Finally, the data objects are mapped to the corresponding privacy classes, according to the privacy classification obtained from relevant laws or regulations, to obtain the categories of privacy information collected by the privacy policy.
The function of each step is as follows:
Data acquisition and preprocessing: crawl the privacy policy links provided in an application market, fetch the raw privacy policy web page of each application, and preprocess the data in preparation for the analysis model in the next step.
Natural language processing: to adapt the existing spaCy model to text in the privacy policy domain, a corpus of privacy policy text is used for extension training of the existing model, improving its accuracy in part-of-speech tagging and named entity recognition. The resulting model is then used to analyze the privacy policy text obtained in step 1 and to extract (collection-class or sharing-class behavior verb, data object) pairs from it.
Classification of privacy information: map the data objects in the pairs extracted in the previous step. First, the data object text is normalized using a synonym dictionary and fuzzy matching. Then, according to the privacy classification derived from the relevant laws or regulations, each normalized data object is mapped to its privacy category, giving the categories of privacy information collected under the privacy policy.
Submitting the privacy information result: after extraction is finished, the system returns the extraction result, namely the categories of privacy information collected under the privacy policy.
For clarity, the extraction scheme is described in detail below.
1. Data acquisition and preprocessing: regulations in different regions, such as China, the United States, and Europe, require applications to provide a clear privacy policy. Mainstream application markets in each region therefore require every application to provide a privacy policy link, and the raw privacy policy web page of an application can be obtained through the link it provides in the application market. The raw web page, however, contains many irrelevant elements, such as CSS and script elements in the HTML. If these elements are not removed, both the efficiency and the accuracy of the text analysis suffer. In the data acquisition and preprocessing part, privacy policy data in HTML format is therefore converted into text format with the HTML2txt tool. Not all privacy policies are presented as HTML; some are presented to the user as PDF files. To handle these as well, privacy policies in PDF format are converted into text format with the PDF2txt tool. Formatted lists are then split apart and recombined into a form better suited to the subsequent privacy policy analysis.
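The HTML cleaning described above can be sketched as follows. This is only an illustrative stand-in built on the Python standard library, not the HTML2txt tool named in the text; the class name `PolicyTextExtractor`, the function `html_to_text`, and the choice of block-level tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags whose contents are irrelevant to the policy text (assumed list).
SKIPPED_TAGS = {"script", "style"}
# Tags treated as line boundaries (assumed list, not from the source).
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "ul", "ol",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class PolicyTextExtractor(HTMLParser):
    """Collect visible text while dropping script and style contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._current = []     # text fragments of the line being built
        self._lines = []       # finished lines of visible text

    def _flush(self):
        # Collapse runs of whitespace and keep only non-empty lines.
        line = " ".join("".join(self._current).split())
        if line:
            self._lines.append(line)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)

    def text(self):
        self._flush()
        return "\n".join(self._lines)


def html_to_text(html: str) -> str:
    """Convert a raw privacy policy page into plain text."""
    parser = PolicyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Inline tags such as `<b>` do not break a line in this sketch, so a sentence that spans several inline elements stays on one line for the later sentence segmentation.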
2. Part-of-speech tagging and named entity recognition: at present there is no ready-made model for the privacy policy domain, and existing general-domain models do not perform well enough on privacy policy text. Retraining a model from scratch for the privacy policy domain would require a large corpus and a long training time. Therefore, a small privacy policy corpus is used for extension training of an existing model, which keeps the strengths of the existing model while improving its applicability to the privacy policy domain; this is the most cost-effective option. Once the privacy policy domain model is obtained, each sentence produced by sentence segmentation is traversed, part-of-speech tagging and named entity recognition are performed on it, sentences containing collection-class or sharing-class behavior verbs are selected, and a (collection-class or sharing-class behavior verb, data object) pair is obtained for each selected sentence.
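The selection logic applied to each tagged sentence can be illustrated with a minimal sketch. It assumes the tagger has already produced lemmas, part-of-speech tags, and named entities; the `Token` and `Entity` classes, the `DATA` entity label, and the two verb lists are hypothetical and are not taken from the source. Pairing the first matching verb with every data object is a simplification made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical verb lists; the source does not enumerate them.
COLLECT_VERBS = {"collect", "gather", "obtain", "receive", "record"}
SHARE_VERBS = {"share", "disclose", "transfer", "sell"}
BEHAVIOR_VERBS = COLLECT_VERBS | SHARE_VERBS


@dataclass
class Token:
    lemma: str  # base form of the word, as produced by the tagger
    pos: str    # coarse part-of-speech tag, e.g. "VERB" or "NOUN"


@dataclass
class Entity:
    text: str   # surface text of the recognized entity
    label: str  # assumed label; "DATA" marks a data object


def extract_pairs(tokens: List[Token],
                  entities: List[Entity]) -> List[Tuple[str, str]]:
    """Return (behavior verb, data object) pairs for one tagged sentence."""
    verbs = [t.lemma.lower() for t in tokens
             if t.pos == "VERB" and t.lemma.lower() in BEHAVIOR_VERBS]
    if not verbs:
        return []  # no collection-class or sharing-class verb: drop sentence
    objects = [e.text.lower() for e in entities if e.label == "DATA"]
    if not objects:
        return []  # no data object: drop sentence
    # Simplification: pair the first behavior verb with each data object.
    return [(verbs[0], obj) for obj in objects]
```

A sentence is kept only when both checks pass, which mirrors the two-stage filter described in the text.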
3. Classification of privacy information: different privacy policies are likely to describe the same privacy information differently; for example, "email" and "email address" refer to the same thing. Therefore, after the data object text is extracted and before the data objects are classified, they need to be normalized using the synonym dictionary. Each data object is then mapped to a privacy information category, according to the categories derived from the relevant laws or regulations and the membership relation between data objects and categories, finally giving the categories of privacy information collected under the privacy policy.
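Normalization followed by category mapping can be sketched as below. The synonym entries, the category names, and the similarity cutoff are placeholders invented for illustration; the source obtains its categories from laws or regulations and does not list them. Fuzzy matching is approximated here with `difflib.get_close_matches` from the Python standard library, which may differ from the matching method actually used.

```python
import difflib
from typing import Iterable, Optional, Set

# Hypothetical synonym dictionary: surface form -> canonical data object.
SYNONYMS = {
    "email": "email address",
    "e-mail": "email address",
    "email address": "email address",
    "gps": "location",
    "geolocation": "location",
    "location": "location",
    "phone number": "phone number",
    "telephone number": "phone number",
}

# Hypothetical mapping from canonical data object to privacy category.
CATEGORY_OF = {
    "email address": "contact information",
    "phone number": "contact information",
    "location": "location information",
}


def normalize(data_object: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a raw data object string to its canonical form, if any."""
    key = " ".join(data_object.lower().split())
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fall back to fuzzy matching against the known surface forms.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None


def classify(data_objects: Iterable[str]) -> Set[str]:
    """Return the set of privacy categories covered by the data objects."""
    categories = set()
    for obj in data_objects:
        canonical = normalize(obj)
        if canonical in CATEGORY_OF:
            categories.add(CATEGORY_OF[canonical])
    return categories
```

Data objects that match nothing are silently skipped in this sketch; in practice they would be candidates for manual addition to the synonym dictionary, as the text describes.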
4. Submitting the privacy information result: after receiving the privacy information categories, the extraction system can display to the user the categories that the analyzed application declares it collects in its privacy policy. If a large set of applications is analyzed, the system can also output a statistical analysis of privacy information collection across the set, for the user to analyze and confirm further.
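For a set of applications, one simple form the statistical result could take is the share of applications declaring each category. The sketch below assumes the per-application categories are already available as a dictionary; the function name `summarize` and the input layout are illustrative choices, and the source does not specify which statistics are reported.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(app_categories: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Return, per category, the fraction of applications that declare it."""
    total = len(app_categories)
    if total == 0:
        return {}
    counts = Counter()
    for categories in app_categories.values():
        # Count each category at most once per application.
        counts.update(set(categories))
    return {category: count / total for category, count in counts.items()}
```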
The method for extracting privacy information in a privacy policy provided by the preferred embodiment is based on natural language processing. It first converts the original privacy policy in HTML (HyperText Markup Language) or PDF (Portable Document Format) format into a general text format and splits the text into sentences. It then performs part-of-speech tagging and named entity recognition on each sentence with the pre-trained natural language processing model, uses the results to extract the data objects mentioned in the sentence, and obtains the privacy information categories collected under the privacy policy through normalization and classification. A user, an application market platform, or a regulator can thereby learn what privacy information an application's privacy policy declares it collects, which helps with the next decision. Because the preferred embodiment extracts the privacy information with natural language processing, no manual labeling is needed, privacy analysis becomes more efficient, faster, and more flexible, and the needs of the related industries are met.
Fig. 3 is a schematic diagram illustrating constituent modules of a privacy policy and privacy information extraction system according to an embodiment of the present invention.
As shown in Fig. 3, according to another aspect of the present invention, this embodiment provides a system for extracting privacy information in a privacy policy, which may include a data acquisition and preprocessing module, a part-of-speech tagging and named entity recognition module, and a privacy information classification module, wherein:
the data acquisition and preprocessing module is used for acquiring the original privacy policy data of applications, processing the original data in its different formats to obtain privacy policy data in a universal text format, and splitting the obtained text into a plurality of independent sentences;
the part-of-speech tagging and named entity recognition module is used for performing extension training on an existing natural language processing model using pre-selected sentences that describe privacy information to obtain a language processing model for the privacy policy domain, performing part-of-speech tagging and named entity recognition on each sentence using the obtained model, and selecting the sentences that contain behavior verbs and data objects to obtain a set of (behavior verb, data object) pairs;
and the privacy information classification module is used for normalizing all data objects in the obtained set of pairs, establishing a general classification of privacy data, and mapping the normalized data objects to the corresponding categories to obtain the categories of privacy information that the analyzed privacy policy declares it collects, thereby extracting the privacy information in the privacy policy.
As a preferred embodiment, the system may further include:
a privacy information extraction result submitting module, used for submitting the categories of privacy information obtained by the privacy information classification module.
In some embodiments of the invention:
the data acquisition and preprocessing module acquires the original privacy policy data, converts it into a plain text format, merges formatted lists, and finally splits the privacy policy into independent sentences;
the part-of-speech tagging and named entity recognition module first performs extension training on an existing natural language processing model, which increases the applicability of the existing model to the privacy policy domain. It then performs part-of-speech tagging and named entity recognition on each sentence using the extended model. If no collection-class or sharing-class behavior verb appears in the part-of-speech tagging result of the syntactic analysis, the sentence is discarded; otherwise, the analysis result is further checked for a data object, and if it contains none, the sentence is likewise discarded. Finally, a set of (collection-class or sharing-class behavior verb, data object) pairs is obtained;
the privacy information classification module normalizes the data objects in the pairs obtained by the previous module using the synonym dictionary and fuzzy matching, and then maps the normalized data objects to privacy categories, according to the privacy classification derived from the relevant laws or regulations and the mapping between privacy categories and data objects, obtaining the privacy information categories collected under the privacy policy;
and the privacy information extraction result submitting module submits the privacy information extraction result, namely the privacy information categories collected under the privacy policy.
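The list merging and sentence splitting performed by the data acquisition and preprocessing module can be illustrated with the following sketch. The source does not specify how formatted lists are merged, so the rule used here (a lead-in line ending in a colon absorbs the bullet items that follow it) is an assumption, as are the function names and the bullet pattern. Sentence splitting is done on sentence-ending punctuation only.

```python
import re
from typing import List

# Assumed bullet markers: "-", "*", or a number followed by "." or ")".
BULLET = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def merge_lists(lines: List[str]) -> List[str]:
    """Join a lead-in line ending in ':' with the bullet items after it."""
    merged: List[str] = []
    lead_in = None
    items: List[str] = []

    def flush():
        nonlocal lead_in, items
        if lead_in is not None:
            if items:
                merged.append(f"{lead_in} {', '.join(items)}.")
            else:
                merged.append(lead_in)
        lead_in, items = None, []

    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if lead_in is not None and BULLET.match(stripped):
            # Strip the bullet marker and any trailing list punctuation.
            items.append(BULLET.sub("", stripped).rstrip(".;,"))
        else:
            flush()
            if stripped.endswith(":"):
                lead_in = stripped
            else:
                merged.append(stripped)
    flush()
    return merged


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

Merging before splitting keeps the behavior verb in the lead-in and the data objects in the bullet items within one sentence, so the later verb and data object checks see them together.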
An embodiment of the present invention provides a terminal, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the processor can be configured to perform the method in any one of the above embodiments of the present invention.
Optionally, the memory is used for storing a program. The memory may include volatile memory, such as random access memory (RAM), static random access memory (SRAM), or double data rate synchronous dynamic random access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory stores computer programs (for example, applications or functional modules that implement the above method), computer instructions, and the like, which may be stored in partitions across one or more memories. The computer programs, computer instructions, and data described above may be invoked by the processor.
The processor is used for executing the computer program stored in the memory to implement the steps of the method according to the above embodiments. For details, refer to the description of the preceding method embodiments.
The processor and the memory may be separate structures or may be integrated into a single structure. When the processor and the memory are separate structures, the memory and the processor may be coupled by a bus.
An embodiment of the invention provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any of the above-mentioned embodiments of the invention.
The method, system, terminal and medium for extracting privacy information in a privacy policy provided by the embodiments of the invention are mainly divided into three parts: data acquisition and preprocessing, part-of-speech tagging and named entity recognition, and classification. The part-of-speech tagging and named entity recognition part addresses the first problem mentioned in the background: a collection-class or sharing-class verb and a data object are used as markers, and a sentence containing both is taken to describe the collection of privacy information. The classification part addresses the second and third problems mentioned in the background: first, the categories of privacy information are summarized from the relevant laws or regulations, users' understanding of privacy information is gathered through crowdsourcing, and the two are combined to obtain the privacy information categories; then a synonym lexicon of data objects is built, grouping synonyms such as 'location' and 'gps' together, so that data objects can be normalized.
The method, system, terminal and medium for extracting privacy information in a privacy policy provided by the embodiments of the invention apply natural language processing to the text analysis of privacy policies: the privacy policy text is split into sentences, (collection-class or sharing-class behavior verb, data object) pairs are obtained by performing named entity recognition and part-of-speech tagging on each sentence, the extracted data objects are normalized using a synonym dictionary and fuzzy matching, and finally they are mapped to the corresponding privacy information categories. This helps users, application market platforms, and regulators quickly learn which types of information a privacy policy declares it collects, and thus helps them make the next decision. Because the privacy policy text is analyzed with natural language processing, no data needs to be labeled manually, and the privacy information declared as collected in the privacy policy can be extracted automatically, efficiently, and accurately, meeting the needs of the related industries.
It should be noted that the steps of the method provided by the present invention may be implemented by the corresponding modules, devices, and units of the system, and those skilled in the art may implement the system by referring to the technical solution of the method. That is, the method embodiments may be understood as preferred examples for constructing the system, and the details are not repeated here.
Those skilled in the art will appreciate that, besides implementing the system and its devices provided by the present invention purely as computer-readable program code, the method steps can be programmed so that the system and its devices implement the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system and its devices may be regarded as a hardware component, and the devices included therein for realizing various functions may be regarded as structures within the hardware component; the devices for realizing the functions may also be regarded both as software modules implementing the method and as structures within the hardware component.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (10)

1. A method for extracting privacy information in a privacy policy is characterized by comprising the following steps:
acquiring the original privacy policy data of applications, processing the original data in its different formats to obtain privacy policy data in a universal text format, and splitting the obtained privacy policy data in the universal text format into a plurality of independent sentences;
performing extension training on an existing spaCy model using pre-selected sentences that describe privacy information to obtain a language processing model for the privacy policy domain, performing part-of-speech tagging and named entity recognition on each sentence using the obtained model, and selecting the sentences that contain behavior verbs and data objects to obtain a set of (behavior verb, data object) pairs;
normalizing all data objects in the obtained set of pairs, establishing a general classification of privacy data, and mapping the normalized data objects to the corresponding categories to obtain the categories of privacy information that the analyzed privacy policy declares it collects, thereby extracting the privacy information in the privacy policy.
2. The method for extracting privacy information in a privacy policy according to claim 1, wherein acquiring the original privacy policy data of applications comprises: crawling privacy policy links from an application market with a crawler to obtain the original privacy policy web page data of each application.
3. The method of claim 2, wherein the original data of the privacy policy page is in an HTML format or a PDF format.
4. The method according to claim 1, wherein splitting the obtained privacy policy data in the universal text format into a plurality of independent sentences comprises: using a general sentence segmentation method based on natural language processing to split the privacy policy data in the universal text format into a plurality of independent sentences at sentence-ending punctuation.
5. The method for extracting privacy information in a privacy policy according to claim 1, wherein selecting the sentences that contain behavior verbs and data objects to obtain a set of (behavior verb, data object) pairs comprises:
first, performing syntactic analysis and named entity recognition on each sentence; if no collection-class or sharing-class behavior verb appears in the part-of-speech tagging result of the syntactic analysis, discarding the sentence; otherwise, checking whether the named entity recognition result contains a data object, and if not, discarding the sentence; finally, each remaining sentence has a corresponding (behavior verb, data object) pair, whereby the set of (behavior verb, data object) pairs is obtained.
6. The method of claim 1, wherein normalizing all data objects in the obtained set of pairs comprises:
normalizing the data objects in the set using a synonym dictionary and fuzzy matching to obtain normalized data objects;
and/or
establishing the general classification of the privacy data comprises:
obtaining a general classification of the privacy data according to privacy-related provisions.
7. The method for extracting private information in a privacy policy according to any one of claims 1 to 6, further comprising:
submitting the category of the privacy information collected by the privacy policy statement.
8. A system for extracting private information in a privacy policy, comprising:
a data acquisition and preprocessing module, used for acquiring the original privacy policy data of applications, processing the original data in its different formats to obtain privacy policy data in a universal text format, and splitting the obtained privacy policy data in the universal text format into a plurality of independent sentences;
a part-of-speech tagging and named entity recognition module, used for performing extension training on an existing spaCy model using pre-selected sentences that describe privacy information to obtain a language processing model for the privacy policy domain, performing part-of-speech tagging and named entity recognition on each sentence using the obtained model, and selecting the sentences that contain behavior verbs and data objects to obtain a set of (behavior verb, data object) pairs;
and a privacy information classification module, used for normalizing all data objects in the obtained set of pairs, establishing a general classification of privacy data, and mapping the normalized data objects to the corresponding categories to obtain the categories of privacy information that the analyzed privacy policy declares it collects, thereby extracting the privacy information in the privacy policy.
9. A terminal comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the computer program, when executed by the processor, is operable to perform the method of any of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110609050.XA CN113282955B (en) 2021-06-01 2021-06-01 Method, system, terminal and medium for extracting privacy information in privacy policy

Publications (2)

Publication Number Publication Date
CN113282955A CN113282955A (en) 2021-08-20
CN113282955B true CN113282955B (en) 2022-07-08

Family

ID=77282965

Country Status (1)

Country Link
CN (1) CN113282955B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant