CN113220877A - Privacy policy compliance detection method - Google Patents


Info

Publication number
CN113220877A
Authority
CN
China
Prior art keywords
data
privacy
label
privacy policy
gdpr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110480404.5A
Other languages
Chinese (zh)
Inventor
刘爽
赵栢杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110480404.5A
Publication of CN113220877A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a privacy policy compliance detection method comprising the following steps: (1) rule extraction: by reading the General Data Protection Regulation (GDPR), the logical content of the privacy protection law is organized and labels are defined according to that content; (2) text classification: privacy policies are first screened to remove non-English data and terms-of-service data; each sentence of the screened privacy policies is annotated with a label to build a corpus; the corpus is then used to train a text classification model, and the trained model can classify the sentences of any privacy policy, predicting a label for each sentence; (3) rule matching: the rules extracted in step (1) are matched against the text classification results of step (2), i.e. the predicted labels, to obtain the final compliance detection result.

Description

Privacy policy compliance detection method
Technical Field
The invention relates to the field of natural language processing technology and privacy policies, in particular to a privacy policy compliance detection method.
Background
In recent years, web and mobile applications have developed rapidly and are widely used in daily life. As a result, more and more personal data is provided to different application providers, and over the past few years there have been many reports of privacy violations. One of the most influential cases involves Alipay. When Alipay generated the 2017 annual bill for its users, the authorization interface contained a small line of text, "I agree to the Sesame Service Agreement," and the "agree" option was selected by default without the user being explicitly informed. This behavior violates the conditions on consent in the General Data Protection Regulation (GDPR) [18] (Recital 32). Another case is the privacy policy of "ZAO", a face-swapping app based on deep learning, which specified that users must grant ZAO and its affiliated companies permanent and irrevocable rights to their personal data [17]. This violates the user's rights to rectify, erase, and object to the processing of their data (GDPR Article 13(2)).
In addition to the GDPR, other laws and regulations have been promulgated in different regions and countries, such as the UK Data Protection Act [20] and the California Consumer Privacy Act [19]. Among them, the GDPR is the best known and, because of its broad geographical scope, has recently given rise to some well-known penalty cases. For example, the Berlin Commissioner for Data Protection and Freedom of Information has investigated Mobike under the GDPR [23].
Although there are laws and regulations aimed at protecting personal data, it is difficult to know or check whether the companies and organizations that collect, process, or store users' personal data actually comply with them. There are two main reasons for this difficulty. First, like other laws and regulations, the GDPR is written in natural language and contains a large number of legal terms, which makes it hard to understand for users without domain knowledge. Second, privacy policies are often long natural-language documents that are very time-consuming for app users to read. One previous study [22] concluded that if US citizens read every privacy policy presented to them, they would need an average of 40 minutes per day.
It is therefore difficult for application users to discover violations by themselves. Furthermore, in some cases service developers or providers violate laws and regulations inadvertently because they lack the relevant knowledge. There is thus a need to detect automatically whether privacy policies comply with privacy protection laws and regulations such as the GDPR.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a method for automatically detecting the compliance of a privacy policy based on classification and rules.
The purpose of the invention is achieved by the following technical solution:
A privacy policy compliance detection method comprises the following steps:
(1) Rule extraction: by reading the General Data Protection Regulation (GDPR), the logical content of the privacy protection law is organized and labels are defined according to that content;
(2) Text classification: privacy policies are first screened to remove non-English data and terms-of-service data; each sentence of the screened privacy policies is annotated with a label to build a corpus; the corpus is then used to train a text classification model, and the trained model can classify the sentences of any privacy policy, predicting a label for each sentence;
(3) Rule matching: the rules extracted in step (1) are matched against the text classification results of step (2), i.e. the predicted labels, to obtain the final compliance detection result.
Further, the logical content in step (1) comprises the collection of the data subject's personal data by the data controller and the information that the data controller must provide to the data subject; the latter includes the data storage period, the purposes of processing, contact details, the right to access personal data, the right to rectify or erase personal data, the right to restrict processing, the right to object to processing, the right to data portability, and the right to lodge a complaint.
Further, in step (2), 10 privacy policy topic labels are formulated according to the GDPR when constructing the corpus; the privacy policies are collected from popular applications on Google Play and then screened; the text classification models include SVM, LSTM and BERT.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. To the best of the inventors' knowledge from the literature, the present invention is the first to formulate the consistency detection problem as a supervised classification task over privacy policy sentences. A corresponding corpus is also contributed and made public to help researchers further study privacy policy consistency detection and text classification.
2. After training, the models modified in this work, LSTM + LossW and BERT + LossW, both perform well on the corpus of this patent and classify privacy policies at the sentence level, with macro-F1 scores of 64.41% and 71.78%, respectively.
3. Consistency detection between privacy policies and the GDPR is realized: 1180 problems are detected in the 304 privacy policies of the test set, with a precision of 90%.
4. The method analyzes the software privacy policy and the actual software behavior, checks the two for consistency, and finally compares the behavior of the application with its privacy policy. The invention performs consistency detection between app privacy policies and the GDPR; its results help users understand their rights, help companies judge the legality of their privacy policies, and support oversight by regulators.
Drawings
FIG. 1a shows Article 13(2) of the GDPR, and FIG. 1b shows the English version of the ZAO privacy policy.
FIG. 2 is the overall workflow diagram.
FIG. 3 is the network architecture diagram of BERT.
FIG. 4 shows the performance of the three models at different paragraph lengths (divided according to the number of sentences contained in the paragraph). The solid line is SVM, the dash-dot line is LSTM + LossW, and the dotted line is BERT + LossW.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present embodiment proposes a method that automatically analyzes the content of a privacy policy and reports violations of GDPR Article 13. Referring to FIGS. 1a and 1b, Article 13 was chosen because it specifies which information must be provided to the data subject and how that information is to be provided [21], which makes it the part of the GDPR most directly reflected in a privacy policy. A classification scheme based on GDPR Article 13 was designed and used to annotate a corpus of 304 privacy policies. The corpus was then benchmarked with several standard classification models, and the results were analyzed comprehensively. Finally, the rules extracted from the GDPR and the classification results are used to check whether a privacy policy conforms to GDPR Article 13. The method detects 1180 compliance problems in the 304 privacy policy documents, with a precision of 90% and a recall of 91%.
In this work, the main task was to automatically analyze privacy policies. Three classical models are used for this purpose.
1. Machine learning model: Support Vector Machine (SVM)
SVM [2] is a classical machine learning model for classification and regression tasks and has been widely used for paragraph classification in natural language processing. An SVM constructs a hyperplane that separates data of different classes. In the present invention, the SVM model is trained on TF-IDF [7] features of the corpus.
2. Deep learning model: Long Short-Term Memory (LSTM)
LSTM [3] is a model well suited to processing sequential information. Its learning principle resembles the way humans read natural language: it proceeds in order and has a forgetting mechanism. Because context information is also important, a bidirectional LSTM (BiLSTM [5]) is used to obtain better results.
3. Deep learning model: Bidirectional Encoder Representations from Transformers (BERT)
Referring to FIG. 3, BERT [4] is a pre-trained model that has achieved state-of-the-art results on a variety of natural language processing tasks [9-10]. It learns word-level and sentence-level feature representations with two pre-training objectives, respectively: a masked language model (predicting randomly masked words) and next-sentence prediction.
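As a rough illustration of the TF-IDF features mentioned for the SVM in item 1, the following sketch computes TF-IDF weights in plain Python. The whitespace tokenizer, the unsmoothed formula tf × log(N / df), and the example sentences are assumptions made for illustration; the text does not specify the exact weighting used in the embodiment.

```python
import math
from collections import Counter

def tokenize(sentence):
    # Assumed tokenizer: lowercase and split on whitespace.
    return sentence.lower().split()

def tfidf(sentences):
    """Return one {term: weight} dict per sentence using tf * log(N / df)."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(docs)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items()
        })
    return vectors

# Hypothetical example sentences, not taken from the corpus.
example = [
    "we collect your email address",
    "we retain your data for two years",
    "you may contact our data protection officer",
]
vectors = tfidf(example)
```

Terms that occur in few sentences (such as "email" above) receive a higher weight than terms shared by many sentences (such as "we"), which is what makes the features useful for separating labels.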
The method for automatically detecting the compliance of the privacy policy based on classification and rules mainly comprises three steps of rule extraction, sentence classification and rule-based compliance analysis.
Step one, rule extraction;
This step targets the GDPR, the most stringent privacy protection regulation worldwide. Reading the regulation shows that its requirements for privacy policies are mainly found in Article 13, and that these requirements follow the logic "if A exists, B must exist (A → B)", where A represents the data controller collecting personal data of the data subject, and B represents information that the data controller must provide to the data subject. The labels are defined according to this logic. In this example, nine such rules are proposed, as shown in Table 1.
TABLE 1 Logic extracted from the GDPR for consistency detection
[Table 1 appears as an image in the original publication and is not reproduced here.]
Step two, text classification
Each sentence of the privacy policy is classified into one of the 11 defined labels, so that a label can be provided automatically for every sentence of the privacy policy.
(201) Constructing a corpus. After the GDPR was promulgated, the content of privacy policies changed considerably. No existing corpus is suitable for this task: most existing corpora are not annotated according to the GDPR and cannot be used for the classification task of consistency detection. A corpus therefore needs to be created in which each sentence of a privacy policy is annotated with a label, to help analyze the consistency between privacy policies and the relevant regulations. Based on the GDPR, the OPP-115 corpus [1], and other privacy policy research, 10 privacy policy topic labels were formulated. Privacy policies were then collected from popular applications on Google Play and screened to remove non-English data, terms-of-service data, and similar noise. Finally, annotators were invited to label the collected data, yielding the corpus needed for the experiment.
(202) The privacy policy is automatically classified using a classification model.
The present embodiment uses SVM, LSTM, and BERT models to classify the annotated data. The corpus is used to train a text classification model with supervised learning; once trained, the model can classify the sentences of any privacy policy and predict a label for each sentence. To this end, a corpus of 36610 sentences is first created from 304 privacy policies, based on the labeling scheme designed in the rule extraction step. A classification model is then trained on the labeled data to complete the sentence classification task.
Step three, rule matching
Rule matching is performed between the rules extracted in step one and the text classification results of step two, i.e. the predicted labels, to obtain the final compliance detection result. Given the labeled sentences, the compliance of a privacy policy with the GDPR can be tested according to the logic in Table 1.
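The rule-matching step can be sketched as follows. Because Table 1 is not reproduced in the text, the rule set here is an assumption: the antecedent A is taken to be the CPI label and each of the nine other GDPR-related labels is treated as a required consequent B. The function name and the example label sequence are hypothetical.

```python
# Assumed rule set: A is the CPI label and each B is one of the nine
# information labels. This mapping illustrates the A -> B logic and is
# not the patent's exact Table 1.
REQUIRED_IF_COLLECTING = ["DRP", "DPP", "CD", "RA", "RRE", "RRP", "ROP", "RDP", "RLC"]

def check_compliance(predicted_labels):
    """Return the labels B that are missing although A (CPI) is present."""
    present = set(predicted_labels)
    if "CPI" not in present:
        # A does not hold, so every rule A -> B is trivially satisfied.
        return []
    return [label for label in REQUIRED_IF_COLLECTING if label not in present]

# Hypothetical predicted labels for the sentences of one policy.
policy_labels = ["Others", "CPI", "DPP", "CD", "Others", "RA", "RRE"]
problems = check_compliance(policy_labels)
print(problems)  # ['DRP', 'RRP', 'ROP', 'RDP', 'RLC']
```

Each returned label corresponds to one reported inconsistency: the policy states that personal data is collected but never provides the corresponding information.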
Specifically, the present embodiment carries out the above steps as follows:
First, based on the GDPR and with reference to the OPP-115 corpus and other privacy policy research, 11 labels relevant to privacy policy consistency detection were formulated. The labels are detailed below:
Collect personal data (CPI): information that identifies the data subject as an individual is collected. [GDPR 13.1]
Data retention period (DRP): the period for which the data controller stores personal information. [GDPR 13.2(a)]
Data processing purposes (DPP): the purposes for which personal data is processed. [GDPR 13.1(c)]
Contact details (CD): contact details of the data controller and the data protection officer. [GDPR 13.1(a)(b)]
Right to access personal data (RA): the data subject's right to request access to their personal data from the data controller. [GDPR 13.2(b)]
Right to rectify or erase personal data (RRE): the data subject's right to request that the data controller rectify or erase their personal data. [GDPR 13.2(b)]
Right to restrict processing (RRP): the data subject's right to request that the data controller restrict the processing of their personal data. [GDPR 13.2(b)]
Right to object to processing (ROP): the data subject's right to object to the processing of their personal data by the data controller. [GDPR 13.2(b)]
Right to data portability (RDP): the data subject's right to have their personal data transferred to another controller. [GDPR 13.2(b)]
Right to lodge a complaint (RLC): the data subject's right to lodge a complaint with the supervisory authority. [GDPR 13.2(d)]
Others: sentences that do not belong to any of the above categories.
Using the Scrapy framework, 1313 Android application privacy policy links were crawled from popular applications on the Android app store Google Play. From these links, the text of the privacy policies was crawled dynamically using Scrapy and Selenium. Noise such as non-English data and terms of service was then removed and overly short privacy policies were filtered out; the remaining 304 privacy policies were handed to annotators for labeling. The final 304 privacy policies contain 36610 annotated sentences in total. The open-source text span annotation tool YEDDA [24] was adopted and customized to complete the annotation task.
In this embodiment, 22 volunteers were recruited to annotate the privacy policies; they are local students and researchers in law and computer science with a good command of English. To control annotation quality, the volunteers were first trained on the annotation task: they were given a short tutorial and provided with labeled example sentences that clarify the meaning of each label. After training, the volunteers were asked to annotate a small number of privacy policies, the quality of the results was checked, and any misunderstandings were clarified collectively. This process ensures that all volunteers clearly understand the meaning of the labels. Each sentence was labeled independently by 3 volunteers. Each volunteer was assigned a set of privacy policies and annotated them individually, needing 40 minutes on average per privacy policy. After all volunteers had finished their own annotation tasks, the 3 volunteers who had annotated the same privacy policy met to merge their labels. Following the standard procedure, if all 3 volunteers gave the same label, that label became the final label of the sentence; otherwise, they discussed the sentence until a consensus was reached.
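The merging procedure described above can be sketched as a small helper that accepts unanimous labels and flags the remaining sentences for discussion. The function name, the data layout (one 3-tuple of labels per sentence), and the example annotations are assumptions for illustration.

```python
def merge_labels(annotations):
    """Merge three independent annotations per sentence.

    annotations: list of 3-tuples, one label per annotator per sentence.
    Returns (final_labels, indices_needing_discussion).
    """
    final, to_discuss = [], []
    for i, labels in enumerate(annotations):
        if len(set(labels)) == 1:
            # All annotators agree: accept the label directly.
            final.append(labels[0])
        else:
            # Disagreement: leave open until the annotators reach consensus.
            final.append(None)
            to_discuss.append(i)
    return final, to_discuss

# Hypothetical annotations for three sentences.
annotations = [
    ("CPI", "CPI", "CPI"),
    ("DPP", "CPI", "DPP"),
    ("Others", "Others", "Others"),
]
final, to_discuss = merge_labels(annotations)
```

Note that a majority label is deliberately not accepted here, since the text requires all 3 volunteers to agree or to discuss until they do.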
TABLE 2 Classification statistics for the annotated corpus
[Table 2 appears as an image in the original publication and is not reproduced here.]
The details of the annotated corpus are shown in Table 2. The "frequency" column shows the total number of occurrences of the corresponding label in the corpus. The "coverage" column shows the coverage of the label, i.e. the percentage of privacy policy documents that contain it. The "word average" column is the average number of words per sentence. The last column is the Fleiss' Kappa of the annotations (before merging). The two labels "destination of processing data (DPP)" and "collect personal data (CPI)" appear most frequently in privacy policies. Other categories, such as the right to access personal data (RA), the right to limit processing data (RRP), the data carrying right (RDP), and the right to lodge claims (RLC), are mentioned much less often. Note that the special label "Others" contains all sentences that do not fall under the other 10 labels. Because the corpus annotates entire privacy policy documents, every sentence in a document is explicitly labeled. The "Others" category accounts for 84% of all sentences.
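The "frequency" and "coverage" columns can be computed as in the sketch below. The data layout (a list of policies, each a list of sentence/label pairs) and the toy data are assumptions; the actual values in Table 2 are not reproduced here.

```python
from collections import Counter

def label_statistics(policies):
    """policies: list of policies, each a list of (sentence, label) pairs."""
    frequency = Counter()   # total sentences per label
    containing = Counter()  # number of policies containing each label
    for policy in policies:
        labels = [label for _, label in policy]
        frequency.update(labels)
        containing.update(set(labels))
    # Coverage: percentage of policy documents that contain the label.
    coverage = {
        label: 100.0 * containing[label] / len(policies)
        for label in frequency
    }
    return frequency, coverage

# Hypothetical toy corpus with two policies.
policies = [
    [("s1", "CPI"), ("s2", "CPI"), ("s3", "Others")],
    [("s4", "Others"), ("s5", "RA")],
]
frequency, coverage = label_statistics(policies)
```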
Fleiss' Kappa is an index that measures the consistency of annotation results: the higher its value, the more consistent the three volunteers' annotations of the data. The Fleiss' Kappa values here lie between 0.45 and 0.57, which corresponds to a moderate level of agreement according to the criteria of Landis et al. [11-14]. A comprehensive analysis of the annotated sentences revealed three potential causes of this moderate level of agreement. (1) The data is unbalanced: sentences carrying GDPR-related labels, i.e. the first 10 labels in Table 2, account for only 16% of all sentences. (2) Each single label can be expressed by many different sentence formulations. (3) For some categories, such as "collect personal data (CPI)", some sentences are ambiguous to annotators because there is no explicit boundary between personal and non-personal information. According to GDPR Article 4(1), "personal data means any information relating to an identified or identifiable natural person". There is no quantitative definition, so annotators may make subjective decisions. The following sentence is an example that some annotators mistakenly labeled as "collect personal data (CPI)": "When you use the Service, our servers automatically record log file information, including your web request, browser type, referrer/exit pages and URLs, number of clicks and how you interact with the links on the Service, domain names, landing pages, pages viewed, and other such information."
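For reference, Fleiss' Kappa can be computed with the standard formula κ = (P̄ − P_e) / (1 − P_e), where P̄ is the mean per-item agreement and P_e the agreement expected by chance. The sketch below is a plain-Python version; the function name and the toy ratings are assumptions, and the corpus values quoted above are not recomputed here.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """ratings: one list of labels per sentence, same number of raters each.

    Undefined (division by zero) if every rating falls in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Per-item agreement P_i.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement P_e from the overall category proportions.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Toy examples with three raters and two categories.
perfect = fleiss_kappa([["A", "A", "A"], ["B", "B", "B"]], ["A", "B"])
mixed = fleiss_kappa([["A", "A", "B"], ["A", "B", "B"]], ["A", "B"])
```

In the toy data, full agreement gives a kappa of 1, while the mixed ratings give a negative value, meaning agreement below chance level.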
Secondly, the present embodiment uses three classical classification models: SVM, LSTM, and BERT. The annotated corpus is divided into a training set, a validation set, and a test set in the ratio 8:1:1, and ten-fold cross-validation is used during model training. The SVM model is created with the open-source tool scikit-learn [15], using a linear kernel and a penalty factor of 1.0.
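The data partitioning described above can be sketched in plain Python as follows. How exactly the 8:1:1 split and the ten-fold cross-validation are combined is not stated in the text, so the two helpers are shown independently; the function names, the fixed seed, and the interleaved fold assignment are assumptions.

```python
import random

def split_8_1_1(items, seed=0):
    """Shuffle and split into train/validation/test with ratio 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )

def k_folds(items, k=10):
    """Yield (train, held_out) pairs for k-fold cross-validation."""
    items = list(items)
    for i in range(k):
        # Every k-th item, starting at offset i, forms the held-out fold.
        held_out = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out

# Hypothetical usage with sentence indices standing in for sentences.
train, val, test = split_8_1_1(range(100))
folds = list(k_folds(range(100), k=10))
```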
Both deep learning models, LSTM and BERT, are implemented with the PyTorch 1.2.0 toolkit.
For the LSTM model, 100-dimensional GloVe [16] word vectors pre-trained in 2014 are used, with a vocabulary of 400k. The encoder is a Bi-LSTM, i.e. a bidirectional LSTM that captures features of the input sequence in both the forward and backward directions. The learning rate is 2e-4, the batch size is 4, and the hidden layer has 128 dimensions.
For the BERT model, a pre-trained BERT model (BERT-Base, Uncased) is used to obtain the feature representation of each sentence. The learning rate is 5e-5, the optimizer is Adam [8], and the batch size is 4.
For both the LSTM and BERT models, the original loss function is the cross-entropy loss. Because the labels in the corpus of this embodiment are unbalanced in number, the loss function of each model is changed to a weighted loss function (LossW), which yields better results.
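A weighted cross-entropy loss of this kind can be sketched as below for a single sentence. The text does not say how the class weights are chosen, so the inverse-frequency weighting (normalized to a mean of 1) is an assumption, as are the function names and the illustrative two-class counts.

```python
import math

def class_weights(label_counts):
    """Assumed weighting: inverse class frequency, normalized to mean 1."""
    total = sum(label_counts.values())
    raw = {label: total / count for label, count in label_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {label: w / mean for label, w in raw.items()}

def weighted_cross_entropy(logits, target, weights, labels):
    """Loss for one sentence: -w[target] * log softmax(logits)[target]."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_prob = logits[labels.index(target)] - log_z
    return -weights[target] * log_prob

# Illustrative two-class setting with an 84/16 imbalance.
labels = ["Others", "CPI"]
weights = class_weights({"Others": 84, "CPI": 16})
loss_rare = weighted_cross_entropy([0.0, 0.0], "CPI", weights, labels)
loss_common = weighted_cross_entropy([0.0, 0.0], "Others", weights, labels)
```

With equal logits, a mistake on the rare class costs more than a mistake on the frequent class, which pushes the model toward higher recall on the GDPR-related labels.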
Finally, the experimental results of this example are shown in Table 3, which lists the per-label classification results of the three models together with the macro-average of the overall results. P, R and F denote precision, recall and F1 score, respectively; the labels are given by the abbreviations introduced in the label definitions, and the best value for each label is shown in bold. The results show that BERT performs best overall in terms of F1, followed by LSTM, with SVM performing worst; note that SVM has the highest precision and the lowest recall. Some classes, such as "collect personal data (CPI)", are harder than others because they are more ambiguous, which was also apparent during manual annotation. In particular, "collect personal data (CPI)" is similar to "destination of processing data (DPP)", and the two share many keywords, such as the phrase "collect personal information"; however, CPI focuses on the manner of collection, whereas DPP focuses on the purpose of collection. Both labels are difficult for all models.
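The macro-averaged P, R and F values can be computed as in the following sketch, which averages the per-label scores with equal weight per label. The function name and the toy gold/predicted sequences are assumptions; the values in Table 3 are not reproduced.

```python
def macro_scores(gold, predicted, labels):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    ps, rs, fs = [], [], []
    pairs = list(zip(gold, predicted))
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        ps.append(precision)
        rs.append(recall)
        fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Hypothetical toy example with three labels.
gold = ["CPI", "CPI", "DPP", "Others"]
pred = ["CPI", "DPP", "DPP", "Others"]
macro_p, macro_r, macro_f = macro_scores(gold, pred, ["CPI", "DPP", "Others"])
```

Because every label counts equally, the macro-average is not dominated by the large "Others" class, which is why it suits an unbalanced corpus.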
TABLE 3 SVM, LSTM, BERT cross-validation classification results
[Table 3 appears as an image in the original publication and is not reproduced here.]
It is also observed that the weighted loss function improves both LSTM and BERT. The F1 score improves by more than 5% for LSTM and by more than 4% for BERT. In comparison, BERT performs better on unbalanced data, and the gain from the weighted loss is smaller for BERT than for LSTM. This is because BERT acquires more sentence-level semantic information during pre-training, so the model itself copes better with unbalanced data. For LSTM, all semantic information comes from the training set, so it performs worse on unbalanced data, and adding the weighted loss improves its performance on the unbalanced corpus more. In addition, the BERT results are already better to begin with, which leaves less room for improvement.
It should also be noted that the "Others" class accounts for as much as 84% of the corpus; it is not relevant to consistency detection but can affect the classification results for the other 10 labels. The consistency detection task places more emphasis on recall than on precision: it is more desirable to identify all sentences belonging to the 10 labels, so a model with higher recall on these labels is more useful. The SVM has the highest precision and the lowest recall, while adding the weighted loss function increases the recall of LSTM by more than 10% and the recall of BERT by approximately 8%.
The embodiment uses the BERT + LossW model for consistency detection because it shows the highest and most stable performance. The trained model is tested on the 304 privacy policies, and the test logic follows the rules in Table 1.
The consistency check follows the rule form shown in Table 1, i.e. A → B; if the condition is not met, the check result is false and the corresponding inconsistency problem is reported. The method reports 1180 problems, compared with 1164 problems that actually exist in the test set. Among the reported problems, the most frequent concern the "right to access personal data (RA)" and the "right to limit processing data (RRP)", each of which occurs more than 180 times.
Of the 107 problems that were missed, 73 went undetected because the model did not accurately identify the B part of the rule.
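The reported figures can be cross-checked with a short calculation. It assumes that the 107 missed problems are the false negatives, so that the number of correctly reported problems is the number of real problems minus the missed ones; under that assumption the counts reproduce the stated 90% and 91%.

```python
reported = 1180  # problems reported by the method
actual = 1164    # problems that actually exist in the test set
missed = 107     # real problems that were not reported

# Assumption: every real problem that was not missed was reported correctly.
true_positives = actual - missed
precision = true_positives / reported
recall = true_positives / actual
print(f"precision={precision:.3f}, recall={recall:.3f}")
# precision=0.896, recall=0.908
```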
To further examine the results of the different models, their performance at different sentence lengths was also explored; FIG. 4 compares the F1 score of the three models against sentence length. The performance of the SVM gradually decreases as sentence length increases and drops very quickly beyond a length of 40. The main reason is that the SVM only attends to local information within sentences and learns long sentences poorly, whereas the deep learning models perform better on long sentences.
In summary, the present invention proposes a compliance analysis between GDPR Article 13 and privacy policies. By designing a labeling scheme based on GDPR Article 13, a corpus of 304 privacy policies is created manually. Sentences in a privacy policy are labeled by a classifier, and a rule-based compliance analysis is then carried out on the classification results. The method succeeded in detecting 1180 problems in 304 privacy policy documents.
References:
[1]Wilson,S.,Schaub,F.,Dara,A.A.,Liu,F.,Cherivirala,S.,Leon,P.G.,...&Norton,T.B.(2016,August).The creation and analysis of a website privacy policy corpus.In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers)(pp.1330-1340).
[2]Cortes,C.,&Vapnik,V.(1995).Support-vector networks.Machine learning,20(3),273-297.
[3]Hochreiter,S.,&Schmidhuber,J.(1997).Long short-term memory.Neural computation,9(8),1735-1780.
[4]Devlin,J.,Chang,M.W.,Lee,K.,&Toutanova,K.(2018).Bert:Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805.
[5]Graves,A.,Jaitly,N.,&Mohamed,A.R.(2013,December).Hybrid speech recognition with deep bidirectional LSTM.In 2013 IEEE workshop on automatic speech recognition and understanding(pp.273-278).IEEE.
[6]Chang,C.,Li,H.,Zhang,Y.,Du,S.,Cao,H.,&Zhu,H.(2019,June).Automated and personalized privacy policy extraction under GDPR consideration.In International Conference on Wireless Algorithms,Systems,and Applications(pp.43-54).Springer,Cham.
[7]Ramos,J.(2003,December).Using tf-idf to determine word relevance in document queries.In Proceedings of the first instructional conference on machine learning(Vol.242,pp.133-142).
[8]Kingma,D.P.,&Ba,J.(2014).Adam:A method for stochastic optimization.arXiv preprint arXiv:1412.6980.
[9]Sathyendra,K.M.,Wilson,S.,Schaub,F.,Zimmeck,S.,&Sadeh,N.(2017,September).Identifying the provision of choices in privacy policy text.In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing(pp.2774-2779).
[10]Sun,C.,Qiu,X.,Xu,Y.,&Huang,X.(2019,October).How to fine-tune bert for text classification?.In China National Conference on Chinese Computational Linguistics(pp.194-206).Springer,Cham.
[11]Landis,J.R.,&Koch,G.G.(1977).The measurement of observer agreement for categorical data.biometrics,159-174.
[12]Tesfay,W.B.,Hofmann,P.,Nakamura,T.,Kiyomoto,S.,&Serna,J.(2018,April).I read but don't agree:Privacy policy benchmarking using machine learning and the eu gdpr.In Companion Proceedings of the The Web Conference 2018(pp.163-166).
[13]Lebanoff,L.,&Liu,F.(2018).Automatic detection of vague words and sentences in privacy policies.arXiv preprint arXiv:1808.06219.
[14]Linden,T.,Khandelwal,R.,Harkous,H.,&Fawaz,K.(2020).The privacy policy landscape after the GDPR.Proceedings on Privacy Enhancing Technologies,2020(1),47-64.
[15]Pedregosa,F.,Varoquaux,G.,Gramfort,A.,Michel,V.,Thirion,B.,Grisel,O.,...&Vanderplas,J.(2011).Scikit-learn:Machine learning in Python.the Journal of machine Learning research,12,2825-2830.
[16]Pennington,J.,Socher,R.,&Manning,C.D.(2014,October).Glove:Global vectors for word representation.In Proceedings of the 2014conference on empirical methods in natural language processing(EMNLP)(pp.1532-1543).
[17] ZAO privacy policy violation report. https://www.sohu.com/a/339247803_100165512
[18] Https:// GDPR-info
[19] California Consumer Privacy Act (CCPA). https://oag.ca.gov/privacy/ccpa
[20] Data Protection Act. https://www.gov.uk/data-protection
[21] Gerl, A., & Meier, B. (2019). The Layered Privacy Language Art. 12–14 GDPR Extension – Privacy Enhancing User Interfaces. Datenschutz und Datensicherheit - DuD, 43(12), 747-752.
[22] McDonald, A. M., & Cranor, L. F. (2008). The cost of reading privacy policies. ISJLP, 4, 543.
[23] https://medium.com/@a.hanff/chinas-surveillance-social-credit-system-alive-kicking-in-berlin-6c2b3b10b197
[24] Yang, J., Zhang, Y., Li, L., & Li, X. (2017). YEDDA: A lightweight collaborative text span annotation tool. arXiv preprint arXiv:1711.03759.
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (3)

1. A method for detecting privacy policy compliance, comprising the following steps:
(1) rule extraction: reading the General Data Protection Regulation (GDPR), sorting out the logical content of the privacy protection law, and defining labels according to that logical content;
(2) text classification: first, screening the privacy policies to remove non-English documents and terms-of-service documents; annotating each sentence of the screened privacy policies with a label to complete the construction of the corpus; then training a text classification model on the corpus, such that the trained text classification model can classify the sentences of any privacy policy and predict a label for each sentence;
(3) rule matching: matching the label content extracted as rules in step (1) against the text classification results of step (2), namely the predicted labels, to obtain the final compliance detection result.
2. The method of claim 1, wherein the logical content in step (1) concerns the collection of personal data of the data subject by the data controller and the information that the data controller must provide to the data subject; the information to be provided to the data subject by the data controller includes the data storage period, the purposes of processing, contact information, the right of access to personal data, the right to rectify or erase personal data, the right to restrict processing, the right to object to processing, the right to data portability, and the right to lodge a complaint.
3. The method of claim 1, wherein in step (2), 10 privacy policy topic labels are formulated according to the GDPR and the OPP-115 corpus; privacy policies are collected from popular applications on Google Play and then screened; and the text classification model comprises SVM, LSTM, and BERT.
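As a rough illustration of the screening and sentence-labeling parts of step (2), the following sketch filters out non-English and terms-of-service documents and then predicts one label per sentence. Everything here is an assumption made for illustration: the helper names, the ASCII-letter heuristic for English, the title-based heuristic for terms of service, the regular-expression sentence splitter, and in particular `keyword_stub`, a toy keyword matcher that merely stands in for a trained SVM, LSTM, or BERT model.

```python
import re
from typing import Callable, Dict, List, Tuple


def is_english(text: str, threshold: float = 0.9) -> bool:
    """Heuristic: share of ASCII letters among all alphabetic characters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_count = sum(1 for ch in letters if ch.isascii())
    return ascii_count / len(letters) >= threshold


def is_terms_of_service(title: str) -> bool:
    """Heuristic: the title mentions terms but not privacy."""
    lowered = title.lower()
    return "terms" in lowered and "privacy" not in lowered


def screen(documents: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only English documents that are not terms-of-service texts."""
    return [
        doc for doc in documents
        if is_english(doc["text"]) and not is_terms_of_service(doc["title"])
    ]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


def label_sentences(text: str, classify: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Apply any sentence classifier to every sentence of a policy."""
    return [(sentence, classify(sentence)) for sentence in split_sentences(text)]


def keyword_stub(sentence: str) -> str:
    """Toy stand-in for a trained model; NOT the classifier of the invention."""
    lowered = sentence.lower()
    if "collect" in lowered:
        return "collect_personal_data"
    if "contact" in lowered:
        return "contact_information"
    return "other"


documents = [
    {"title": "Privacy Policy",
     "text": "We collect your email address. You may contact us at any time."},
    {"title": "Terms of Service", "text": "You agree to these terms."},
    {"title": "隐私政策", "text": "我们收集您的电子邮件地址。"},
]
kept = screen(documents)
labeled = label_sentences(kept[0]["text"], keyword_stub)
```

Because `label_sentences` accepts any callable that maps a sentence to a label, a trained model can be swapped in for the stub without changing the surrounding code.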
CN202110480404.5A 2021-04-30 2021-04-30 Privacy policy compliance detection method Pending CN113220877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110480404.5A CN113220877A (en) 2021-04-30 2021-04-30 Privacy policy compliance detection method

Publications (1)

Publication Number Publication Date
CN113220877A (en) 2021-08-06

Family ID=77090375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110480404.5A Pending CN113220877A (en) 2021-04-30 2021-04-30 Privacy policy compliance detection method

Country Status (1)

Country Link
CN (1) CN113220877A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112068844A (en) * 2020-09-09 2020-12-11 西安交通大学 APP privacy data consistency behavior analysis method facing privacy protection policy
CN112131385A (en) * 2020-09-15 2020-12-25 天津大学 Structure analysis method of privacy policy
CN112364165A (en) * 2020-11-12 2021-02-12 上海犇众信息技术有限公司 Automatic classification method based on Chinese privacy policy terms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHUANG LIU: "Have You been Properly Notified? Automatic Compliance Analysis of Privacy Policy Text with GDPR Article 13", 30th World Wide Web Conference (WWW) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210806