CN112364165A

CN112364165A - Automatic classification method based on Chinese privacy policy terms

Info

Publication number: CN112364165A
Application number: CN202011261262.5A
Authority: CN
Inventors: 朱璋颖; 陆亦恬; 唐祝寿
Original assignee: Shanghai Benzhong Information Technology Co ltd
Current assignee: Shanghai Benzhong Information Technology Co ltd
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2021-02-12

Abstract

The invention provides an automatic classification method based on Chinese privacy policy terms, which belongs to the technical field of natural language processing and comprises the following steps. Data processing: acquiring the privacy policies of a plurality of applications as a data set, manually labeling the data set to obtain a labeled data set, and then cleaning the data set to obtain a training sample data set. Data training: performing feature selection on the training sample data set, selecting effective features capable of distinguishing different clause categories, and establishing a detection model. Data detection: determining whether a privacy policy text is complete. By automatically classifying privacy policy terms, the method quickly sorts privacy policy content under the attributes of the various classification categories, which makes the policy easier for a user to read and understand; it also realizes completeness detection of the privacy policy terms, so that the user can quickly identify whether a privacy policy is complete.

Description

Automatic classification method based on Chinese privacy policy terms

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to an automatic classification method based on Chinese privacy policy terms.

Background

As the number of people using apps grows, leaks of users' private data keep occurring. An app privacy policy is a statement that discloses how user data is collected, used, shared, and managed, and is a self-regulatory measure taken by the app operator when collecting user information. However, privacy policies are lengthy, difficult to understand, and time-consuming to read, and most users agree to them without reading them, so problems in a privacy policy are likely to become weak points in the user's privacy security. In view of this, the content of Chinese privacy policy terms needs to be classified automatically, both to help users read Chinese privacy policies and to reflect the quality of those policies.

In the prior art, the content of Chinese privacy policies is analyzed by a content analysis method. The content of the Chinese privacy policies is classified and counted, the characteristics of each analysis dimension and their interrelations are summarized, and comparisons are made according to the research target to draw conclusions about, for example, the current state of Chinese privacy policies. Coding is the key step of content analysis, but coding a large amount of content is cumbersome, and because manual coding introduces errors, the internal validity of the content analysis is low. Therefore, a simple and effective detection method is needed that finds the problems in Chinese privacy policies quickly, accurately, and automatically, and that improves their readability for users.

Disclosure of Invention

The invention aims to provide an automatic classification method based on Chinese privacy policy terms, so as to solve the technical problems in the prior art that coding a large amount of content is cumbersome and that the internal validity of content analysis is low.

In order to achieve the purpose, the invention adopts the technical scheme that: the automatic classification method based on the Chinese privacy policy terms comprises the following steps:

data processing: acquiring the privacy policies of a plurality of applications as a data set, labeling the terms of the privacy policies to obtain a labeled data set, and then cleaning the data set to obtain a training sample data set;

data training: performing feature selection on the training sample data set, selecting effective features capable of distinguishing different clause categories, training a classifier based on the feature vectors of the clauses of each category, and establishing a detection model;

data detection: receiving a privacy policy text through the detection model, classifying the clause content of the privacy policy text under the attributes of the various categories, and judging whether the privacy policy text is complete.

Further, the data processing includes:

acquiring data;

establishing a data annotation standard according to the requirements of laws and regulations, wherein the data annotation standard comprises all the terms that laws and regulations require a privacy policy to cover;

labeling the data;

and removing noise words from the data and performing word segmentation with a word segmentation tool to obtain a segmented, labeled clause data set.

Further, the data annotation standard comprises a number of classification categories, wherein the classification categories include at least one of first-party collection/use, sharing/transfer/disclosure with third parties, data security, user access/editing/deletion methods, term changes, terms directed at specific groups of people, and other general information.

Further, the data annotation standard contains 7 classification categories, 50 attributes, and 91 values.

Further, the data training comprises:

performing feature selection on the training sample data set through the TF-IDF algorithm, wherein the calculation formula is as follows:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

for the i-th word $t_i$, the TF formula is:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

in the above formula, $n_{i,j}$ is the number of occurrences of the word $t_i$ in the j-th file $d_j$; the denominator is the sum of the occurrence counts of all words in file $d_j$, where $n_{k,j}$ denotes the number of occurrences of the k-th word in file $d_j$; and $tf_{i,j}$ denotes the term frequency of the word $t_i$ in file $d_j$;

the IDF formula is:

$$idf_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency of the word $t_i$;

$|D|$ denotes the total number of files in the corpus;

$|\{j : t_i \in d_j\}|$ denotes the number of files that contain the word $t_i$.

Further, the data detection includes:

calculating classification probability: for each classification category i, using the support vector machine classifier trained for that category to calculate the probability that y = i for the privacy policy text, where i = 1, 2, 3, …, k and k is the number of categories;

selecting categories: for a given new input x, taking the classification category whose trained classifier predicts the highest probability of y = i as the classification category of the new input x.

Further, the word segmentation tool is a Jieba word segmentation tool.

Further, the noise words in the data are removed using the Harbin Institute of Technology (HIT) stop word list.

Compared with the prior art, the automatic classification method based on Chinese privacy policy terms provided by the invention has the following beneficial effects: by automatically classifying privacy policy terms, privacy policy content is quickly sorted under the attributes of the various classification categories, which makes the policy easier for the user to read and understand; in addition, completeness detection of the privacy policy terms is realized, so that the user can quickly identify whether a privacy policy is complete.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of an automatic classification method based on Chinese privacy policy terms according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a classification stage of a support vector machine in an automatic classification method based on Chinese privacy policy terms according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the inputs and outputs of the three stages of support vector machine classification in an automatic classification method based on Chinese privacy policy terms according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to FIG. 1 to FIG. 3, the automatic classification method based on Chinese privacy policy terms according to the present invention will now be described. The method comprises the following steps:

S1, data processing: acquiring the privacy policies of a plurality of applications as a data set, labeling the terms of the privacy policies to obtain a labeled data set, and then cleaning the data set to obtain a training sample data set;

The applications are obtained from application markets or from the download channels on the official websites of the applications.

Cleaning the data set includes word segmentation and removal of noise words.

The specific implementation of the step can be as follows:

S1.1, acquiring data;

Privacy policies are obtained by web crawlers from mobile application markets and/or internet websites.

More specifically, the privacy policies of 100 popular applications in the application market are obtained through the web crawler.

S1.2, establishing a data annotation standard according to the requirements of laws and regulations, wherein the data annotation standard comprises all the terms that laws and regulations require a privacy policy to cover;

The data annotation standard comprises a number of classification categories, wherein the classification categories include at least one of first-party collection/use, sharing/transfer/disclosure with third parties, data security, user access/editing/deletion methods, term changes, terms directed at specific groups of people, and other general information.

More specifically, the data annotation standard contains 7 classification categories, 50 attributes, and 91 values. A classification category represents a basic class, such as first-party collection/use or sharing with third parties, and is denoted by a tag such as First-Party-Collect-Use or Third-Party-Share. An attribute represents specific content within a basic class; for example, first-party collection/use is further divided into the purpose of collection, user choice, and so on, denoted by tags such as First-Party-Collection-User-Purpose and First-Party-Collection-User-Collection. Corresponding value options are designed for each attribute; for example, the 'whether interactive' attribute has two values, 'yes' and 'no'.

S1.3, marking the data;

Once the classification standard is fully understood, the privacy policies are labeled using the online annotation tool BRAT. The consistency of the annotation is tested with Cohen's kappa coefficient, which shows that the annotated content is reliable.

More specifically, this step ultimately yields a data set of 100 Chinese privacy policies whose terms carry 11,440 category and attribute labels.
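The consistency test mentioned above can be illustrated with a short sketch. The patent does not give its computation, so the following assumes two annotators who each assign one category tag per clause and uses the standard definition kappa = (p_o - p_e) / (1 - p_e); the function name `cohens_kappa` and the sample labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical annotations of four clauses (illustrative data only).
annotator_1 = ["First-Party-Collect-Use", "Third-Party-Share",
               "First-Party-Collect-Use", "Third-Party-Share"]
annotator_2 = ["First-Party-Collect-Use", "Third-Party-Share",
               "Third-Party-Share", "Third-Party-Share"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value near 1 indicates strong agreement beyond chance, while a value near 0 indicates agreement no better than chance.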

S1.4, removing noise words from the data and performing word segmentation with a word segmentation tool to obtain a segmented, labeled clause data set.

A stop word list is used to remove noise from the privacy policy terms, which reduces the noise words in the privacy policy and improves the subsequent classification. Illustratively, the stop word list is the Harbin Institute of Technology (HIT) stop word list.

Word segmentation is performed with the Jieba word segmentation tool to obtain the segmented, labeled clause data set. Noise words are words that appear frequently but carry no practical meaning, such as "and" and "even".
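As a minimal sketch of the noise-word removal in S1.4, the snippet below filters an already segmented clause against a stop-word set. The stop-word set and the token list are hypothetical stand-ins: in practice the tokens could come from a segmenter call such as `jieba.lcut(clause)`, and the stop words would be loaded from a stop word list file.

```python
def remove_noise_words(tokens, stop_words):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [tok for tok in tokens if tok.strip() and tok not in stop_words]

# Hypothetical stop-word set (a real list would be read from a file).
stop_words = {"的", "和", "及", "，", "。"}

# Tokens as a segmenter might return them, e.g. from jieba.lcut(clause).
tokens = ["我们", "收集", "和", "使用", "您", "的", "个人信息", "。"]

cleaned = remove_noise_words(tokens, stop_words)
# Space-separated form, convenient as input to feature extraction.
segmented_clause = " ".join(cleaned)
```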

S2, data training: performing feature selection on the training sample data set, selecting effective features capable of distinguishing different clause categories, training a classifier based on the feature vectors of the clauses of each category, and establishing a detection model;

This step can be implemented as follows:

Feature selection is performed on the training sample data set through the TF-IDF algorithm, and effective features capable of distinguishing different clause classification categories are selected;

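the calculation formula is as follows:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

for the i-th word $t_i$, the TF formula is:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the word $t_i$ in the j-th file $d_j$ and the denominator is the sum of the occurrence counts of all words in file $d_j$;

the IDF formula is:

$$idf_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency of the word $t_i$;

$|D|$ denotes the total number of files in the corpus;

$|\{j : t_i \in d_j\}|$ denotes the number of files that contain the word $t_i$.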
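The TF-IDF weighting described above can be sketched in a few lines of standard-library Python. This is an illustration only: it assumes each document is already a list of segmented words, uses a base-10 logarithm for the IDF, and the function name `tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document; each document is a list of words."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            # tf = count / total words in the document; idf = log10(|D| / df)
            word: (count / total) * math.log10(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Three toy segmented clauses (illustrative data only).
docs = [
    ["收集", "个人信息", "目的"],
    ["共享", "个人信息", "第三方"],
    ["安全", "措施", "个人信息"],
]
weights = tf_idf(docs)
```

A word that occurs in every document (here "个人信息") receives weight 0, while a word confined to one document keeps a positive weight, which is what makes the weighting useful for distinguishing categories. Library implementations may use a different IDF variant; for example, scikit-learn's `TfidfVectorizer` applies a smoothed natural-log IDF by default, so its values will not match this sketch.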

The classification algorithm mainly adopted by the invention is a support vector machine algorithm.

Because the data set is a multi-label data set, which differs from an ordinary binary classification problem, the classifier is constructed with a one-vs-all strategy, implemented specifically with the OneVsRestClassifier in the scikit-learn toolkit. Since the number of samples is small and the number of features is large, a linear support vector machine is used, and a kernel function maps the finite-dimensional space to a higher-dimensional space so that the data become linearly separable.

The specific method is as follows:

One class is marked as the positive class ($y = 1$) and all other classes are marked as the negative class; this model is denoted

$$h_\theta^{(1)}(x)$$

Similarly, the second class is then marked as the positive class ($y = 2$) and all other classes as the negative class; this model is denoted

$$h_\theta^{(2)}(x)$$

And so on.

Finally, a series of models is obtained, abbreviated as:

$$h_\theta^{(i)}(x) = p(y = i \mid x; \theta)$$

where $i = 1, 2, 3, \ldots, k$ and $k$ is the number of categories.

When a prediction is needed, all classifiers are run once, and for each input variable the output variable with the highest probability is selected. Finally, a support vector machine classifier is trained for each category according to the one-vs-all strategy:

$$h_\theta^{(i)}(x)$$

where $i$ corresponds to each possible $y = i$; to make a prediction, a new value of $x$ is input.

$x$ is input to each classification model, and the $i$ that maximizes $h_\theta^{(i)}(x)$ is selected, i.e.

$$\max_i h_\theta^{(i)}(x)$$
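The selection rule of the one-vs-all strategy can be sketched as follows. To keep the example self-contained, the per-category classifiers are replaced by toy keyword scorers; these scorers, the function names, and the sample clause are all hypothetical stand-ins for trained support vector machine classifiers, and only the "run every classifier, keep the highest score" logic reflects the text.

```python
def predict_category(x, classifiers):
    """Run every per-category scorer on x and return the category with the highest score."""
    scores = {category: score(x) for category, score in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores

def keyword_scorer(keywords):
    # Toy stand-in for a trained classifier:
    # fraction of the category's keywords that appear in the clause.
    return lambda words: sum(w in words for w in keywords) / len(keywords)

# One scorer per category, mirroring the one-vs-all setup (hypothetical keywords).
classifiers = {
    "First-Party-Collect-Use": keyword_scorer({"collect", "use", "purpose"}),
    "Third-Party-Share": keyword_scorer({"share", "third", "party"}),
}

clause = {"we", "share", "data", "with", "third", "party", "partners"}
category, scores = predict_category(clause, classifiers)
```

With a real model, each entry in `classifiers` would wrap the decision function or probability estimate of the classifier trained for that category.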

S3, data detection: receiving a privacy policy text through the detection model, classifying the clause content of the privacy policy text under the attributes of the various categories, and judging whether the privacy policy text is complete.

This step can be implemented as follows:

S3.1, calculating classification probability: for each classification category i, using the support vector machine classifier trained for that category to calculate the probability that y = i for the privacy policy text, where i = 1, 2, 3, …, k and k is the number of categories;

S3.2, selecting categories: for a given new input x, taking the classification category whose trained classifier predicts the highest probability of y = i as the classification category of the new input x.

On this basis, the completeness check of the privacy policy is performed.
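One way to picture the completeness check is as a coverage test over the predicted categories. The sketch below assumes, as a simplification not stated in the patent, that a policy counts as complete when every one of the seven classification categories is assigned to at least one clause; the function name and the sample predictions are hypothetical.

```python
# The seven classification categories named in the description.
REQUIRED_CATEGORIES = [
    "first-party collection/use",
    "sharing/transfer/disclosure with third parties",
    "data security",
    "user access/editing/deletion methods",
    "term changes",
    "terms directed at specific groups of people",
    "other general information",
]

def check_completeness(predicted_categories, required=REQUIRED_CATEGORIES):
    """Return (is_complete, missing) given the category predicted for each clause."""
    covered = set(predicted_categories)
    missing = [c for c in required if c not in covered]
    return len(missing) == 0, missing

# Hypothetical classifier output for the clauses of one policy.
predictions = [
    "first-party collection/use",
    "data security",
    "first-party collection/use",
    "term changes",
]
is_complete, missing = check_completeness(predictions)
```

Reporting the missing categories, rather than only a yes/no result, lets a reader see at a glance which required terms the policy lacks.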

Compared with the prior art, the automatic classification method based on Chinese privacy policy terms provided by the invention automatically classifies privacy policy terms, so that privacy policy content is quickly sorted under the attributes of the various classification categories, which makes the policy easier for the user to read and understand; in addition, completeness detection of the privacy policy terms is realized, so that the user can quickly identify whether a privacy policy is complete.

The invention provides a specific implementation mode, which comprises the following steps:

Chinese privacy policy data is obtained from the Huawei application market, a classification standard for privacy policy terms is defined according to the relevant laws and regulations, the content of the privacy policy terms is divided into 7 classification categories, and the corresponding attributes under each category are determined. The classification categories are: first-party collection/use, sharing/transfer/disclosure with third parties, data security, user access/editing/deletion methods, term changes, terms directed at specific groups of people, and other general information;

The corresponding attributes under each category are, for example, as follows: first-party collection/use is further divided into collection purpose, user choice, and so on; sharing/transfer/disclosure with third parties is further divided into sharing mode, constraints on the third party, and so on; data security is further divided into security measures, data retention period, and so on; user access/editing/deletion methods are divided into operation channels, operations the user can perform, and so on; term changes are further divided into reasons for change, notification methods, and so on; terms directed at specific groups of people are divided into user choice, provider actions, and so on; and other general information is divided into the scope of application of the privacy policy, operator information, and so on.

And manually labeling terms according to the classification standard to obtain a labeled Chinese privacy policy data set. The data set is divided into two parts, one part is used for constructing a classifier, and the other part is used for detecting the accuracy of the model.
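The two-part division of the labeled data can be sketched as a shuffled split. The patent does not state the split proportion, so the 80/20 ratio, the fixed seed, the function name, and the placeholder tag `Data-Security` below are all assumptions made for illustration.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle a copy of the labeled samples and split it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical (clause, label) pairs.
clauses = [(f"clause {i}", "Data-Security") for i in range(10)]
train, test = split_dataset(clauses)
```

The training part is used to build the classifier, and the held-out part is used only to measure its accuracy.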

For the labeled Chinese privacy policy data, the labeled clauses of each classification category are segmented with the Jieba word segmentation tool, and the resulting words are separated by spaces. A stop word list is also applied to the Chinese privacy policy data, and words that occur frequently but carry no practical meaning, such as 'the' and 'even', are deleted.

For example, the privacy policy data contains the clause 'In the course of using the cross-border transaction service, if personal information is transmitted overseas, then after separately obtaining your authorization and consent, it is ensured that the overseas recipient processes your personal information in accordance with this policy and strict security measures.' After word segmentation, the clause becomes the same text as a sequence of words separated by spaces. After the noise words are deleted, the result is [ sending information in overseas transaction service, and overseas output authorization meaning ensuring overseas receiving information of strict security measures of root policy instructions for processing ].

And calculating the weight of each phrase in the document vector by adopting a TF-IDF algorithm, comparing the sizes of the phrase weights, and arranging the phrases from large to small according to the weights to obtain the keywords. The obtained keywords have good category distinguishing capability and can be used as characteristic attributes of the category.

The calculation formula is:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

Here, the term frequency (TF) represents how often a term appears in document d. The inverse document frequency (IDF) is a measure of the general importance of a word: the fewer the documents that contain the term t, the larger the IDF, and the better the term t distinguishes between categories. For the i-th word $t_i$, the term frequency (TF) formula is:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the word $t_i$ in document $d_j$ and the denominator is the sum of the occurrence counts of all words in document $d_j$.

The inverse document frequency (IDF) is obtained by dividing the total number of documents by the number of documents containing the term and taking the base-10 logarithm of the quotient:

$$idf_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency of the word $t_i$;

$|D|$ denotes the total number of files in the corpus;

$|\{j : t_i \in d_j\}|$ denotes the number of files that contain the word $t_i$.

The TF-IDF algorithm filters out common words and retains important words (keywords), and the data is further cleaned with a synonym dictionary.
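As a worked illustration with hypothetical numbers, suppose a word $t_i$ occurs 3 times in a document $d_j$ of 100 words and appears in 10 of the 1,000 documents in the corpus. Then:

$$tf_{i,j} = \frac{3}{100} = 0.03, \qquad idf_i = \log_{10} \frac{1000}{10} = 2, \qquad tf_{i,j} \times idf_i = 0.03 \times 2 = 0.06$$

If the same word instead appeared in all 1,000 documents, its IDF would be $\log_{10}(1000/1000) = 0$ and its weight would vanish, which is how common words are filtered out.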

The above steps form the preparation stage, whose input is part of the privacy policy data and whose output is a set of privacy policy data samples with category and attribute labels, together with keywords. The data samples are obtained from the application market and classified. Keywords are the characteristic attributes that can represent a category.

To handle the multi-label classification problem, it is converted into a multi-class problem, and the classifier is constructed with a one-vs-all strategy.

The method is as follows:

One class is marked as the positive class ($y = 1$) and all other classes as the negative class, giving the model $h_\theta^{(1)}(x)$; the second class is then marked as the positive class ($y = 2$) and all other classes as the negative class, giving the model $h_\theta^{(2)}(x)$.

And so on.

Finally, a series of models is obtained, abbreviated as:

$$h_\theta^{(i)}(x) = p(y = i \mid x; \theta)$$

where $i = 1, 2, 3, \ldots, k$ and $k$ is the number of categories.

When a prediction is needed, all classifiers are run once, and for each input variable the output variable with the highest probability is selected.

Finally, a support vector machine classifier is trained for each category according to the one-vs-all strategy:

$$h_\theta^{(i)}(x)$$

where $i$ corresponds to each possible $y = i$; to make a prediction, a new value of $x$ is input. $x$ is input to each classification model, and the $i$ that maximizes $h_\theta^{(i)}(x)$ is selected, i.e.

$$\max_i h_\theta^{(i)}(x)$$

The above is calculated automatically by a program, and the output is a classifier.

Finally, the items to be classified are classified with the classifier to obtain the mapping between the items and the categories, and from that the completeness result for the items.

From the perspective of natural language processing, the method collects privacy policies from an application market, obtains characteristic attributes through analysis, and establishes a detection model by training a classifier. On receiving a privacy policy from the application market, the detection model quickly and accurately classifies each term in the privacy policy, so that it is clear at a glance whether the privacy policy covers all the terms required by the relevant regulations. Completeness detection of the privacy policy is thereby realized, and the readability of the privacy policy is improved at the same time.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. An automatic classification method based on Chinese privacy policy terms is characterized by comprising the following steps:

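data processing: acquiring the privacy policies of a plurality of applications as a data set, manually labeling the terms of the privacy policies to obtain a labeled data set, and then cleaning the data set to obtain a training sample data set;

data training: performing feature selection on the training sample data set, selecting effective features capable of distinguishing different clause categories, training a classifier based on the feature vectors of the clauses of each category, and establishing a detection model;

data detection: receiving a privacy policy text through the detection model, classifying the clause content of the privacy policy text under the attributes of the various categories, and judging whether the privacy policy text is complete.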

2. The method of claim 1, wherein the data processing comprises:

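acquiring data;

establishing a data annotation standard according to the requirements of laws and regulations, wherein the data annotation standard comprises all the terms that laws and regulations require a privacy policy to cover;

labeling the data;

and removing noise words from the data and performing word segmentation with a word segmentation tool to obtain a segmented, labeled clause data set.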

3. The method of claim 2, wherein the data annotation standard comprises a number of classification categories, wherein the classification categories include at least one of first-party collection/use, sharing/transfer/disclosure with third parties, data security, user access/editing/deletion methods, term changes, terms directed at specific groups of people, and other general information.

4. The method of claim 3, wherein the data annotation standard contains 7 classification categories, 50 attributes, and 91 values.

5. The method of claim 3, wherein the data training comprises:

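performing feature selection on the training sample data set through the TF-IDF algorithm, wherein the calculation formula is as follows:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

for the i-th word $t_i$, the TF formula is:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the word $t_i$ in the j-th file $d_j$ and the denominator is the sum of the occurrence counts of all words in file $d_j$;

the IDF formula is:

$$idf_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency of the word $t_i$;

$|D|$ denotes the total number of files in the corpus;

$|\{j : t_i \in d_j\}|$ denotes the number of files that contain the word $t_i$.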

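6. The method of claim 5, wherein the data detection comprises:

calculating classification probability: for each classification category i, using the support vector machine classifier trained for that category to calculate the probability that y = i for the privacy policy text, where i = 1, 2, 3, …, k and k is the number of categories;

selecting categories: for a given new input x, taking the classification category whose trained classifier predicts the highest probability of y = i as the classification category of the new input x.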

7. The method of claim 2, wherein the word segmentation tool is the Jieba word segmentation tool.

8. The method of claim 2, wherein the noise words in the data are removed using the Harbin Institute of Technology (HIT) stop word list.