CN112364165A - Automatic classification method based on Chinese privacy policy terms - Google Patents
- Publication number
- CN112364165A (application CN202011261262.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- privacy policy
- data set
- word
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/64—Protecting data integrity, e.g. using checksums, certificates or signatures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an automatic classification method based on Chinese privacy policy terms, which belongs to the technical field of natural language processing and comprises the following steps: data processing: acquiring the privacy policies of a plurality of applications as a data set, manually labeling the data set to obtain a labeled data set, and then cleaning the data set to obtain a training sample data set; training data: performing feature selection on the training sample data set, selecting effective features capable of distinguishing different clauses and categories, and establishing a detection model; and determining whether a privacy policy text is complete. By automatically classifying privacy policy terms, the method quickly sorts privacy policy content under the attributes of each classification category, which makes the policy easier for users to read and understand; it also detects the completeness of the privacy policy terms, so that users can quickly identify whether a privacy policy is complete.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to an automatic classification method based on Chinese privacy policy terms.
Background
As the number of people using apps grows, leaks of users' private data occur again and again. An app privacy policy is a statement disclosing how user data is collected, used, shared, and managed, and is a self-regulatory measure adopted by the app operator for collecting user information. However, privacy policies are long, hard to understand, and time-consuming to read, and most users agree to them without reading them, so problems in a privacy policy can easily become vulnerabilities in users' privacy security. To help users read Chinese privacy policies and to reflect the quality of those policies, the content of Chinese privacy policy terms needs to be classified automatically.
In the prior art, the content of Chinese privacy policies is analyzed by the content analysis method. The content of the policies is classified and counted, the characteristics of the analysis dimensions and their interrelations are summarized, and comparisons are made according to the research objective to draw conclusions about, for example, the current state of Chinese privacy policies. Coding is the key step of content analysis, but coding a large amount of content is cumbersome, and because manual coding introduces errors, the internal validity of the content analysis is low. A simple and effective detection method is therefore needed to find the problems in Chinese privacy policies quickly, accurately, and automatically, and to improve their readability for users.
Disclosure of Invention
The invention aims to provide an automatic classification method based on Chinese privacy policy terms, so as to solve the technical problems in the prior art that coding a large amount of content is cumbersome and that the internal validity of content analysis is low.
To achieve this purpose, the invention adopts the following technical scheme: the automatic classification method based on Chinese privacy policy terms comprises the following steps:
data processing: acquiring the privacy policies of a plurality of applications as a data set, labeling the terms of the privacy policies to obtain a labeled data set, and then cleaning the data set to obtain a training sample data set;
training data: performing feature selection on the training sample data set, selecting effective features capable of distinguishing different clauses and categories, training a classifier based on the feature vectors of the clauses of each category, and establishing a detection model;
data detection: receiving a privacy policy text through the detection model, classifying the clause content of the privacy policy text under the attributes of each category, and judging whether the privacy policy text is complete.
Further, the data processing includes:
acquiring data;
establishing a data marking standard according to the requirements of laws and regulations, wherein the data marking standard comprises all terms required to be completely covered by privacy policies in the laws and regulations;
labeling the data;
removing noise words from the data, and performing word segmentation with a word segmentation tool to obtain a labeled, segmented clause data set.
Further, the data annotation standard comprises a number of classification categories, wherein the classification categories include at least one of first party collection/use, sharing/transfer/disclosure with third parties, data security, user access/editing/deletion methods, term changes, terms facing a particular demographic group, and other general information.
Further, the data annotation standard contains 7 classification categories, 50 attributes and 91 values.
Further, the data training comprises:
performing feature selection on the training sample data set through the TF-IDF algorithm, wherein the calculation formula is as follows:
$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$
for the $i$-th word $t_i$, the TF formula is:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
in the above formula, $n_{i,j}$ is the number of occurrences of the word $t_i$ in the $j$-th file $d_j$; the denominator is the sum of the numbers of occurrences of all words in the file $d_j$, where $n_{k,j}$ denotes the number of occurrences of the $k$-th word in the file $d_j$; $tf_{i,j}$ denotes the term frequency of the word $t_i$ in the file $d_j$;
the IDF formula is:
$$idf_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$
wherein $idf_i$ denotes the inverse document frequency of the word $t_i$;
$|D|$ denotes the total number of files in the corpus;
$|\{j : t_i \in d_j\}|$ denotes the number of files containing the word $t_i$.
Further, the data detection includes:
calculating classification probability: for each classification category i, calculating the probability that the support vector machine classifier trained for that category predicts y = i for the privacy policy text, where i = 1, 2, 3, …, k and k is the number of categories;
selecting the category: for a given new input x, taking the classification category whose classifier predicts y = i with the highest probability as the classification category of x.
Further, the word segmentation tool is a Jieba word segmentation tool.
Further, the noise words in the data are removed using the Harbin Institute of Technology (HIT) stop word list.
The automatic classification method based on Chinese privacy policy terms provided by the invention has the following beneficial effects: compared with the prior art, the method automatically classifies privacy policy terms and thereby quickly sorts privacy policy content under the attributes of each classification category, which makes the policy easier for users to read and understand; it also detects the completeness of the privacy policy terms, so that users can quickly identify whether a privacy policy is complete.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an automatic classification method based on Chinese privacy policy terms according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a classification stage of a support vector machine in an automatic classification method based on Chinese privacy policy terms according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the inputs and outputs of the three stages of support vector machine classification in the automatic classification method based on Chinese privacy policy terms according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to FIG. 1 to FIG. 3, the automatic classification method based on Chinese privacy policy terms according to the present invention will now be described. The method comprises the following steps:
S1, data processing: acquiring the privacy policies of a plurality of applications as a data set, labeling the terms of the privacy policies to obtain a labeled data set, and then cleaning the data set to obtain a training sample data set.
The applications are obtained from application markets or from the download channels on the official websites of the applications.
Cleaning the data set includes word segmentation and noise-word removal.
This step can be implemented as follows:
S1.1, acquiring data.
Privacy policies are obtained by web crawlers from mobile application markets and/or internet websites.
More specifically, the privacy policies of 100 popular applications in the application market are obtained by a web crawler.
S1.2, establishing a data marking standard according to the requirements of laws and regulations, wherein the data marking standard comprises all terms required to be completely covered by privacy policies in the laws and regulations;
The data annotation standard comprises a number of classification categories, wherein the classification categories include at least one of first party collection/use, sharing/transfer/disclosure with third parties, data security, user access/editing/deletion methods, term changes, terms facing a particular demographic group, and other general information.
More specifically, the data annotation standard contains 7 classification categories, 50 attributes, and 91 values. A classification category represents a basic class, such as first party collection/use or sharing with third parties, and is represented by a tag such as First-Party-Collect-Use or Third-Party-Share. An attribute represents specific content within a basic class; for example, first party collection/use is further divided into collection purpose, user choice, and so on, which are represented by tags such as First-Party-Collection-User-Purpose and First-Party-Collection-User-Collection. Corresponding value options are designed for each attribute; for example, the "whether interactive" attribute is given the two values "yes" and "no".
S1.3, labeling the data.
On the basis of a full understanding of the classification standard, the privacy policies are labeled with the online annotation tool BRAT. The consistency of the annotation is tested with Cohen's kappa coefficient, which shows that the annotated content is reliable.
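The agreement test mentioned above can be sketched as follows. This is a minimal illustration of Cohen's kappa for two annotators, not the procedure actually used for the patent's data set; the function name `cohens_kappa` and the two short label lists are hypothetical, and only the tag names are taken from the text.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations of four clauses by two annotators.
annotator_1 = ["First-Party-Collect-Use", "Third-Party-Share",
               "First-Party-Collect-Use", "Third-Party-Share"]
annotator_2 = ["First-Party-Collect-Use", "Third-Party-Share",
               "Third-Party-Share", "Third-Party-Share"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it is a common choice for checking annotation consistency.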
More specifically, this step finally yields a data set of 100 Chinese privacy policies containing 11,440 category and attribute tags.
S1.4, removing noise words from the data, and performing word segmentation with a word segmentation tool to obtain a labeled, segmented clause data set.
A stop word list is used to remove noise from the privacy policy terms, which reduces the noise words in the privacy policy and improves the subsequent classification. Illustratively, the stop word list is the Harbin Institute of Technology (HIT) stop word list.
The cleaned data is segmented with the Jieba word segmentation tool to obtain a labeled, segmented clause data set. Noise words are words that appear frequently but carry no practical meaning, such as "and" and "even".
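The cleaning step can be illustrated with a small self-contained sketch. Forward maximum matching is used here purely as a stand-in for the Jieba tool named in the text, so that the example runs without third-party packages; the vocabulary, the stop word excerpt, and the sample clause are all hypothetical. The sketch segments first and then filters tokens against the stop word list.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: a simple stand-in for a real segmenter such as Jieba."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_noise_words(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical vocabulary and stop word excerpt (not the real HIT list).
VOCABULARY = {"我们", "收集", "使用", "您", "个人信息"}
STOP_WORDS = {"和", "的", "，", "。"}

clause = "我们收集和使用您的个人信息。"
tokens = segment(clause, VOCABULARY)
cleaned = remove_noise_words(tokens, STOP_WORDS)
print(" ".join(cleaned))  # words separated by spaces, as the text describes
```

In practice the segmenter and stop word list named in the text would replace both hypothetical pieces; the filtering logic stays the same.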
S2, training data: performing feature selection on the training sample data set, selecting effective features capable of distinguishing different clauses and categories, training a classifier based on the feature vectors of the clauses of each category, and establishing a detection model.
This step can be implemented as follows:
Feature selection is performed on the training sample data set through the TF-IDF algorithm, selecting effective features capable of distinguishing the different clause classification categories.
The calculation formula is as follows:
$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$
For the $i$-th word $t_i$, the TF formula is:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
In the above formula, $n_{i,j}$ is the number of occurrences of the word $t_i$ in the $j$-th file $d_j$; the denominator is the sum of the numbers of occurrences of all words in the file $d_j$, where $n_{k,j}$ denotes the number of occurrences of the $k$-th word in the file $d_j$; $tf_{i,j}$ denotes the term frequency of the word $t_i$ in the file $d_j$.
The IDF formula is:
$$idf_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $idf_i$ denotes the inverse document frequency of the word $t_i$;
$|D|$ denotes the total number of files in the corpus;
$|\{j : t_i \in d_j\}|$ denotes the number of files containing the word $t_i$.
The classification algorithm mainly adopted by the invention is the support vector machine algorithm.
The data set is a multi-label data set, which differs from the ordinary binary classification problem, so the classifier is constructed with a one-vs-all strategy, implemented specifically with the OneVsRestClassifier in the scikit-learn toolkit. Because the number of samples is small and the number of features is large, a linear support vector machine is considered, and a kernel function is used to map the finite-dimensional space to a high-dimensional space in which the data can be linearly classified.
The specific method is as follows:
One of the classes is marked as the positive class ($y = 1$) and all other classes are marked as negative; this model is denoted $h^{(1)}(x)$.
Then, similarly, the second class is marked as the positive class ($y = 2$) and all other classes are marked as negative; this model is denoted $h^{(2)}(x)$, and so on.
Finally, a series of models is obtained, abbreviated as $h^{(i)}(x) = P(y = i \mid x)$, where $i = 1, 2, 3, \ldots, k$ and $k$ is the number of categories.
When a prediction is needed, all classifiers are run once, and for each input the output with the highest probability is selected. In summary, under the one-vs-all strategy a support vector machine classifier $h^{(i)}(x)$ is trained for each $i$, where $i$ corresponds to each possible $y = i$, and a new value of $x$ is input to make the prediction.
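The one-vs-all procedure described above can be sketched in pure Python. This is an illustrative stand-in, not the scikit-learn OneVsRestClassifier the text names: each binary model is a small linear SVM trained by sub-gradient descent on the regularized hinge loss, and raw decision scores are compared in place of probabilities. The hyperparameters, the three-dimensional feature vectors, and the class names are all assumptions made for the example.

```python
def train_binary_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Linear SVM via sub-gradient descent on the regularized hinge loss.

    X is a list of feature vectors; y holds +1 / -1 targets.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            margin = target * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Shrink the weights (gradient of the L2 regularizer).
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: push this sample's score toward its target.
                w = [wi + lr * target * xi for wi, xi in zip(w, x)]
                b += lr * target
    return w, b


def train_one_vs_all(X, labels):
    """One binary classifier per class: that class is positive, all others negative."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else -1 for label in labels]
        models[cls] = train_binary_svm(X, targets)
    return models


def predict(models, x):
    """Run every classifier on x and return the class with the highest score."""
    def score(model):
        w, b = model
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=lambda cls: score(models[cls]))


# Hypothetical TF-IDF-like vectors for six clauses in three categories.
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
     [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
     [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels = ["collect", "collect", "share", "share", "security", "security"]

models = train_one_vs_all(X, labels)
predictions = [predict(models, x) for x in X]
print(predictions)
```

Comparing decision scores is a common simplification for SVMs, which do not output probabilities directly; a calibrated probability estimate could be substituted if the probability wording of the text is to be followed literally.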
S3, data detection: receiving a privacy policy text through the detection model, classifying the clause content of the privacy policy text under the attributes of each category, and judging whether the privacy policy text is complete.
This step can be implemented as follows:
S3.1, calculating classification probability: for each classification category i, calculating the probability that the support vector machine classifier trained for that category predicts y = i for the privacy policy text, where i = 1, 2, 3, …, k and k is the number of categories;
S3.2, selecting the category: for a given new input x, taking the classification category whose classifier predicts y = i with the highest probability as the classification category of x.
The privacy policy completeness check is performed on this basis.
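One plausible reading of the completeness check is a coverage test: a policy is complete when its classified clauses cover every required category. The sketch below assumes that reading and further assumes that all seven categories listed in the annotation standard are required; the function name and the sample predictions are hypothetical.

```python
# Assumption: all seven categories from the annotation standard must be covered.
REQUIRED_CATEGORIES = [
    "first party collection/use",
    "sharing/transfer/disclosure with third parties",
    "data security",
    "user access/editing/deletion methods",
    "term changes",
    "terms facing a particular demographic group",
    "other general information",
]


def check_completeness(predicted_categories, required=REQUIRED_CATEGORIES):
    """Return (is_complete, missing) for the categories predicted for one policy."""
    covered = set(predicted_categories)
    missing = [category for category in required if category not in covered]
    return len(missing) == 0, missing


# Hypothetical classifier output for the clauses of one privacy policy.
predicted = [
    "first party collection/use",
    "first party collection/use",
    "sharing/transfer/disclosure with third parties",
    "data security",
    "user access/editing/deletion methods",
    "term changes",
]
is_complete, missing = check_completeness(predicted)
print(is_complete, missing)
```

Returning the list of missing categories, rather than only a yes/no flag, lets a reader see at once which required terms the policy fails to cover.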
Compared with the prior art, the automatic classification method based on Chinese privacy policy terms provided by the invention automatically classifies privacy policy terms and thereby quickly sorts privacy policy content under the attributes of each classification category, which makes the policy easier for users to read and understand; it also detects the completeness of the privacy policy terms, so that users can quickly identify whether a privacy policy is complete.
The invention provides a specific implementation, as follows:
Chinese privacy policy data is obtained from the Huashi application market, a classification standard for privacy policy terms is defined according to the relevant laws and regulations, the content of the privacy policy terms is divided into 7 classification categories, and the corresponding attributes under each category are determined. The classification categories are: first party collection/use, sharing/transfer/disclosure with third parties, data security, user access/editing/deletion methods, term changes, terms facing specific groups of people, and other general information.
The corresponding attributes under each category are, for example, as follows: first party collection/use is further divided into collection purpose, user choice, and so on; sharing/transfer/disclosure with third parties is further divided into sharing mode, constraints on the third party, and so on; data security is further divided into security measures, data storage period, and so on; user access/editing/deletion methods are further divided into operation channels, operations the user can perform, and so on; term changes are further divided into reasons for change, notification methods, and so on; terms facing specific groups are further divided into user choice, provider actions, and so on; and other general information is further divided into scope of application of the privacy policy, operator information, and so on.
The terms are manually labeled according to the classification standard to obtain a labeled Chinese privacy policy data set. The data set is divided into two parts: one part is used to build the classifier, and the other is used to test the accuracy of the model.
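The two-part split and the accuracy test can be sketched as below. The text does not state the split ratio or how samples are assigned, so the 80/20 ratio, the seeded shuffle, and the helper names are assumptions made for illustration only.

```python
import random


def split_dataset(samples, test_ratio=0.2, seed=0):
    """Shuffle with a fixed seed and split into (training part, test part)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, expected):
    """Fraction of items whose predicted category equals the labeled category."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("predicted and expected must be equal-length and non-empty")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Hypothetical data: ten clause identifiers, and four predictions to score.
train_part, test_part = split_dataset(list(range(10)))
score = accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
print(len(train_part), len(test_part), score)
```

Fixing the seed keeps the split reproducible, so the accuracy figure can be compared across runs.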
For the labeled Chinese privacy policy data, the labeled clauses of each classification category are segmented with the Jieba word segmentation tool, and the resulting words are separated by spaces. A stop word list is also applied to the Chinese privacy policy data, and words that occur frequently but carry no practical meaning, such as "the" and "even", are deleted.
For example, the privacy policy data contains the clause "If an overseas transmission of personal information occurs in the process of using the overseas transaction service, then after separately obtaining your authorization and consent, it is ensured that the overseas recipient processes your personal information according to this policy description and strict security measures." The result after word segmentation is [in the process of using the overseas transaction service, if an overseas transmission of personal information is sent, after obtaining the authorization agreement of your alone, it is ensured that the overseas receiver has to process your personal information according to the policy description and strict security measures]. The result after deleting the noise words is [sending information in overseas transaction service, and overseas output authorization meaning ensuring overseas receiving information of strict security measures of root policy instructions for processing].
The weight of each word in the document vector is calculated with the TF-IDF algorithm; the word weights are compared and the words are sorted by weight in descending order to obtain the keywords. The keywords obtained have good category-distinguishing ability and can be used as characteristic attributes of the category.
The calculation formula is:
$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$
where the term frequency (TF) represents how often a term appears in a document $d$, and the inverse document frequency (IDF) measures the general importance of a word: the fewer documents contain the term $t$, the larger the IDF, which indicates that $t$ has good category-distinguishing ability. For the $i$-th word $t_i$, the term frequency (TF) formula is:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
In the above formula, $n_{i,j}$ is the number of occurrences of the word $t_i$ in the $j$-th file $d_j$; the denominator is the sum of the numbers of occurrences of all words in the file $d_j$, where $n_{k,j}$ denotes the number of occurrences of the $k$-th word in the file $d_j$; $tf_{i,j}$ denotes the term frequency of the word $t_i$ in the file $d_j$.
The inverse document frequency (IDF) is obtained by dividing the total number of documents by the number of documents containing the term and taking the base-10 logarithm of the quotient:
$$idf_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $idf_i$ denotes the inverse document frequency of the word $t_i$;
$|D|$ denotes the total number of files in the corpus;
$|\{j : t_i \in d_j\}|$ denotes the number of files containing the word $t_i$.
The TF-IDF algorithm filters out common words and retains important words (keywords); the data is then further cleaned using a synonym dictionary.
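The weighting and keyword ranking described above can be sketched directly from the two formulas, using the base-10 logarithm stated in the text. The function names and the three tiny segmented documents are hypothetical; a production pipeline would more likely rely on an existing library implementation, whose smoothing conventions may differ from this plain form.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            term: (count / total) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Terms sorted by descending weight (ties broken alphabetically), first k kept."""
    ranked = sorted(doc_weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical segmented clauses, one token list per document.
docs = [
    ["收集", "个人信息", "目的", "收集"],
    ["共享", "第三方", "个人信息"],
    ["安全", "措施", "个人信息"],
]
weights = tf_idf(docs)
keywords = top_keywords(weights[0], k=2)
print(keywords)
```

A term that occurs in every document, such as "个人信息" here, gets an IDF of zero and therefore a weight of zero, which is how the common words are filtered out.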
The above steps constitute the preparation stage, in which the input is part of the privacy policy data and the output is a privacy policy data sample with category and attribute labels and keywords. The data sample is obtained from the application market and classified. Keywords are characteristic attributes that can represent a category.
To handle the multi-label classification problem, the multi-label problem is converted into a multi-class problem, and a classifier is constructed with a one-vs-all strategy.
The method is as follows:
One of the classes is marked as the positive class ($y = 1$) and all other classes are marked as negative; this model is denoted $h^{(1)}(x)$.
Then, similarly, the second class is marked as the positive class ($y = 2$) and all other classes are marked as negative; this model is denoted $h^{(2)}(x)$, and so on.
Finally, a series of models is obtained, abbreviated as $h^{(i)}(x) = P(y = i \mid x)$, where $i = 1, 2, 3, \ldots, k$ and $k$ is the number of categories.
When a prediction is needed, all classifiers are run once, and for each input the output with the highest probability is selected.
Finally, a support vector machine classifier is trained according to the one-vs-all strategy:
$$h^{(i)}(x) = P(y = i \mid x), \quad i = 1, 2, 3, \ldots, k$$
where $i$ corresponds to each possible $y = i$. To make a prediction, a new value of $x$ is input into each classification model, and the $i$ that maximizes $h^{(i)}(x)$ is selected, i.e. $\max_i h^{(i)}(x)$.
The above is calculated automatically by a program, and the output is a classifier.
Finally, the classifier is used to classify the items to be classified, yielding the mapping between those items and the categories and, from it, the completeness determination for the items to be classified.
From the perspective of natural language processing technology, the method collects privacy policies from an application market, obtains characteristic attributes through analysis, and establishes a detection model through classifier training. The detection model receives a privacy policy from the application market and classifies each of its terms quickly and accurately, so that it is clear at a glance whether the privacy policy covers all the terms required by the relevant regulations. This realizes completeness detection of the privacy policy and at the same time improves its readability.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (8)
1. An automatic classification method based on Chinese privacy policy terms is characterized by comprising the following steps:
data processing: acquiring the privacy policies of a plurality of applications as a data set, manually labeling the terms of the privacy policies to obtain a labeled data set, and then cleaning the data set to obtain a training sample data set;
training data: performing feature selection on the training sample data set, selecting effective features capable of distinguishing different clauses and categories, training a classifier based on the feature vectors of the clauses of each category, and establishing a detection model;
data detection: receiving a privacy policy text through the detection model, classifying the clause content of the privacy policy text under the attributes of each category, and judging whether the privacy policy text is complete.
2. The method of claim 1, wherein the data processing comprises:
acquiring data;
establishing a data marking standard according to the requirements of laws and regulations, wherein the data marking standard comprises all terms required to be completely covered by privacy policies in the laws and regulations;
labeling the data;
removing noise words from the data, and performing word segmentation with a word segmentation tool to obtain a labeled, segmented clause data set.
3. The method of claim 2, wherein the data annotation standard comprises a number of classification categories, wherein the classification categories include at least one of first party collection/use, sharing/transfer/disclosure with third parties, data security, user access/editing/deletion methods, term changes, terms facing a particular demographic group, and other general information.
4. The method of claim 3, wherein the data annotation standard contains 7 classification categories, 50 attributes, and 91 values.
5. The method of claim 3, wherein the data training comprises:
performing feature selection on the training sample data set through the TF-IDF algorithm, wherein the calculation formula is as follows:
$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$
for the i-th word $t_i$, the TF formula is:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
in the above formula, $n_{i,j}$ is the number of occurrences of the word $t_i$ in the j-th file $d_j$; the denominator is the sum of the occurrences of all words in the file $d_j$, where $n_{k,j}$ denotes the number of occurrences of the k-th word in the file $d_j$; and $tf_{i,j}$ denotes the term frequency of the word $t_i$ in the file $d_j$;
the IDF formula is:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
wherein $idf_i$ denotes the inverse document frequency of the word $t_i$;
$|D|$ denotes the total number of files in the corpus;
$|\{j : t_i \in d_j\}|$ denotes the number of files that contain the word $t_i$.
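As a rough illustration of the TF-IDF weighting described in claim 5, the sketch below computes the weight of every word in every segmented clause. The function name `tf_idf` is hypothetical, and the use of the natural logarithm is an assumption, since the claim does not fix the base of the logarithm.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each tokenized document, a word -> TF-IDF weight mapping."""
    n_docs = len(docs)

    # Document frequency: number of documents that contain each word.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # occurrences of all words in this document
        weights.append(
            {
                word: (count / total) * math.log(n_docs / doc_freq[word])
                for word, count in counts.items()
            }
        )
    return weights
```

A word that appears in every document gets an IDF of zero under this definition, so it contributes nothing to the feature vector; this is the property that lets the weighting favor words that distinguish one clause category from another.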
6. The method of claim 5, wherein the data detection comprises:
calculating classification probability: for each category i in the privacy policy text, training a support vector machine classifier that predicts the probability that y = i, where i = 1, 2, 3, …, k and k is the number of categories;
selecting categories: for a given new input x, taking the classification category whose trained classifier predicts the highest probability that y = i as the classification category of the new input x.
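The category selection of claim 6 amounts to a one-vs-rest decision: score the input with every per-category classifier and keep the category with the highest score. In the sketch below, `linear_scorer` is a hypothetical stand-in for a trained support vector machine with probability output (a linear margin squashed by a logistic function); the weights are made up for illustration and are not taken from the patent.

```python
import math
from typing import Callable, Dict, Sequence

Scorer = Callable[[Sequence[float]], float]


def linear_scorer(weights: Sequence[float], bias: float) -> Scorer:
    """Hypothetical per-category classifier: linear margin mapped to (0, 1)."""

    def score(x: Sequence[float]) -> float:
        margin = sum(w * v for w, v in zip(weights, x)) + bias
        # Numerically stable logistic function.
        if margin >= 0:
            return 1.0 / (1.0 + math.exp(-margin))
        e = math.exp(margin)
        return e / (1.0 + e)

    return score


def select_category(x: Sequence[float], classifiers: Dict[str, Scorer]) -> str:
    """Return the category whose classifier assigns x the highest probability."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    probabilities = {label: clf(x) for label, clf in classifiers.items()}
    return max(probabilities, key=probabilities.get)
```

With k categories this requires k classifier evaluations per clause, and `x` would be the TF-IDF feature vector of the clause being classified.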
7. The method of claim 2, wherein the word segmentation tool is the jieba word segmentation tool.
8. The method of claim 2, wherein the noise words in the data are removed using the Harbin Institute of Technology (HIT) stop-word list.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011261262.5A CN112364165A (en) | 2020-11-12 | 2020-11-12 | Automatic classification method based on Chinese privacy policy terms |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112364165A (en) | 2021-02-12 |
Family
ID=74515398
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011261262.5A Pending CN112364165A (en) | 2020-11-12 | 2020-11-12 | Automatic classification method based on Chinese privacy policy terms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112364165A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109583208A (en) * | 2018-12-03 | 2019-04-05 | 华东计算技术研究所(中国电子科技集团公司第三十二研究所) | Malicious software identification method and system based on mobile application comment data |
CN109657207A (en) * | 2018-11-29 | 2019-04-19 | 爱保科技(横琴)有限公司 | The formatting processing method and processing unit of clause |
CN110413789A (en) * | 2019-07-31 | 2019-11-05 | 广西师范大学 | A kind of exercise automatic classification method based on SVM |
CN110533305A (en) * | 2019-08-12 | 2019-12-03 | 北京科技大学 | A kind of smelter work safety accident Synthetical prevention method |
CN110674289A (en) * | 2019-07-04 | 2020-01-10 | 南瑞集团有限公司 | Method, device and storage medium for judging article belonged classification based on word segmentation weight |
CN110705955A (en) * | 2019-08-22 | 2020-01-17 | 阿里巴巴集团控股有限公司 | Contract detection method and device |
Non-Patent Citations (1)
Title |
---|
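Xu Lei et al.: "Research on the Availability and Content Analysis of Mobile App Privacy Terms" (移动APP隐私条款可获得性及内容分析研究), Journal of Modern Information (《现代情报》) * |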
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113051607A (en) * | 2021-03-11 | 2021-06-29 | 天津大学 | Privacy policy information extraction method |
CN113051607B (en) * | 2021-03-11 | 2022-04-19 | 天津大学 | Privacy policy information extraction method |
CN113076538A (en) * | 2021-04-02 | 2021-07-06 | 北京邮电大学 | Method for extracting embedded privacy policy of mobile application APK file |
CN113076538B (en) * | 2021-04-02 | 2021-12-14 | 北京邮电大学 | Method for extracting embedded privacy policy of mobile application APK file |
CN113220877A (en) * | 2021-04-30 | 2021-08-06 | 天津大学 | Privacy policy compliance detection method |
CN113282955A (en) * | 2021-06-01 | 2021-08-20 | 上海交通大学 | Method, system, terminal and medium for extracting privacy information in privacy policy |
CN113326536A (en) * | 2021-06-02 | 2021-08-31 | 支付宝(杭州)信息技术有限公司 | Method and device for judging compliance of application program |
CN113723085A (en) * | 2021-08-26 | 2021-11-30 | 北京航空航天大学 | Pseudo-fuzzy detection method in privacy policy document |
CN113723085B (en) * | 2021-08-26 | 2024-05-24 | 北京航空航天大学 | Pseudo-fuzzy detection method in privacy policy document |
CN115080924A (en) * | 2022-07-25 | 2022-09-20 | 南开大学 | Software license clause extraction method based on natural language understanding |
CN115080924B (en) * | 2022-07-25 | 2022-11-15 | 南开大学 | Software license clause extraction method based on natural language understanding |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112364165A (en) | Automatic classification method based on Chinese privacy policy terms | |
CN111274365B (en) | Intelligent inquiry method and device based on semantic understanding, storage medium and server | |
CN107291780B (en) | User comment information display method and device | |
TWI653542B (en) | Method, system and device for discovering and tracking hot topics based on network media data flow | |
JP4920023B2 (en) | Inter-object competition index calculation method and system | |
AU2017200585A1 (en) | System and engine for seeded clustering of news events | |
Im et al. | Linked tag: image annotation using semantic relationships between image tags | |
WO2009134462A2 (en) | Method and system to predict the likelihood of topics | |
US20100153320A1 (en) | Method and arrangement for sim algorithm automatic charset detection | |
CN107193883B (en) | Data processing method and system | |
CN113076735B (en) | Target information acquisition method, device and server | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
CN110287292A (en) | A kind of judge's measurement of penalty irrelevance prediction technique and device | |
JP5098631B2 (en) | Mail classification system, mail search system | |
WO2023273303A1 (en) | Tree model-based method and apparatus for acquiring degree of influence of event, and computer device | |
Wagner | Privacy Policies Across the Ages: Content and Readability of Privacy Policies 1996--2021 | |
Omondiagbe et al. | Features that predict the acceptability of java and javascript answers on stack overflow | |
CN116610853A (en) | Search recommendation method, search recommendation system, computer device, and storage medium | |
CN114202443A (en) | Policy classification method, device, equipment and storage medium | |
JP3583631B2 (en) | Information mining method, information mining device, and computer-readable recording medium recording information mining program | |
Mohemad et al. | Performance analysis in text clustering using k-means and k-medoids algorithms for Malay crime documents | |
Wang et al. | A collaborative filtering algorithm fusing user-based, item-based and social networks | |
CN112434126B (en) | Information processing method, device, equipment and storage medium | |
Kotenko et al. | The intelligent system for detection and counteraction of malicious and inappropriate information on the Internet | |
Al Mahmud et al. | A New Technique to Classification of Bengali News Grounded on ML and DL Models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210212 |