CN113282748B

CN113282748B - Automatic detection method for privacy text based on transformer

Info

Publication number: CN113282748B
Application number: CN202110471707.0A
Authority: CN
Inventors: 刘新; 黄浩钰; 马中昊; 李广; 张远明
Original assignee: Xiangtan University
Current assignee: Xiangtan University
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2023-05-12
Anticipated expiration: 2041-04-29
Also published as: CN113282748A

Abstract

The invention discloses a Transformer-based technique for automatically detecting privacy text in Android applications. It relates to natural language processing, belongs to the technical field of computer applications, and addresses the problem of determining whether the privacy policy of an Android application complies with the standards in GB/T 35273-2020 "Information Security Technology - Personal Information Security Specification". The method mainly comprises generating sentence vectors through BERT to form an embedding matrix, using the encoder part of a Transformer as a feature extractor, and using a fully connected neural network and softmax to classify and predict the result. An improved mutual attention mechanism based on the self-attention mechanism is also presented.

Description

Automatic detection method for privacy text based on transformer

Technical Field

The invention relates to natural language processing, belongs to the technical field of computer applications, and in particular relates to a Transformer-based automatic detection method for privacy text.

Background

The Transformer model was proposed by a Google team in the paper "Attention Is All You Need", published in 2017, as a new network structure for natural language processing that replaces RNNs and CNNs.

Like earlier Seq2Seq models, the Transformer consists of two parts, an encoder and a decoder, each comprising 6 blocks. Its core idea is the self-attention mechanism, which lets the model attend to different input positions when computing a representation.

The encoder consists of two modules: multi-head attention and a feed-forward network.

The decoder consists of three modules: multi-head attention, masked multi-head attention, and a feed-forward network.

The Transformer is built purely on the attention mechanism. It does not depend on recurrence or convolution, and its computation can be executed fully in parallel, so it is fast and efficient. In this respect it outperforms earlier RNN models, whose structure depends on sequential processing. Compared with RNNs and CNNs, the Transformer achieves better results on translation tasks, and it laid the foundation for the later BERT model.

In the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", Google proposed the BERT model, whose architecture is based on a multi-layer Transformer structure and which introduced a new bidirectional concept on that basis.

BERT introduces a masked language model (Masked LM), using a bidirectional LM and Next Sentence Prediction for model pre-training.

BERT showed that larger models train to better results. To support transfer learning across multiple tasks, it introduces very general input and output layers for downstream tasks, which avoids heavily modifying the model or designing a new one for each task.

The "Information Security Technology - Personal Information Security Specification" stipulates the principles and security requirements that Android applications must follow when collecting, storing, using, and otherwise processing users' personal information, and also sets out the standard wording that privacy policies in applications must reference. Its release also remedies some unreasonable practices in current application operation and management, and further organizes and refines the various clauses of privacy policy texts.

Disclosure of Invention

The invention provides a Transformer-based automatic detection method for privacy text. It solves the problem that compliance of privacy policies in Android applications with the standards in GB/T 35273-2020 "Information Security Technology - Personal Information Security Specification" currently has to be judged manually. The code is written in Python.

To achieve the above object, the Transformer-based automatic detection method for privacy text comprises the following steps:

S1: input the initial text of the privacy policy; split it into sentences at periods and sequence numbers; convert each sentence in turn into a vector with BERT and assemble the vectors into an embedding matrix; pass the matrix through a feature extractor consisting of a two-layer Transformer Encoder, followed by normalization, a fully connected neural network, and a softmax function, to obtain a prediction; output the label vector corresponding to the maximum probability; and concatenate the label vectors of the sentences to form a semantically complete embedding matrix representation;

S2: represent each standard sentence in the "Information Security Technology - Personal Information Security Specification" as an embedding matrix through BERT;

S3: let the embedding matrix of the privacy policy and the embedding matrix of the Personal Information Security Specification interact through a multi-head mutual attention mechanism, which is based on the multi-head self-attention mechanism, to form a vector with more complete semantics;

S4: concatenate the vector formed by the interaction layer with the embedding matrix of the Personal Information Security Specification, train on the result through a fully connected neural network, and finally perform binary classification with a softmax function, thereby predicting whether the privacy policy meets the requirements of the Personal Information Security Specification.

In step S1, the initial text of the privacy policy is input. Because sentences in policy texts follow a highly regular format, the text can be split into sentences at sequence numbers and periods, and a document-level embedding matrix composed of sentence vectors can then be obtained through the BERT model.
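The clause-splitting part of step S1 might be sketched as follows. This is only a minimal illustration: the function name `split_policy` and the regular expression are assumptions, not taken from the patent, and the pattern recognizes only a few common marker styles (such as `1.`, `2、`, or `（一）`) together with Chinese and Western full stops.

```python
import re

# Hypothetical heuristic, not from the patent: a clause marker is a one- or
# two-digit number or a Chinese numeral followed by "、", ".", ")" or "）",
# optionally opened by a bracket, at the start of the text or right after
# whitespace or a full stop.
_MARKER = r"(?:^|(?<=[\s。.]))[（(]?(?:\d{1,2}|[一二三四五六七八九十]{1,3})[、.)）]\s*"
_SPLIT = re.compile(_MARKER + r"|[。.]")


def split_policy(text):
    """Split privacy-policy text into sentences at sequence numbers and periods."""
    parts = _SPLIT.split(text)
    # Drop the empty fragments left between adjacent separators.
    return [part.strip() for part in parts if part.strip()]
```

Each returned sentence would then be passed to BERT to obtain one 768-dimensional row of the embedding matrix. A heuristic like this will mis-split abbreviations or decimal numbers, so a real policy corpus would likely need a more careful rule set.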

A two-layer Transformer Encoder layer stacking 6 encoders serves as the feature extractor. Each Transformer Encoder consists of two parts: Multi-Head Attention and a Feed Forward Neural Network. The input to the feature extractor has 768 dimensions; the feed-forward neural network part has 3072 dimensions, and its output is normalized and converted back to 768 dimensions. The result is finally passed through a fully connected neural network and classified after the fully connected layer with a softmax function, defined as follows:

$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$

where $C$ is the total number of categories, $i$ is the index of the current category, $S_i$ is the probability value for the current category, and $V_i$ is the output value of the fully connected layer. The softmax function calculates the probability that the vector belongs to each class; each value lies in [0, 1], all probabilities sum to 1, and the label corresponding to the maximum probability value is selected as the output. The Transformer Encoder-based semantic integrity analysis method treats semantic integrity analysis as a classification problem, so that semantic, grammatical, and other features in sentences can be extracted effectively and a semantically complete embedding matrix, denoted $A_{m \times 768}$, can be formed, where $m$ is the number of privacy policy clauses and 768 is the length of each vector output by BERT.
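The softmax classification step described above can be illustrated with a small sketch in plain Python. The function names `softmax` and `predict_label` are chosen here for illustration, and subtracting the maximum score is a standard numerical-stability measure that the patent does not mention; it does not change the result.

```python
import math


def softmax(values):
    """Return S_i = exp(V_i) / sum_j exp(V_j) for a list of scores V."""
    peak = max(values)  # shifting by the maximum avoids overflow in exp()
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(values):
    """Return the index of the most probable category and all probabilities."""
    probs = softmax(values)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs
```

For example, `predict_label([1.0, 3.0, 2.0])` selects index 1, since the second score is the largest and softmax preserves the ordering of its inputs.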

In step S2, each standard sentence in the "Information Security Technology - Personal Information Security Specification" is passed through the BERT model to obtain an embedding matrix composed of its sentence vector, denoted $B_{1 \times 768}$, where 768 is the length of each vector output by BERT.

The input of step S3 is the embedding matrix representations of the privacy policy and of the Personal Information Security Specification. An interaction layer formed by an 8-head mutual attention mechanism, improved from the self-attention mechanism of the Transformer, lets the two representations interact to obtain more complete interaction information.

Let matrix $A$ denote the embedding matrix of the privacy policy, of size $m \times 768$, and matrix $B$ denote the embedding matrix of the Personal Information Security Specification, of size $1 \times 768$. Compute $Q_i = B W_i^Q$, $K_i = A W_i^K$, and $V_i = A W_i^V$, where $W_i^Q$, $W_i^K$, and $W_i^V$ are all matrices of size $768 \times 64$. The resulting $Q_i$ is a vector of size $1 \times 64$, and $K_i$ and $V_i$ are matrices of size $m \times 64$.

Here $W$ is a weight matrix whose values are randomly distributed and very small; $W_i^Q$ refers to the $i$-th dimension of the $W$ matrix, which is a vector, and likewise for $W_i^K$ and $W_i^V$. The similarity computed by the attention mechanism is then scaled by the dimension of $K$ so that the inner product does not become too large. The formula is as follows:

$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$

where $d_k$ is the dimension of $K$.

The 8-head mutual attention mechanism repeats the above steps 8 times to compute $Z_1, \dots, Z_8$, each a vector of size $1 \times 64$. The resulting vectors are concatenated horizontally and multiplied by a weight matrix $W^O$ of size $512 \times 768$. After this interaction, a vector $Z_{1 \times 768}$ with more complete semantics and richer context is obtained.
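The mutual attention computation in step S3 can be sketched in plain Python as below. This is a shape-level illustration under stated assumptions, not the patented implementation: the weights are drawn at random from a fixed seed rather than trained, the scaling is assumed to be the standard division by the square root of d_k, and the helper names (`matmul`, `random_matrix`, `mutual_attention`) are hypothetical.

```python
import math
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in X]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.02):
    """Small random weights standing in for trained parameters (assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def mutual_attention(A, B, heads=8, d_head=64, seed=0):
    """A: m x d policy matrix, B: 1 x d standard sentence. Returns Z of size 1 x d."""
    d_model = len(B[0])
    rng = random.Random(seed)
    head_outputs = []
    for _ in range(heads):
        W_q = random_matrix(d_model, d_head, rng)
        W_k = random_matrix(d_model, d_head, rng)
        W_v = random_matrix(d_model, d_head, rng)
        Q = matmul(B, W_q)  # 1 x d_head, query comes from the standard sentence
        K = matmul(A, W_k)  # m x d_head, keys come from the privacy policy
        V = matmul(A, W_v)  # m x d_head, values come from the privacy policy
        K_T = [list(col) for col in zip(*K)]  # d_head x m
        scores = [s / math.sqrt(d_head) for s in matmul(Q, K_T)[0]]
        weights = softmax(scores)  # one weight per policy clause
        head_outputs.append(matmul([weights], V)[0])  # Z_i, d_head values
    concat = [[v for head in head_outputs for v in head]]  # 1 x (heads * d_head)
    W_o = random_matrix(heads * d_head, d_model, rng)
    return matmul(concat, W_o)  # 1 x d_model
```

With the sizes given in the text (768-dimensional inputs, 8 heads of 64 dimensions), the concatenated vector has 512 entries and the output projection maps it back to 768. The difference from self-attention is that the query is computed from B while the keys and values are computed from A.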

In step S4, $Z_{1 \times 768}$ and $B_{1 \times 768}$ are concatenated using the formula $X = [Z; B; Z - B; Z * B]$, where $Z * B$ is element-wise multiplication, which emphasizes what the two text sequences have in common, and $Z - B$ emphasizes where the two text sequences differ.

Vector X is passed through a fully connected neural network and output through a softmax function to predict whether the criteria are met.
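The concatenation and classification in step S4 might look like the following sketch. The function names, the single fully connected layer, and the weight layout (one weight row per class) are assumptions made for illustration; the patent does not specify the layer sizes of the classifier or which class index denotes compliance.

```python
import math


def build_features(Z, B):
    """Return X = [Z; B; Z - B; Z * B] for two equal-length vectors."""
    if len(Z) != len(B):
        raise ValueError("Z and B must have the same length")
    difference = [z - b for z, b in zip(Z, B)]  # highlights where the texts differ
    product = [z * b for z, b in zip(Z, B)]     # highlights what the texts share
    return list(Z) + list(B) + difference + product


def classify(X, weights, biases):
    """One fully connected layer followed by softmax over the classes."""
    logits = [sum(w * x for w, x in zip(row, X)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs
```

For 768-dimensional Z and B, `build_features` returns a vector of 3072 entries, which is the input size the fully connected layer would need to accept.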

Drawings

FIG. 1 is a process diagram of an implementation of the present invention

FIG. 2 is a schematic diagram of BERT-based structure in the present invention

FIG. 3 is a schematic diagram of a semantic integrity analysis model structure according to the present invention

FIG. 4 is a schematic diagram of a feature extractor model in accordance with the present invention

FIG. 5 is a schematic diagram of a mutual attention mechanism model structure based on the self-attention mechanism in the present invention

Detailed Description

The invention aims to automatically detect whether the privacy policy text in an Android application complies with the standards in GB/T 35273-2020 "Information Security Technology - Personal Information Security Specification". The specific embodiment is as follows:

S1: The initial text of the privacy policy is input and first divided into m sentences at periods and sequence numbers. The sentences are converted in turn into vectors through BERT and assembled into an embedding matrix, which passes through a semantic integrity analysis module consisting of 6 encoding blocks of a two-layer Transformer Encoder, followed by normalization, a fully connected neural network, and a softmax function to obtain a prediction. The label vector corresponding to the maximum probability is output, and the label vectors of the sentences are concatenated to form a semantically complete embedding matrix representation $A_{m \times 768}$;

S2: Each standard sentence in the "Information Security Technology - Personal Information Security Specification" is represented through BERT as an embedding matrix, denoted $B_{1 \times 768}$;

S3: The privacy policy embedding matrix $A_{m \times 768}$ and the embedding matrix $B_{1 \times 768}$ of the Personal Information Security Specification interact through an 8-head mutual attention mechanism based on the multi-head self-attention mechanism, where the interaction formulas are:

$$Q_i = B W_i^Q, \quad K_i = A W_i^K, \quad V_i = A W_i^V$$

$$Z = \mathrm{concat}(Z_1, \dots, Z_8) \cdot W^O$$

forming a vector $Z_{1 \times 768}$ with more complete semantics;

S4: The vector $Z_{1 \times 768}$ formed by the interaction layer and the embedding matrix $B_{1 \times 768}$ of the Personal Information Security Specification are concatenated using the formula $X = [Z; B; Z - B; Z * B]$ to obtain vector $X$, which is trained through a fully connected neural network. Finally, a softmax function classifies $X$ to obtain the probability of compliance and of non-compliance, and the class corresponding to the maximum probability value is output, thereby predicting whether the privacy policy meets the requirements of the Personal Information Security Specification.

Claims

1. The automatic detection method for the private text of the Android application program based on the Transformer is characterized by comprising the following steps of:

S1, inputting the initial text of a privacy policy, splitting it into sentences, and converting it through a semantic integrity analysis module into an embedding matrix that carries semantics;

S2, inputting the standard text of the "Information Security Technology - Personal Information Security Specification" and representing each standard sentence of the standard text as an embedding matrix; the embedding matrix of the privacy policy then interacts with the embedding matrix of the Personal Information Security Specification;

S3, concatenating the semantic vector obtained after the interaction with the embedding matrix of the standard sentence, and classifying and predicting the result through neural network training;

wherein step S2 specifically comprises the following:

an interaction layer formed by an 8-head mutual attention mechanism, improved from the self-attention mechanism of the Transformer, performs the interaction between the embedding matrix of the privacy policy and the embedding matrix of the Personal Information Security Specification;

matrix $A$ denotes the embedding matrix of the privacy policy, of size $m \times 768$, and matrix $B$ denotes the embedding matrix of the Personal Information Security Specification, of size $1 \times 768$; $Q_i = B W_i^Q$, $K_i = A W_i^K$, and $V_i = A W_i^V$ are calculated, where $W_i^Q$, $W_i^K$, and $W_i^V$ are all matrices of size $768 \times 64$; the resulting $Q_i$ is a vector of size $1 \times 64$, and $K_i$ and $V_i$ are matrices of size $m \times 64$; $m$ represents the number of clauses of the privacy policy; $W_i^Q$ refers to the $i$-th dimension of the $W$ matrix, which is a vector, and likewise for $W_i^K$ and $W_i^V$; $W$ is a weight matrix whose values are randomly distributed and very small; $i$ denotes the $i$-th mutual attention mechanism;

the similarity computed by the attention mechanism is then scaled by the dimension of $K$, with the calculation formula:

$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$

where $d_k$ represents the dimension of $K$;

$A$ and $B$ are then input into the mutual attention mechanism and the above steps are repeated 8 times to compute $Z_1, \dots, Z_8$, each a vector of size $1 \times 64$; the resulting vectors are concatenated horizontally and multiplied by a weight matrix $W^O$ of size $512 \times 768$ to obtain the semantic vector after interaction, denoted $Z_{1 \times 768}$.

2. The automatic detection method for the privacy text of the Android application program based on the Transformer according to claim 1, wherein in step S1 the BERT model is used to represent the privacy policy document hierarchically by clause: words are first composed into sentences, and the sentences are then composed into an embedding matrix.

3. The automatic detection method for the privacy text of the Android application program based on the Transformer according to claim 1, wherein in step S1 the Encoder part of the Transformer is used as a feature extractor to extract semantic and grammatical features in sentences, so that the semantics are preserved together with their context.