CN113282748B - Automatic detection method for privacy text based on transformer - Google Patents

Automatic detection method for privacy text based on transformer

Info

Publication number
CN113282748B
Authority
CN
China
Prior art keywords
matrix
information safety
text
size
privacy policy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110471707.0A
Other languages
Chinese (zh)
Other versions
CN113282748A (en
Inventor
刘新
黄浩钰
马中昊
李广
张远明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN202110471707.0A priority Critical patent/CN113282748B/en
Publication of CN113282748A publication Critical patent/CN113282748A/en
Application granted granted Critical
Publication of CN113282748B publication Critical patent/CN113282748B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Transformer-based automatic detection technique for privacy text in Android applications. It relates to natural language processing, belongs to the technical field of computer applications, and addresses the problem that compliance of the privacy policy of an Android application with GB/T 35273-2020, Information Security Technology: Personal Information Security Specification, currently has to be judged manually. The method mainly comprises generating sentence vectors with BERT to form an embedding matrix, using the encoder part of the Transformer as a feature extractor, and using a fully connected neural network with softmax to classify and predict the result. An improved mutual attention mechanism based on the self-attention mechanism is also presented.

Description

Automatic detection method for privacy text based on transformer
Technical Field
The invention relates to natural language processing, belongs to the technical field of computer applications, and in particular relates to a Transformer-based automatic detection method for privacy text.
Background
The Transformer model was proposed by a Google team in the 2017 paper Attention Is All You Need as a new network structure for natural language processing that replaces RNNs and CNNs.
Like earlier Seq2Seq models, the Transformer consists of two parts, an encoder and a decoder, each comprising 6 blocks. Its core idea is the self-attention mechanism, which lets the model attend to different input positions when computing a representation.
Each encoder block consists of two modules: multi-head attention and a feed-forward network.
Each decoder block consists of three modules: multi-head attention, masked multi-head attention, and a feed-forward network.
The Transformer is built purely on the attention mechanism. It does not rely on recurrence or convolution, and its computation can be executed in parallel, so it is fast and efficient. In this respect it outperforms earlier RNN models, whose structure depends on sequential processing. Compared with RNNs and CNNs, the Transformer achieves better results on translation tasks, and it laid the foundation for the later BERT model.
In the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Google proposed the BERT model, whose architecture is based on a multi-layer Transformer structure and which introduces a new bidirectional concept on that basis.
BERT introduces the Masked LM and uses a bidirectional LM together with Next Sentence Prediction for model pre-training.
The work shows that larger models train better. To support transfer learning across multiple tasks, it introduces general-purpose input and output layers for downstream tasks, which avoids heavily modifying the model or customizing a new model for each task.
The Information Security Technology: Personal Information Security Specification specifies the principles and security requirements that Android applications must comply with when collecting, storing, using, and otherwise processing users' personal information, and also specifies the standard wording that the privacy policy of an application must follow. The release of this document also corrects some unreasonable practices in current application operation and management, and further organizes and improves the various clauses of privacy policy texts.
Disclosure of Invention
The invention provides a Transformer-based automatic detection method for privacy text. It solves the problem that compliance of the privacy policies of Android applications with GB/T 35273-2020, Information Security Technology: Personal Information Security Specification, currently has to be judged manually. The code is written in Python.
To achieve the above object, the Transformer-based automatic detection method for privacy text comprises the following steps:
S1: Input the initial text of the privacy policy. The text is first split at periods and sequence numbers, each sentence is converted into a vector by BERT, and the vectors are assembled into an embedding matrix. A feature extractor consisting of two layers of Transformer Encoder, followed by normalization, a fully connected neural network, and a softmax function, produces the prediction and outputs the label vector with the maximum probability. The label vectors of the sentences are then concatenated to form an embedding matrix representation with complete semantics;
S2: The standard sentences in the Information Security Technology: Personal Information Security Specification are represented as an embedding matrix through BERT;
S3: A multi-head mutual attention mechanism, based on the multi-head self-attention mechanism, lets the embedding matrix of the privacy policy interact with the embedding matrix of the Personal Information Security Specification to form a vector with more complete semantics;
S4: The vector produced by the interaction layer is concatenated with the embedding matrix of the Personal Information Security Specification and trained through a fully connected neural network. Finally, a softmax function performs binary classification to predict whether the privacy policy meets the requirements of the Personal Information Security Specification.
In step S1, the initial text of the privacy policy is input. Because the sentences in policy texts are highly standardized, splitting the text into sentences at sequence numbers and periods and passing them through the BERT model yields a document-level embedding matrix composed of sentence vectors.
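The sentence-splitting part of step S1 can be illustrated with a minimal Python sketch. The function name `split_policy`, the regular expressions, and the sample text are assumptions made for illustration; the patent does not state which sequence-number formats are recognized, and the BERT encoding of each clause is deliberately left out.

```python
import re

# Hypothetical sequence-number formats at the start of a clause:
# "1、", "2.", "3)" or "（4）". The patent does not enumerate them.
_SEQ_NO = re.compile(r"^\s*(?:[（(]\d+[）)]|\d+[、.)）])\s*")

def split_policy(text):
    """Split privacy-policy text into clauses at periods and sequence numbers."""
    # Start a new piece before an in-line marker such as "2、".
    text = re.sub(r"(?<!\d)(?=\d+、)", "\n", text)
    # Break at Chinese full stops and at line breaks.
    pieces = re.split(r"[。\n]+", text)
    clauses = []
    for piece in pieces:
        clause = _SEQ_NO.sub("", piece).strip()  # drop the leading number
        if clause:
            clauses.append(clause)
    return clauses

# Sample policy text (three short clauses about phone numbers, location
# sharing, and account deletion); purely illustrative.
sample = "1、我们会收集您的手机号码。2、我们不会向第三方共享您的位置信息。\n（3）您可以随时注销账号。"
clauses = split_policy(sample)
```

Each element of `clauses` would then be encoded by BERT into one 768-dimensional row of the document-level embedding matrix, so the number of clauses gives the row count m.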
An Encoder layer that stacks 6 encoders through two layers of Transformer Encoder serves as the feature extractor. A Transformer Encoder consists of two parts: Multi-Head Attention and a Feed Forward Neural Network. The input dimension of the feature extractor is 768, and the feed-forward neural network part has a dimension of 3072; its output is normalized and converted back to 768 dimensions. Finally, the result passes through the fully connected neural network and is classified after the fully connected layers using a softmax function defined as follows:
$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$
where C is the total number of categories, i is the index of the current category, $S_i$ is the probability value of the current dimension, and $V_i$ is the output value of the fully connected layer. The softmax function computes the probability that the vector belongs to each class; each value lies in [0, 1], all probabilities sum to 1, and the label with the maximum probability is selected as the output. The semantic integrity analysis method based on the Transformer Encoder converts the semantic integrity analysis problem into a classification problem, which makes it possible to extract semantic, grammatical, and other features of sentences effectively and to form a semantically complete embedding matrix, denoted $A_{m\times 768}$, where m is the number of privacy policy clauses and 768 is the length of each vector output by BERT.
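As a small worked example of the softmax definition above, the following Python sketch computes the class probabilities and picks the label with the maximum probability. The input values are made up, and subtracting the maximum before exponentiation is a common numerical safeguard rather than something the patent describes.

```python
import math

def softmax(values):
    """Return S_i = exp(V_i) / sum_j exp(V_j) for fully connected outputs V."""
    # Shifting by the maximum does not change the result but avoids overflow.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(values):
    """Index of the class with the largest softmax probability."""
    probs = softmax(values)
    return max(range(len(probs)), key=probs.__getitem__)

logits = [2.0, 0.5, -1.0]      # hypothetical outputs V_i for C = 3 classes
probs = softmax(logits)        # every value lies in [0, 1]; they sum to 1
label = predict_label(logits)  # index of the most probable class
```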
In step S2, the standard sentences in the Information Security Technology: Personal Information Security Specification are each passed through the BERT model to obtain an embedding matrix composed of sentence vectors, denoted $B_{1\times 768}$, where 768 is the length of each vector output by BERT.
The input of step S3 is the embedding matrix representations of the privacy policy and of the Personal Information Security Specification. An interaction layer, formed by an 8-head mutual attention mechanism that improves on the Transformer's self-attention mechanism, lets the two representations interact to obtain richer interaction information.
Let matrix A be the embedding matrix of the privacy policy, of size m×768, and matrix B the embedding matrix of the Personal Information Security Specification, of size 1×768. Compute $Q_i = BW_i^Q$, $K_i = AW_i^K$, and $V_i = AW_i^V$, where $W_i^Q$, $W_i^K$, and $W_i^V$ are all matrices of size 768×64. The resulting $Q_i$ is a vector of size 1×64, and $K_i$ and $V_i$ are matrices of size m×64.
Here W is a weight matrix of very small, randomly distributed values; $W_i^Q$ refers to the i-th dimension of the W matrix, which is a vector, and the same holds for $W_i^K$ and $W_i^V$. The attention mechanism then computes the similarity and scales the result by the dimension of K, so that the inner product does not become too large. The formula is as follows:
$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
The 8-head mutual attention mechanism repeats the above steps 8 times to compute $Z_1, \ldots, Z_8$, each a vector of size 1×64. The resulting vectors are concatenated horizontally and multiplied by a weight matrix $W_0$ of size 512×768. After this interaction, a vector $Z_{1\times 768}$ with more complete semantics and richer context is obtained.
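The interaction in step S3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the random matrices stand in for BERT embeddings and trained weights, the helper names are hypothetical, and the scores are scaled by the square root of d_k as in the standard Transformer, whereas the text only says that the result is divided according to the dimension of K.

```python
import math
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng, scale=0.01):
    """Small random values standing in for trained parameters or embeddings."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def mutual_attention_head(A, B, Wq, Wk, Wv):
    """One head: the query comes from B (standard), keys and values from A (policy)."""
    Q = matmul(B, Wq)   # 1 x d_k
    K = matmul(A, Wk)   # m x d_k
    V = matmul(A, Wv)   # m x d_k
    d_k = len(K[0])
    # Similarity of the standard sentence to each policy clause (1 x m).
    # Assumption: scaled by sqrt(d_k), as in the standard Transformer.
    scores = [sum(q * k for q, k in zip(Q[0], row)) / math.sqrt(d_k) for row in K]
    weights = softmax(scores)
    return matmul([weights], V)   # Z_i, 1 x d_k

def multi_head_mutual_attention(A, B, heads, W0):
    """Concatenate the per-head outputs Z_1..Z_h and project them with W0."""
    parts = [mutual_attention_head(A, B, *head)[0] for head in heads]
    concat = [x for part in parts for x in part]   # 1 x (h * d_k)
    return matmul([concat], W0)                    # 1 x d_model

rng = random.Random(0)
m, d_model, d_k, num_heads = 3, 768, 64, 8
A = random_matrix(m, d_model, rng, scale=1.0)   # stand-in for the policy matrix
B = random_matrix(1, d_model, rng, scale=1.0)   # stand-in for the standard sentence
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(num_heads)]
W0 = random_matrix(num_heads * d_k, d_model, rng)
Z = multi_head_mutual_attention(A, B, heads, W0)
```

The only difference from self-attention in this sketch is where the inputs come from: the query is computed from B while the keys and values are computed from A, so the output has one row regardless of the number of clauses m.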
In step S4, $Z_{1\times 768}$ is concatenated with $B_{1\times 768}$ using the formula X = [Z; B; Z-B; Z*B], where Z*B is element-wise multiplication, which emphasizes what the two text sequences have in common, and Z-B emphasizes where the two text sequences differ.
Vector X is passed through a fully connected neural network, and the output of a softmax function predicts whether the standard is met.
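The concatenation and classification in step S4 might look like the following minimal Python sketch. The single fully connected layer, the random weights, and the convention that class 1 means "meets the standard" are assumptions for illustration; the patent does not specify the depth of the network or the label order.

```python
import math
import random

def splice(Z, B):
    """Build X = [Z; B; Z-B; Z*B] from two vectors of equal length."""
    if len(Z) != len(B):
        raise ValueError("Z and B must have the same length")
    diff = [z - b for z, b in zip(Z, B)]   # highlights differences
    prod = [z * b for z, b in zip(Z, B)]   # highlights agreement
    return Z + B + diff + prod

def classify(X, W, bias):
    """One fully connected layer (one weight row per class) followed by softmax."""
    logits = [sum(x * w for x, w in zip(X, row)) + b for row, b in zip(W, bias)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
Z = [rng.uniform(-1, 1) for _ in range(768)]   # stand-in for the interaction output
B = [rng.uniform(-1, 1) for _ in range(768)]   # stand-in for the standard sentence
X = splice(Z, B)                               # length 4 * 768 = 3072
W = [[rng.uniform(-0.01, 0.01) for _ in range(len(X))] for _ in range(2)]
probs = classify(X, W, [0.0, 0.0])
# Hypothetical convention: index 1 means the clause meets the standard.
compliant = probs.index(max(probs)) == 1
```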
Drawings
FIG. 1 is a process diagram of an implementation of the present invention
FIG. 2 is a schematic diagram of BERT-based structure in the present invention
FIG. 3 is a schematic diagram of a semantic integrity analysis model structure according to the present invention
FIG. 4 is a schematic diagram of a feature extractor model in accordance with the present invention
FIG. 5 is a schematic diagram of a mutual attention mechanism model structure based on the self-attention mechanism in the present invention
Detailed Description
The invention aims to automatically detect whether the privacy policy text of an Android application complies with GB/T 35273-2020, Information Security Technology: Personal Information Security Specification. The specific embodiment is as follows:
S1: The initial text of the privacy policy is input and first divided into m sentences at periods and sequence numbers. The sentences are converted into vectors by BERT in turn, and the vectors are assembled into an embedding matrix. A semantic integrity analysis module consisting of 6 encoder blocks in two layers of Transformer Encoder, followed by normalization, a fully connected neural network, and a softmax function, produces the prediction and outputs the label vector with the maximum probability. The label vectors of the sentences are concatenated to form a semantically complete embedding matrix representation $A_{m\times 768}$.
S2: The standard sentences in the Information Security Technology: Personal Information Security Specification are represented through BERT as an embedding matrix, denoted $B_{1\times 768}$.
S3: The privacy policy embedding matrix $A_{m\times 768}$ and the embedding matrix $B_{1\times 768}$ of the Personal Information Security Specification interact through an 8-head mutual attention mechanism based on the multi-head self-attention mechanism. The interaction formulas are as follows:
$$Q_i = BW_i^Q, \quad K_i = AW_i^K, \quad V_i = AW_i^V$$
$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
$$Z = \mathrm{concat}(Z_1, \ldots, Z_8) \cdot W_0$$
forming a vector $Z_{1\times 768}$ with more complete semantics.
S4: The vector $Z_{1\times 768}$ produced by the interaction layer and the embedding matrix $B_{1\times 768}$ of the Personal Information Security Specification are concatenated using the formula X = [Z; B; Z-B; Z*B] to obtain vector X, which is trained through a fully connected neural network. Finally, a softmax function classifies X to obtain the probabilities of compliance and non-compliance, and the class with the maximum probability is output, which predicts whether the privacy policy meets the requirements of the Personal Information Security Specification.
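To make the dimension bookkeeping of steps S1 to S4 easier to check, the sizes stated in the description can be traced with a short Python helper. The function name `trace_shapes` and the dictionary keys are hypothetical; the numbers themselves (768, 64, 8 heads, 512×768) come from the text, and the example value m = 12 is arbitrary.

```python
def trace_shapes(m, d_model=768, d_k=64, num_heads=8):
    """Return the (rows, columns) size of each quantity for a policy with m clauses."""
    return {
        "A": (m, d_model),                 # privacy policy embedding matrix
        "B": (1, d_model),                 # standard sentence embedding
        "Q_i": (1, d_k),                   # B times a d_model x d_k matrix
        "K_i": (m, d_k),                   # A times a d_model x d_k matrix
        "V_i": (m, d_k),
        "scores_i": (1, m),                # Q_i times the transpose of K_i
        "Z_i": (1, d_k),                   # attention weights times V_i
        "concat": (1, num_heads * d_k),    # Z_1 ... Z_8 side by side
        "W_0": (num_heads * d_k, d_model),
        "Z": (1, d_model),                 # concat times W_0
        "X": (1, 4 * d_model),             # [Z; B; Z-B; Z*B]
    }

shapes = trace_shapes(m=12)
```

With the sizes given in the description, the concatenated heads have 8 × 64 = 512 columns, which is why $W_0$ is 512×768, and the spliced vector X fed to the fully connected network has 4 × 768 = 3072 entries.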

Claims (3)

1. A Transformer-based automatic detection method for the privacy text of Android applications, characterized by comprising the following steps:
S1: inputting the initial text of the privacy policy, splitting it into sentences, and converting it into an embedding matrix with semantics through a semantic integrity analysis module;
S2: inputting the standard text of the Information Security Technology: Personal Information Security Specification, and representing the standard sentences of the standard text as an embedding matrix; the embedding matrix of the privacy policy interacts with the embedding matrix of the Personal Information Security Specification;
S3: concatenating the semantic vector obtained after the interaction with the embedding matrix of the standard sentences, and classifying and predicting the result through neural network training;
wherein step S2 specifically comprises the following steps:
an interaction layer, formed by an 8-head mutual attention mechanism that improves on the Transformer's self-attention mechanism, carries out the interaction between the embedding matrix of the privacy policy and the embedding matrix of the Personal Information Security Specification;
matrix A denotes the embedding matrix of the privacy policy, of size m×768, and matrix B denotes the embedding matrix of the Personal Information Security Specification, of size 1×768; $Q_i = BW_i^Q$, $K_i = AW_i^K$, and $V_i = AW_i^V$ are computed, where $W_i^Q$, $W_i^K$, and $W_i^V$ are all matrices of size 768×64, the resulting $Q_i$ is a vector of size 1×64, $K_i$ and $V_i$ are matrices of size m×64, and m is the number of clauses of the privacy policy; $W_i^Q$ refers to the i-th dimension of the W matrix, which is a vector, and the same holds for $W_i^K$ and $W_i^V$; W is a weight matrix of very small, randomly distributed values; i denotes the i-th mutual attention mechanism;
the attention mechanism then computes the similarity and scales the result by the dimension of K, with the following calculation formula:
$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
where $d_k$ denotes the dimension of K;
A and B are then input into the mutual attention mechanism and the above steps are repeated 8 times to compute $Z_1, \ldots, Z_8$, each a vector of size 1×64; the resulting vectors are concatenated horizontally and multiplied by a weight matrix $W_0$ of size 512×768 to obtain the semantic vector after interaction, denoted $Z_{1\times 768}$.
2. The Transformer-based automatic detection method for the privacy text of Android applications according to claim 1, wherein in step S1 the BERT model is used to represent the privacy policy document hierarchically by clause: words are first composed into sentences, and the sentences are then composed into an embedding matrix.
3. The Transformer-based automatic detection method for the privacy text of Android applications according to claim 1, wherein in step S1 the Encoder part of the Transformer is used as a feature extractor to extract the semantic and grammatical features of sentences, thereby preserving contextual semantics.
CN202110471707.0A 2021-04-29 2021-04-29 Automatic detection method for privacy text based on transformer Active CN113282748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110471707.0A CN113282748B (en) 2021-04-29 2021-04-29 Automatic detection method for privacy text based on transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110471707.0A CN113282748B (en) 2021-04-29 2021-04-29 Automatic detection method for privacy text based on transformer

Publications (2)

Publication Number Publication Date
CN113282748A CN113282748A (en) 2021-08-20
CN113282748B true CN113282748B (en) 2023-05-12

Family

ID=77277662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110471707.0A Active CN113282748B (en) 2021-04-29 2021-04-29 Automatic detection method for privacy text based on transformer

Country Status (1)

Country Link
CN (1) CN113282748B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538906A (en) * 2020-05-29 2020-08-14 支付宝(杭州)信息技术有限公司 Information pushing method and device based on privacy protection

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9887913B2 (en) * 2015-07-10 2018-02-06 Telefonaktiebolaget L M Ericsson (Publ) CCN name chaining
CN111753322B (en) * 2020-07-03 2021-10-01 烟台中科网络技术研究所 Automatic verification method and system for mobile App permission list
CN112308370B (en) * 2020-09-16 2024-03-05 湘潭大学 Automatic subjective question scoring method for thinking courses based on Transformer
CN112214591B (en) * 2020-10-29 2023-11-07 腾讯科技(深圳)有限公司 Dialog prediction method and device

Also Published As

Publication number Publication date
CN113282748A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN111611377B (en) Knowledge distillation-based multi-layer neural network language model training method and device
CN110059188B (en) Chinese emotion analysis method based on bidirectional time convolution network
CN111401077B (en) Language model processing method and device and computer equipment
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN113128214B (en) Text abstract generation method based on BERT pre-training model
CN110298403A (en) The sentiment analysis method and system of enterprise dominant in a kind of financial and economic news
CN112446215B (en) Entity relation joint extraction method
Patel et al. Deep learning for natural language processing
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN114781392A (en) Text emotion analysis method based on BERT improved model
CN114462420A (en) False news detection method based on feature fusion model
CN113705315A (en) Video processing method, device, equipment and storage medium
Boudad et al. Exploring the use of word embedding and deep learning in arabic sentiment analysis
CN114416969A (en) LSTM-CNN online comment sentiment classification method and system based on background enhancement
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN113282748B (en) Automatic detection method for privacy text based on transformer
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN114399646B (en) Image description method and device based on transform structure
Deng et al. Towards learning a joint representation from transformer in multimodal emotion recognition
CN113128199B (en) Word vector generation method based on pre-training language model and multiple word information embedding
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN113255360A (en) Document rating method and device based on hierarchical self-attention network
CN114254175A (en) Method for extracting generative abstract of power policy file
CN114064888A (en) Financial text classification method and system based on BERT-CNN
Sun et al. Text sentiment polarity classification method based on word embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210820

Assignee: Beijing Zhilian start Information Technology Co.,Ltd.

Assignor: XIANGTAN University

Contract record no.: X2023980054644

Denomination of invention: Transformer based automatic detection method for private text

Granted publication date: 20230512

License type: Common License

Record date: 20231229

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210820

Assignee: Hunan Jiuzhang Zhiyun Technology Co.,Ltd.

Assignor: XIANGTAN University

Contract record no.: X2024980000475

Denomination of invention: Transformer based automatic detection method for private text

Granted publication date: 20230512

License type: Common License

Record date: 20240115