CN115759043A - Document-level sensitive information detection model training and prediction method - Google Patents
Document-level sensitive information detection model training and prediction method
- Publication number: CN115759043A
- Application number: CN202211434726.7A
- Authority
- CN
- China
- Prior art keywords
- document
- sentence
- sample set
- attention weight
- context
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention relates to a document-level sensitive information detection model training and prediction method, which comprises the following steps: obtaining a training sample set; encoding each sentence in a document with a context encoder to obtain a contextual representation of each word in the sentence, and generating a document-level entity attention weight graph according to the relations on the shortest dependency paths in the sentences and the strength of contextual semantic association; inputting the attention weight graph into a graph convolutional neural network to obtain a document-level cross-sentence semantic structure, and updating the attention weight graph according to this structure; inputting the updated attention weight graph into a classifier to obtain a classification score; and calculating a loss value from the classification score and the label, and training the context encoder, the graph convolutional neural network and the classifier according to the loss value to obtain a trained model.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a document-level sensitive information detection model training and prediction method.
Background
With the development of information technology, more and more content is spread through a growing variety of channels, and the large amount of sensitive information it contains causes considerable damage to social stability. The volume of this sensitive information far exceeds what manual review can handle, so it must be identified and mined automatically with artificial intelligence techniques that perform deep semantic analysis of internet text and extract the sensitive information.
Early research focused on predicting relations between entities within a single sentence. In real-world scenarios, however, valuable relational information between entities is expressed through multiple mentions that cross sentence boundaries. The scope of relation extraction has therefore been expanded to the cross-sentence level in recent years.
A more challenging but more practical extension is document-level relation extraction, in which the system must read multiple sentences and synthesize relevant information from the entire document to infer relations between entities. How to effectively collect the relevant information in a document, however, remains a challenging research problem. Existing methods that analyze the full document as a whole often perform poorly because of interference from the large amount of irrelevant information in the document. At the same time, the lack of annotated data causes models to overfit the training set, making it difficult to find an accurate boundary between sensitive and non-sensitive documents.
Disclosure of Invention
The invention aims to provide a document-level sensitive information detection model training and prediction method that addresses the shortcomings of traditional methods, such as the lack of sufficient manually annotated data and the inability to capture rich cross-sentence semantic interactions.
In a first aspect, a document-level sensitive information detection model training method is provided, including:
obtaining a training sample set, wherein the training sample set comprises a plurality of documents and a label corresponding to each document, and each document is a positive sample or a negative sample;
encoding each sentence in the document with a context encoder to obtain a contextual representation of each word in the sentence, and generating a document-level entity attention weight graph according to the relations on the shortest dependency paths in the sentences and the strength of contextual semantic association;
inputting the attention weight graph into a graph convolutional neural network to obtain a document-level cross-sentence semantic structure, and updating the attention weight graph according to the document-level cross-sentence semantic structure;
inputting the updated attention weight graph into a classifier to obtain a classification score;
and calculating a loss value according to the classification score and the label, and training the context encoder, the graph convolution neural network and the classifier according to the loss value to obtain a trained model.
In one possible embodiment, the obtaining a training sample set includes:
acquiring a manually marked document;
capturing an internet text by using a web crawler, preprocessing the internet text, and generating a document corpus to be detected;
performing data enhancement on the manually marked document by using a pre-training language model to obtain a first positive sample set and a first negative sample set;
performing data enhancement on the corpus of the document to be detected by using a preset rule to obtain a second positive sample set;
and summarizing the first positive sample set, the second positive sample set and the first negative sample set to obtain a training sample set.
In a possible implementation manner, the performing data enhancement on the manually labeled document by using a pre-trained language model to obtain a first positive sample set and a first negative sample set includes:
determining, for the manually labeled documents, words with similar and with opposite semantics based on the distance between word vectors, according to the sensitive word candidate set and an embedding module in a pre-trained language model;
replacing corresponding words in the document by using the words with similar semantics to obtain a first positive sample set;
and replacing the corresponding words in the document by using the words with opposite semantics to obtain a first negative sample set.
In a possible implementation manner, the performing data enhancement on the corpus of the document to be detected by using a preset rule to obtain a second positive sample set includes:
modifying each document in the corpus of documents to be detected with one of the following operations: synonym replacement, random insertion, random deletion, random swap, or keeping the document unchanged;
and summarizing the modified documents to obtain a second positive sample set.
In one possible embodiment, the method further comprises:
and calculating, with a metric function, the distance between the intermediate features generated for positive and negative samples at different steps of model inference, and optimizing the model parameters with the goals of reducing the distance between same-class samples and increasing the distance between different-class samples.
In one possible embodiment, the metric function is a cross-entropy loss function with temperature coefficients.
In one possible implementation, the context encoder is a Rotary Transformer (RoFormer) model.
In a possible embodiment, the generating a document-level entity attention weight map according to the relationship on the shortest dependency path in the sentence and the association strength of the context semantics includes:
and calculating the shortest dependency paths between the words in a sentence, and constructing the document-level entity attention weight graph by analyzing the attention relations on the dependency paths and the embedded representations of the words.
In one possible embodiment, the inputting the updated attention weight map into the classifier to obtain a classification score includes:
for each entity pair (s_i, s_j) in the attention weight graph, calculating the probability of relation type r between the entity pair using the following formula:
P(r|s_i, s_j) = σ(W_1[s_i, s_j] + b_1) + σ(W_2[s_j, s_i] + b_2)
wherein W_1, W_2 ∈ R^(2d×k) are trainable weights, b_1, b_2 ∈ R^k are trainable biases, k is the number of relation classes, σ is the Sigmoid function, and [s_i, s_j] denotes the concatenation of the two vectors s_i and s_j along the last dimension;
and determining a classification score according to the entity pair and the probability.
In a second aspect, a document-level sensitive information detection model prediction method is provided, including:
obtaining a sample set to be predicted, wherein the sample set to be predicted comprises a plurality of documents;
encoding each sentence in the document with a trained context encoder to obtain a contextual representation of each word in the sentence, and generating a document-level entity attention weight graph according to the relations on the shortest dependency paths in the sentences and the strength of contextual semantic association;
inputting the attention weight graph into a trained graph convolutional neural network to obtain a document-level cross-sentence semantic structure, and updating the attention weight graph according to the document-level cross-sentence semantic structure;
inputting the updated attention weight graph into a trained classifier to obtain a classification score;
and predicting the document according to the classification score.
According to the document-level sensitive information detection model training and prediction method, the model mines a latent document-level graph to enhance cross-sentence relational reasoning, so that the model can be adequately trained from a small amount of manually annotated data, its performance is improved, and sensitive information can be found in massive amounts of real internet text.
Drawings
FIG. 1 is a flowchart of a document-level sensitive information detection model training method disclosed in an embodiment of the present invention;
FIG. 2 is a flowchart of a training data preprocessing method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a document-level sensitive information detection model prediction method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Before describing the embodiments of the present application in detail, the terms and symbols used in the embodiments of the present application will be explained first.
RoFormer: a Rotary Transformer model, an improved Transformer model described in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding", which is available on arXiv.
GCNs: graph constraint Neural Networks.
For the convenience of understanding the embodiments of the present invention, the following detailed description will be given with reference to the accompanying drawings, which are not intended to limit the embodiments of the present invention.
The key information of an article may be concentrated in a single passage or scattered throughout the text. To handle the latter case, the pieces of evidence in the document must be linked step by step and inferences drawn incrementally in order to deduce the relations between entities across the whole article.
FIG. 1 is a flowchart of a document-level sensitive information detection model training method according to an embodiment of the present invention. As shown in FIG. 1, the method at least comprises: step 101, obtaining a training sample set, wherein the training sample set comprises a plurality of documents and a label corresponding to each document, and each document is a positive sample or a negative sample; step 102, encoding each sentence in the document with a context encoder to obtain a contextual representation of each word in the sentence, and generating a document-level entity attention weight graph according to the relations on the shortest dependency paths in the sentences and the strength of contextual semantic association; step 103, inputting the attention weight graph into a graph convolutional neural network to obtain a document-level cross-sentence semantic structure, and updating the attention weight graph according to the document-level cross-sentence semantic structure; step 104, inputting the updated attention weight graph into a classifier to obtain a classification score; and step 105, calculating a loss value from the classification score and the label, and training the context encoder, the graph convolutional neural network and the classifier according to the loss value to obtain a trained model.
First, in step 101, a training sample set is obtained, where the training sample set includes a number of documents and a label corresponding to each document, and each document is a positive sample or a negative sample.
Specifically, the method for obtaining the training sample set is shown in fig. 2. Fig. 2 is a flowchart of a training data preprocessing method according to an embodiment of the present invention.
In step 201a, manually labeled documents are acquired; in step 201b, a web crawler is used to crawl internet text, which is preprocessed to generate a corpus of documents to be detected.
In one possible embodiment, preprocessing the internet text comprises deleting special characters, meaningless spaces, links and pictures.
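By way of illustration only, a minimal preprocessing sketch is given below; the specific regular expressions and the characters treated as special are illustrative assumptions rather than part of the claimed method.

```python
import re

def preprocess(text: str) -> str:
    """Minimal cleaning sketch for crawled internet text (assumed rules)."""
    text = re.sub(r"https?://\S+", " ", text)               # delete links
    text = re.sub(r"<img[^>]*>", " ", text)                 # delete picture markup (assumed form)
    text = re.sub(r"[^\w\s.,;:!?。，；：！？]", " ", text)   # delete special characters
    text = re.sub(r"\s+", " ", text).strip()                # collapse meaningless spaces
    return text
```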
In this technical scheme, a small amount of manually labeled data is combined with massive amounts of unlabeled internet data for learning, so that the machine learning model can be adequately trained from a small amount of labeled data, its performance is improved, and the goal of finding sensitive information in massive amounts of real internet text is achieved.
In step 202a, the pre-training language model is used to perform data enhancement on the manually labeled document acquired in step 201a, so as to obtain a first positive sample set and a first negative sample set.
Specifically, words with similar and with opposite semantics are determined based on the distance between word vectors, according to the sensitive word candidate set and the embedding module of the pre-trained language model. Words with similar semantics replace the corresponding words in the document to obtain the first positive sample set; words with opposite semantics replace the corresponding words in the document to obtain the first negative sample set.
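A sketch of this replacement step is given below, assuming a PyTorch embedding table and cosine similarity as the word-vector distance; both choices are assumptions.

```python
import torch
import torch.nn.functional as F

def similar_and_opposite(word_id: int, embedding: torch.nn.Embedding, top_k: int = 5):
    """Rank vocabulary words by cosine similarity to a sensitive word: the most
    similar words give positive-sample replacements, the least similar give
    negative-sample replacements (sketch; cosine similarity is an assumption)."""
    vecs = F.normalize(embedding.weight, dim=-1)        # (V, d) unit-norm word vectors
    sims = vecs @ vecs[word_id]                         # cosine similarity to every word
    similar = torch.topk(sims, top_k + 1).indices[1:]   # skip the word itself
    opposite = torch.topk(-sims, top_k).indices
    return similar.tolist(), opposite.tolist()
```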
In step 202b, the corpus of the document to be detected acquired in step 201b is subjected to data enhancement by using a preset rule, so as to obtain a second positive sample set.
Specifically, each document in the corpus of documents to be detected is modified with one of the following operations: synonym replacement, random insertion, random deletion, random swap, or keeping the document unchanged. The modified documents are then aggregated to obtain the second positive sample set.
In a possible implementation, one of the above five operations is selected at random with the following preset probabilities: synonym replacement 2.5%, random insertion 5%, random deletion 5%, random swap 2.5%, and keep unchanged 85%. In other words, a small proportion of the original data is modified, thereby achieving data enhancement.
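The following sketch applies one of the five operations per document with the preset probabilities above; the token-level granularity and the synonym dictionary are assumptions.

```python
import random

OPS = [("synonym", 0.025), ("insert", 0.05), ("delete", 0.05),
       ("swap", 0.025), ("keep", 0.85)]  # preset probabilities from the embodiment above

def augment(tokens, synonyms):
    """Apply one randomly chosen operation to a tokenized document (sketch)."""
    op = random.choices([o for o, _ in OPS], weights=[p for _, p in OPS])[0]
    tokens = list(tokens)
    if op == "synonym" and tokens:
        i = random.randrange(len(tokens))
        tokens[i] = random.choice(synonyms.get(tokens[i], [tokens[i]]))
    elif op == "insert" and tokens:
        tokens.insert(random.randrange(len(tokens) + 1), random.choice(tokens))
    elif op == "delete" and len(tokens) > 1:
        del tokens[random.randrange(len(tokens))]
    elif op == "swap" and len(tokens) > 1:
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens  # "keep" leaves the document unchanged
```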
In step 203, the first positive sample set, the second positive sample set and the first negative sample set obtained in step 202a and step 202b are summarized to obtain a training sample set.
Then, returning to fig. 1, in step 102, each sentence in the document is encoded by using a context encoder to obtain a context representation of each word in the sentence, and a document-level entity attention weight map is generated according to the relationship on the shortest dependency path in the sentence and the association strength of the context semantics.
Specifically, for a given document d to be detected, each sentence d_i in it is input to a context encoder, which outputs a contextualized representation of each word in d_i. In one possible embodiment, the context encoder is a pre-trained RoFormer model. The hidden-layer computation of the encoder is shown in equation (1):
[h_1^i, ..., h_N^i] = Encoder([x_1^i, ..., x_N^i])   (1)
where h_j^i denotes the hidden vector representation of the j-th token of sentence d_i, x_j^i denotes the j-th tokenized word of the original sentence, N denotes the length of the sentence, and h_j^i ∈ R^d, where d denotes the dimension of the character vector.
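A sketch of this encoding step using the Hugging Face transformers library is shown below; the checkpoint name is an assumed example, and any pre-trained RoFormer-style encoder could be substituted.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "junnyu/roformer_chinese_base"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def encode_sentence(sentence: str) -> torch.Tensor:
    """Return the contextualized hidden vectors h_j^i of one sentence d_i (equation (1))."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape (1, N, d)
    return hidden.squeeze(0)                          # shape (N, d)
```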
To construct the document-level graph, the present invention uses the tokens on the shortest dependency paths between mentions in a sentence, and mean-pools the vectors of the entities and their adjacent contexts in the sentence to fuse as much relevant semantic information as possible, as shown in formula (2):
e_j = (1/t) Σ_{i ∈ R_j, r_i ≤ t} h_i   (2)
where t denotes the ordinal threshold selecting which activation values participate in the pooling, R_j denotes the pooling region within the neighbor context of the j-th entity, i indexes the hidden vector representations inside this pooling region, and r_i and h_i denote the ordinal (rank) and activation value of hidden vector i, respectively.
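One possible reading of formula (2) is sketched below: within an entity's neighbor-context pooling region R_j, only the t hidden vectors with the largest activation values are averaged; ranking vectors by their L2 norm is an assumption.

```python
import torch

def pool_entity_context(hidden: torch.Tensor, region: list, t: int) -> torch.Tensor:
    """Mean-pool the top-t hidden vectors inside a pooling region R_j (sketch of formula (2))."""
    h = hidden[region]                               # (|R_j|, d) vectors in the neighbor context
    ranks = h.norm(dim=-1).argsort(descending=True)  # rank by activation value (assumed: L2 norm)
    return h[ranks[:t]].mean(dim=0)                  # pooled entity representation
```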
Finally, the binary, ternary and quaternary relations among the entities are extracted, a coreference scoring matrix is generated for all entity pairs, hierarchical clustering is performed on this matrix to obtain clustering scores, and the document-level entity attention weight graph A is generated.
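A minimal sketch of turning the pairwise scores into the attention weight graph A is given below; the row-wise softmax normalization is an assumption, since the description does not fix a particular normalization scheme.

```python
import torch

def build_attention_graph(scores: torch.Tensor) -> torch.Tensor:
    """Convert an entity-pair scoring matrix into the document-level entity
    attention weight graph A (sketch; normalization scheme assumed)."""
    scores = (scores + scores.T) / 2       # symmetrize the pairwise scores
    return torch.softmax(scores, dim=-1)   # each row of A sums to 1
```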
Next, in step 103, the attention weight graph is input into a graph convolutional neural network to obtain a document-level cross-sentence semantic structure, and the attention weight graph is updated according to the document-level cross-sentence semantic structure.
Specifically, the present invention performs reasoning based on graph convolutional networks (GCNs). Formally, a graph G containing n nodes can be represented by an n × n adjacency matrix A, where A is the document-level entity attention weight graph generated in step 102. The l-th layer takes the representation S^(l-1) produced by the previous layer as input and outputs an updated representation S^l; the convolution computation is shown in formulas (3) and (4):
S^l = σ(A W^l S^(l-1) + b^l)   (3)
S_i^0 = h_i   (4)
where W^l and b^l are the weight matrix and bias vector of the l-th layer, respectively, σ is the Sigmoid activation function, and S_i^0 is the initial context representation of the i-th node constructed by the node constructor.
The present invention applies dense connections to the GCNs in order to capture more structural information in large document-level graphs. With the help of dense connections, a deeper model can be trained, allowing richer local and non-local information to be captured and better graph representations to be learned.
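The following sketch implements formula (3) with dense connections, letting every layer see the concatenation of all earlier layer outputs; this particular densification scheme is an assumption.

```python
import torch
import torch.nn as nn

class DenseGCN(nn.Module):
    """Graph convolution with dense connections (sketch of formula (3))."""
    def __init__(self, dim: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(dim * (l + 1), dim) for l in range(num_layers)
        )

    def forward(self, A: torch.Tensor, S0: torch.Tensor) -> torch.Tensor:
        outputs = [S0]                                 # S^0: initial node representations, shape (n, dim)
        for layer in self.layers:
            dense_in = torch.cat(outputs, dim=-1)      # dense connection to all earlier layers
            S = torch.sigmoid(A @ layer(dense_in))     # S^l = sigma(A W^l S^{l-1} + b^l)
            outputs.append(S)
        return outputs[-1]
```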
Unlike traditional methods that reason over a latent structure only once, the present method further refines the document-level association information layer by layer based on the document-level entity attention weight graph and the context-dependent semantic hidden vector representations, guiding the model to infer the cross-sentence semantic structure related to sensitive information.
Next, in step 104, the updated attention weight graph is input into the classifier to obtain a classification score.
Specifically, for each entity pair (s_i, s_j), the probability of each relation type r is calculated by formula (5):
P(r|s_i, s_j) = σ(W_1[s_i, s_j] + b_1) + σ(W_2[s_j, s_i] + b_2)   (5)
where W_1, W_2 ∈ R^(2d×k) are trainable weights, b_1, b_2 ∈ R^k are trainable biases, k is the number of relation classes, σ is the Sigmoid function, and [s_i, s_j] denotes the concatenation of the two vectors s_i and s_j along the last dimension.
The classification score is then determined from the entity pairs (s_i, s_j) and the probabilities P(r|s_i, s_j).
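A sketch of the pairwise classifier in formula (5) is given below; packaging W_1, b_1 and W_2, b_2 as two linear layers is an implementation assumption.

```python
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Score every relation type r for an entity pair (sketch of formula (5))."""
    def __init__(self, d: int, k: int):
        super().__init__()
        self.w1 = nn.Linear(2 * d, k)  # W_1 and b_1
        self.w2 = nn.Linear(2 * d, k)  # W_2 and b_2

    def forward(self, s_i: torch.Tensor, s_j: torch.Tensor) -> torch.Tensor:
        fwd = torch.cat([s_i, s_j], dim=-1)   # [s_i, s_j]
        bwd = torch.cat([s_j, s_i], dim=-1)   # [s_j, s_i]
        return torch.sigmoid(self.w1(fwd)) + torch.sigmoid(self.w2(bwd))
```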
Finally, in step 105, a loss value is calculated according to the classification score and the label, and the context encoder, the graph convolution neural network and the classifier are trained according to the loss value to obtain a trained model.
In some possible embodiments, for the intermediate feature variables generated in steps 102, 103 and 104, a metric function is used to calculate the distance between the intermediate features generated for positive and negative samples at these different steps of model inference, and the model parameters are optimized with the goals of reducing the distance between same-class samples and increasing the distance between different-class samples.
In some embodiments, the intermediate feature variables include the attention matrices and hidden-layer variables of each layer of the RoFormer model, the document-level entity attention weight graph A, and the relation type probability distribution P(r|s_i, s_j). The metric function is a cross-entropy loss function with a temperature coefficient.
Illustratively, taking the relation type probability distribution as an example, the cross-entropy calculation formula is shown in formula (6):
L = -log( exp(q·k_+ / τ) / Σ_{i=1}^{N} exp(q·k_i / τ) )   (6)
where τ is the temperature coefficient, q and k denote probability distribution logits derived from the model, q denotes the query, k denotes the key, k_+ denotes a positive sample, k_i denotes the i-th sample, and N is the total number of positive and negative samples in the batch.
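An InfoNCE-style sketch of formula (6) is given below; the default temperature value is an assumption.

```python
import torch

def contrastive_loss(q: torch.Tensor, k_pos: torch.Tensor,
                     k_all: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Temperature-scaled cross entropy (sketch of formula (6)).
    q: (d,) query feature; k_pos: (d,) its positive sample; k_all: (N, d) all batch samples."""
    pos = torch.exp(torch.dot(q, k_pos) / tau)
    denom = torch.exp(k_all @ q / tau).sum()
    return -torch.log(pos / denom)
```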
Through contrastive learning, the distances between same-class vectors are reduced and the distances between different-class vectors are increased, so that the model learns a clearer and more accurate separating hyperplane during harmful-content detection training and avoids overfitting to local data.
Corresponding to the model training method, the invention also discloses a document-level sensitive information detection model prediction method. Fig. 3 is a flowchart of a document-level sensitive information detection model prediction method according to an embodiment of the present invention.
In step 301, a sample set to be predicted is obtained, where the sample set to be predicted includes several documents.
In step 302, each sentence in the document is encoded with a trained context encoder to obtain a contextual representation of each word in the sentence, and a document-level entity attention weight graph is generated according to the relations on the shortest dependency paths in the sentences and the strength of contextual semantic association.
In step 303, the attention weight graph is input into a trained graph convolutional neural network to obtain a document-level cross-sentence semantic structure, and the attention weight graph is updated according to the document-level cross-sentence semantic structure.
In step 304, the updated attention weight map is input into the trained classifier to obtain a classification score.
In step 305, the document is predicted based on the classification score.
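Putting steps 301 to 305 together, an end-to-end inference sketch is given below; it reuses the component interfaces sketched in the training section, treats sentence vectors as graph nodes for brevity, and thresholds the maximum pairwise score; all of these choices are illustrative assumptions.

```python
import torch

def predict_document(sentences, encode, gcn, classifier, build_graph, threshold=0.5):
    """Predict whether a document contains sensitive information (sketch of steps 301-305)."""
    hidden = [encode(s) for s in sentences]                # step 302: contextual encoding
    nodes = torch.stack([h.mean(dim=0) for h in hidden])   # simple node construction (assumed)
    A = build_graph(nodes @ nodes.T)                       # document-level attention weight graph
    S = gcn(A, nodes)                                      # step 303: cross-sentence structure
    n = S.size(0)
    scores = [classifier(S[i], S[j]).max()                 # step 304: pairwise classification scores
              for i in range(n) for j in range(n) if i != j]
    if not scores:
        return False
    return bool(torch.stack(scores).max() > threshold)     # step 305: sensitive or not
```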
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, where the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A document-level sensitive information detection model training method comprises the following steps:
obtaining a training sample set, wherein the training sample set comprises a plurality of documents and a label corresponding to each document, and each document is a positive sample or a negative sample;
encoding each sentence in the document with a context encoder to obtain a contextual representation of each word in the sentence, and generating a document-level entity attention weight graph according to the relations on the shortest dependency paths in the sentences and the strength of contextual semantic association;
inputting the attention weight graph into a graph convolutional neural network to obtain a document-level cross-sentence semantic structure, and updating the attention weight graph according to the document-level cross-sentence semantic structure;
inputting the updated attention weight map into a classifier to obtain a classification score;
and calculating a loss value according to the classification score and the label, and training the context encoder, the graph convolution neural network and the classifier according to the loss value to obtain a trained model.
2. The method of claim 1, wherein the obtaining a training sample set comprises:
acquiring a manually marked document;
capturing an internet text by using a web crawler, preprocessing the internet text, and generating a document corpus to be detected;
performing data enhancement on the manually marked document by using a pre-training language model to obtain a first positive sample set and a first negative sample set;
performing data enhancement on the corpus of the document to be detected by using a preset rule to obtain a second positive sample set;
and summarizing the first positive sample set, the second positive sample set and the first negative sample set to obtain a training sample set.
3. The method of claim 2, wherein the performing data enhancement on the manually labeled document using a pre-trained language model to obtain a first set of positive samples and a first set of negative samples comprises:
determining, for the manually labeled documents, words with similar and with opposite semantics based on the distance between word vectors, according to the sensitive word candidate set and an embedding module in a pre-trained language model;
replacing corresponding words in the document by using the words with similar semantics to obtain a first positive sample set;
and replacing the corresponding words in the document by using the words with opposite semantics to obtain a first negative sample set.
4. The method according to claim 2, wherein the performing data enhancement on the corpus of the document to be detected by using a preset rule to obtain a second positive sample set comprises:
modifying each document in the corpus of documents to be detected with one of the following operations: synonym replacement, random insertion, random deletion, random swap, or keeping the document unchanged;
and summarizing the modified documents to obtain a second positive sample set.
5. The method of claim 1, further comprising:
and calculating, with a metric function, the distance between the intermediate features generated for positive and negative samples at different steps of model inference, and optimizing the model parameters with the goals of reducing the distance between same-class samples and increasing the distance between different-class samples.
6. The method of claim 5, wherein the metric function is a cross entropy loss function with temperature coefficients.
7. The method of claim 1, wherein the context encoder is a Rotary Transformer (RoFormer) model.
8. The method according to claim 1, wherein generating a document-level entity attention weight map according to the relationship on the shortest dependency path in the sentence and the association strength of the context semantics comprises:
and calculating the shortest dependency paths between the words in a sentence, and constructing the document-level entity attention weight graph by analyzing the attention relations on the dependency paths and the embedded representations of the words.
9. The method of claim 1, wherein inputting the updated attention weight map into a classifier to obtain a classification score comprises:
for each entity pair (s_i, s_j) in the attention weight graph, calculating the probability of relation type r between the entity pair using the following formula:
P(r|s_i, s_j) = σ(W_1[s_i, s_j] + b_1) + σ(W_2[s_j, s_i] + b_2)
wherein W_1, W_2 ∈ R^(2d×k) are trainable weights, b_1, b_2 ∈ R^k are trainable biases, k is the number of relation classes, σ is the Sigmoid function, and [s_i, s_j] denotes the concatenation of the two vectors s_i and s_j along the last dimension;
and determining a classification score according to the entity pair and the probability.
10. A document-level sensitive information detection model prediction method comprises the following steps:
obtaining a sample set to be predicted, wherein the sample set to be predicted comprises a plurality of documents;
encoding each sentence in the document with a trained context encoder to obtain a contextual representation of each word in the sentence, and generating a document-level entity attention weight graph according to the relations on the shortest dependency paths in the sentences and the strength of contextual semantic association;
inputting the attention weight graph into a trained graph convolutional neural network to obtain a document-level cross-sentence semantic structure, and updating the attention weight graph according to the document-level cross-sentence semantic structure;
inputting the updated attention weight graph into a trained classifier to obtain a classification score;
and predicting the document according to the classification score.
Priority Applications (1)
- CN202211434726.7A (CN115759043A) | Priority date 2022-11-16 | Filing date 2022-11-16 | Document-level sensitive information detection model training and prediction method

Publications (1)
- CN115759043A | Publication date 2023-03-07

Family
- ID=85372648

Family Applications (1)
- CN202211434726.7A | CN115759043A (en) | Pending

Country Status (1)
- CN | CN115759043A (en)

Cited By (2)
- CN117496542A * | Priority date 2023-12-29 | Publication date 2024-02-02 | 恒生电子股份有限公司 | Document information extraction method, device, electronic equipment and storage medium
- CN117496542B * | Priority date 2023-12-29 | Publication date 2024-03-15 | 恒生电子股份有限公司 | Document information extraction method, device, electronic equipment and storage medium
Legal Events
- PB01 | Publication
- SE01 | Entry into force of request for substantive examination