CN116244738A

CN116244738A - Sensitive information detection method based on graph neural network

Info

Publication number: CN116244738A
Application number: CN202211743964.6A
Authority: CN
Inventors: 虞雁群; 刘彦伸; 吴艳; 郭银锋
Original assignee: Zhejiang Yu'an Information Technology Co ltd
Current assignee: Zhejiang Yu'an Information Technology Co ltd
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-06-09
Anticipated expiration: 2042-12-30
Also published as: CN116244738B

Abstract

The invention discloses a sensitive information detection method based on a graph neural network. Text data collected from the network and from the user are used to construct a graph structure: the bag-of-words model of each text serves as a node, and the Jaccard similarity between texts serves as the weight of the edge between the corresponding nodes. In the training stage, the user designates sensitive information, which is homomorphically encrypted to protect the user's privacy; the graph structure for model training is then built from the user-designated sensitive information and the information collected from the network. In the detection stage, data are obtained by scanning information sharing platforms and hacker websites and are then preprocessed. The collected data are added as nodes to the graph structure to obtain a new graph structure, on which the graph neural network performs detection. By constructing the relations among all documents through the graph structure, the invention reduces labor cost. Meanwhile, homomorphic encryption of the user data protects the privacy of the user's sensitive data.

Description

Sensitive information detection method based on graph neural network

Technical Field

The invention relates to the field of information security and to sensitive information monitoring technology, and in particular to a sensitive information detection method based on a graph neural network.

Background

Enterprise sensitive data contains users' sensitive information. Once it is leaked, the enterprise suffers heavy economic losses and its users are also harmed. Ensuring the security of enterprise sensitive data has therefore become a focus of enterprise information protection work. The traditional approach relies on manual comparison, which has high labor cost and poor flexibility and may expose sensitive information to the inspectors. With the development of deep learning, some researchers have used word-vector methods to detect sensitive information. However, such content-based detection ignores the relations between texts, which easily leads to a high false alarm rate. The invention constructs the relations between texts through a graph structure and homomorphically encrypts the data, so that computation is carried out while user privacy is protected. It thereby addresses two problems of prior methods: the leakage of sensitive information during detection, and the high false alarm rate caused by ignoring the relations between texts.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a sensitive information detection method based on a graph neural network. The relations among all documents are constructed through a graph structure, which addresses the high miss rate and high false alarm rate of traditional methods and reduces labor cost. Meanwhile, homomorphic encryption of the user data protects the privacy of the user's sensitive data.

In order to achieve the above object, the present invention is realized by the following technical scheme: a sensitive information detection method based on a graph neural network comprises the following steps:

1. Collect the training set. The training data consists of two parts. The first part is non-sensitive information collected from code sharing platforms by crawlers and manually. The collected information is parsed using the BeautifulSoup library and filtered so that only text information remains, and its label is set to (0, 0, 0, 1)^T. The second part is sensitive information provided by the user, whose labels are set to (1, 0, 0, 0)^T, (0, 1, 0, 0)^T, and (0, 0, 1, 0)^T, indicating high, medium, and low sensitivity levels, respectively.

2. Construct a graph structure from the training set for training. In the graph structure, the bag-of-words model of each text in the training set serves as a node, and the Jaccard similarity between texts serves as the weight of the edge between the corresponding nodes. The specific formula is as follows:

where D_i and D_j denote the word sets obtained by segmenting two documents, |D_i| denotes the number of words in word set D_i, |D_i ∩ D_j| denotes the number of words in the intersection of D_i and D_j, |D_i ∪ D_j| denotes the number of words in their union, |D_i - D_j| denotes the number of words in their difference set, and α is a hyperparameter that adjusts the size of the penalty introduced by differing document lengths.
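As a minimal sketch of the edge weight, the following Python computes the standard Jaccard index |D_i ∩ D_j| / |D_i ∪ D_j| between two word sets. The α length penalty is deliberately not modeled, since its exact form is not given in this text. The function names and the regular-expression tokenizer are illustrative assumptions; a real system would use a proper word segmenter.

```python
import re


def bag_of_words(text):
    """Split text into a set of lowercase word tokens.

    A simple stand-in for word segmentation (assumption of this sketch).
    """
    return set(re.findall(r"\w+", text.lower()))


def jaccard(d_i, d_j):
    """Standard Jaccard index of two word sets; 0.0 when both are empty."""
    union = d_i | d_j
    if not union:
        return 0.0
    return len(d_i & d_j) / len(union)


doc_a = bag_of_words("db password = hunter2")
doc_b = bag_of_words("db password = secret")
similarity = jaccard(doc_a, doc_b)  # 2 shared words out of 4 distinct words
```

Here the two documents share `db` and `password` out of four distinct words, so the edge between their nodes would carry weight 0.5 under the unpenalized index.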

3. Train the graph neural network using the graph structure constructed from the training set. Training is performed on sampled subgraphs until the loss function no longer decreases.

4. Collect the detection data. Data on information sharing platforms are collected by crawlers and manually, and the source and collection time of the collected information are recorded. After parsing with the BeautifulSoup library, only text information is retained. The detection data and the training data are used together to construct a graph structure for detection.

5. Classify the nodes of the graph structure from step 4 using the trained graph neural network.

6. Determine whether the detected data contains sensitive data. If no sensitive information is present, no processing is performed; if sensitive information of any sensitivity level is found, its sensitivity level is recorded. A sensitive information record is then generated from the source and collection time of the data recorded in step 4.

The specific recording structure is as follows:

Sensitive information number

Sensitive information level

Sensitive information source

Sensitive information collection time
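The four fields of the record structure above can be sketched as a small Python data class. The field names, the sequential numbering scheme, and the example source URL are assumptions made for illustration; the text only names the four fields.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import count

# Hypothetical numbering scheme: sequential integers starting at 1.
_record_numbers = count(1)


@dataclass(frozen=True)
class SensitiveInfoRecord:
    number: int            # sensitive information number
    level: str             # "high", "medium", or "low"
    source: str            # where the data was collected
    collected_at: datetime  # when the data was collected


def make_record(level, source, collected_at):
    """Create a record for a document classified as sensitive."""
    if level not in ("high", "medium", "low"):
        raise ValueError(f"unknown sensitivity level: {level}")
    return SensitiveInfoRecord(next(_record_numbers), level, source, collected_at)


record = make_record("high", "https://example.com/paste/1", datetime(2022, 12, 30, 8, 0))
```

Non-sensitive documents produce no record, which matches the rule that nothing is done when no sensitive information is found.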

Preferably, in order to protect the privacy of the sensitive information of the user, the sensitive information of step 1 is encrypted by using homomorphic encryption technology.

Preferably, the model of step 3 is trained on sampled subgraphs. Each node update takes place within one subgraph rather than over the entire graph. The subgraph consists of neighbor nodes obtained by randomly sampling from all neighbors of the node.
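A minimal sketch of the neighbor sampling described above, assuming uniform sampling without replacement and a fixed sample size `k` (neither is specified in the text). The adjacency-list representation and function names are illustrative.

```python
import random


def sample_neighbors(adjacency, node, k, rng):
    """Return up to k neighbors of node, drawn uniformly without replacement."""
    neighbors = list(adjacency[node])
    if len(neighbors) <= k:
        return neighbors
    return rng.sample(neighbors, k)


def sample_subgraph(adjacency, node, k, rng):
    """Build the one-hop subgraph used to update node: node -> sampled neighbors."""
    return {node: sample_neighbors(adjacency, node, k, rng)}


adjacency = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
rng = random.Random(0)  # seeded only to make the sketch repeatable
subgraph = sample_subgraph(adjacency, 0, k=2, rng=rng)
```

Because only `k` neighbors are touched per update, the cost of updating a node no longer grows with its full degree.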

Preferably, the loss function of step 3 is the cross-entropy loss. The specific formula is as follows:

where n is the total number of nodes, y_i is the true label of node i, a_i is the predicted label of node i, and σ is the softmax activation function.
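For illustration, one common form of the cross-entropy loss with a softmax activation can be sketched as follows. This sketch assumes that a_i is the raw (pre-softmax) output for node i and that the loss is averaged over the n nodes; the exact formula of the method is not reproduced in this text, so both points are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, outputs):
    """Mean over nodes of -sum_c y_i[c] * log(softmax(a_i)[c])."""
    losses = []
    for y, a in zip(y_true, outputs):
        probs = softmax(a)
        losses.append(-sum(yc * math.log(pc) for yc, pc in zip(y, probs) if yc > 0))
    return sum(losses) / len(losses)


# Two nodes with one-hot labels: high sensitivity and non-sensitive.
labels = [[1, 0, 0, 0], [0, 0, 0, 1]]
uniform_outputs = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = cross_entropy(labels, uniform_outputs)
```

With uninformative outputs every class receives probability 1/4, so the loss equals log 4; it falls toward zero as the output for the true class grows.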

The invention has the following beneficial effects:

1. Deep learning replaces manual detection, which greatly reduces labor cost.

2. The graph structure represents the whole text set, so the relations among texts can be constructed, which reduces the false alarm rate and improves the detection performance of the model. When the test graph structure is constructed, multiple documents can be added at the same time, which greatly improves detection efficiency.

3. Using homomorphic encryption, the privacy of user data is protected while the user can specify sensitive information.

Drawings

The invention is described in detail below with reference to the drawings and the detailed description;

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flow chart of the neural network training of the present invention;

FIG. 3 is an overall architecture diagram of the present invention.

Detailed Description

To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to the detailed description.

Referring to fig. 1-3, the present embodiment adopts the following technical scheme: a sensitive information detection method based on a graph neural network comprises the following steps:

Step 101: Information on code hosting platforms, network disks, libraries, hacker forums, and darknets is collected by crawlers and manually, and the source and collection time of the collected information are recorded.

Step 102: The crawled data are usually in HTML format. The data are parsed with BeautifulSoup and filtered to obtain text information.
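The text names BeautifulSoup for this step. To keep the following sketch free of third-party dependencies, it uses Python's standard-library `html.parser` instead, which is a substitution made here and not the library named in the text. It keeps visible text and drops the contents of `<script>` and `<style>` elements; the class and function names are illustrative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>api_key = abc123</p><script>var x=1;</script></body></html>"
)
text = html_to_text(page)
```

The resulting string is what would be segmented into a bag of words for the graph node.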

Step 103: The graph structure is built from the processed data, and the data are homomorphically encrypted. The bag-of-words model of each document serves as a node in the graph structure, and the weight of the edge between two nodes is defined by the Jaccard similarity between the corresponding documents. The specific formula is as follows:

Step 104: The graph is input into the graph neural network, which classifies the nodes of the input graph and outputs a predicted label for each node. The labels (1, 0, 0, 0)^T, (0, 1, 0, 0)^T, (0, 0, 1, 0)^T, and (0, 0, 0, 1)^T represent highly sensitive, moderately sensitive, low-sensitivity, and non-sensitive information, respectively. The input of the model is a graph: nodes and edges are added to the graph structure constructed during training to form the graph structure used for testing. Twenty nodes are added each time, so the model can check 20 documents for sensitive information simultaneously.
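The way detection documents join the training graph in batches of 20 can be sketched as below. The data layout (a list of word sets plus a dictionary of weighted edges), the use of the unpenalized Jaccard index, and the choice to drop zero-weight edges are all assumptions of this sketch.

```python
BATCH_SIZE = 20  # the text adds 20 nodes per detection pass


def jaccard(a, b):
    """Standard Jaccard index of two word sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def batches(documents, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


def extend_graph(node_words, edges, new_docs):
    """Append each new word set as a node and link it to all earlier nodes.

    Zero-weight edges are dropped (an assumption of this sketch).
    """
    for words in new_docs:
        new_id = len(node_words)
        for old_id, old_words in enumerate(node_words):
            weight = jaccard(words, old_words)
            if weight > 0:
                edges[(old_id, new_id)] = weight
        node_words.append(words)
    return node_words, edges


# Graph built during training (two nodes), then one detection batch is added.
node_words = [{"db", "password"}, {"hello", "world"}]
edges = {}
incoming = [{"db", "password", "root"}]
for batch in batches(incoming):
    extend_graph(node_words, edges, batch)
```

The new node ends up connected only to the training node it overlaps with, which is the neighborhood the trained network would then use to classify it.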

Step 105: Determine whether the input documents contain sensitive information. If not, no processing is performed; if sensitive information of any sensitivity level is found, its sensitivity level is recorded.

Step 106: generating a sensitive information record according to the source and the collection time of the sensitive data obtained in the step 101, wherein the specific record structure is as follows:

Sensitive information number

Sensitive information level

Sensitive information source

Sensitive information collection time

Step 107: Determine whether detection needs to continue. If so, the graph structure is rebuilt with new data for detection; otherwise, the program exits.

Example 1: As shown in FIG. 2, a graph structure is constructed from the user input and the data collected from the network, and the graph neural network is trained on sampled subgraphs until the model converges. The process comprises the following steps:

Step 401A: The user designates the sensitive information to be matched and defines its sensitivity levels, from high to low, using one-hot encoding: (1, 0, 0, 0)^T, (0, 1, 0, 0)^T, (0, 0, 1, 0)^T.

Step 401B: Information is collected from code hosting platforms, network disks, libraries, hacker forums, and darknets. The collected information is ordinary non-sensitive information, and its label is set to (0, 0, 0, 1)^T.

Steps 402A, 402B: Clean the data. Most data obtained by the crawler are in HTML format; the data are parsed using BeautifulSoup and filtered to obtain text information.

Step 403A: To protect the user's data privacy, the user data are homomorphically encrypted. Homomorphic encryption serves as a secure multi-party computation technique: decrypting the result of an operation performed on ciphertext yields the same result as performing that operation on the plaintext, so computation can be carried out while the plaintext remains encrypted.
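The text does not say which homomorphic scheme is used. Purely to illustrate the property described above, the following toy sketch uses the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. The choice of Paillier and the small primes are assumptions for readability and offer no real security.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    # mu is the inverse of L(g^lam mod n^2) modulo n, where L(x) = (x - 1) // n.
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m, rng):
    """Encrypt integer m (0 <= m < n) with fresh randomness r coprime to n."""
    n, g = pub
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub, priv, c):
    """Recover the plaintext integer from ciphertext c."""
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


pub, priv = keygen(1009, 1013)
rng = random.Random(42)  # seeded only to make the sketch repeatable
c1 = encrypt(pub, 17, rng)
c2 = encrypt(pub, 25, rng)
c_sum = (c1 * c2) % (pub[0] ** 2)  # operate on ciphertexts only
total = decrypt(pub, priv, c_sum)   # equals 17 + 25
```

The party that multiplies `c1` and `c2` never sees 17 or 25, yet the key holder decrypts their sum. Computing set overlaps for a similarity score over encrypted data would need more machinery than this single property.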

Step 404: Construct the graph structure. The nodes of the graph are the bag-of-words models of the texts, and the weight of the edge between two nodes is defined by the Jaccard similarity between the corresponding documents. The specific formula is as follows:

Step 405: Train the graph neural network on the constructed graph structure. To avoid excessive memory consumption during training, node features are updated using sampled subgraphs: each node update takes place within a subgraph of the graph structure rather than over the entire graph. This reduces the memory required for training and can improve the model's accuracy on unseen nodes. The update formula is as follows:
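The update formula itself is not reproduced in this text. As an illustration only, the sketch below shows one simple way a node's features could be updated from a sampled subgraph: an edge-weighted average of the node's own features and those of up to `k` sampled neighbors. This aggregator is hypothetical; a trained graph neural network layer would additionally apply learned weights and a nonlinearity.

```python
import random


def weighted_mean_update(features, weights, node, k, rng):
    """Return new features for node from up to k randomly sampled neighbors.

    features: dict mapping node -> list of floats
    weights:  dict mapping node -> {neighbor: edge weight}
    The node itself contributes with weight 1.0 (assumption of this sketch).
    """
    neighbors = list(weights.get(node, {}))
    sampled = neighbors if len(neighbors) <= k else rng.sample(neighbors, k)
    dim = len(features[node])
    total = list(features[node])
    norm = 1.0
    for nb in sampled:
        w = weights[node][nb]
        for d in range(dim):
            total[d] += w * features[nb][d]
        norm += w
    return [value / norm for value in total]


features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0]}
weights = {0: {1: 0.5, 2: 0.5}}
rng = random.Random(0)
updated = weighted_mean_update(features, weights, 0, k=2, rng=rng)
```

Only the sampled neighbors' features need to be in memory for each update, which is the memory saving the step describes.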

Step 406: Train until the model converges, then save the model for sensitive information detection. Convergence is defined as the loss function being sufficiently low and no longer decreasing. The loss function is specifically formulated as follows:

where n is the total number of nodes, y_i is the true label of node i, and a_i is the predicted label of node i.
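The convergence criterion of step 406 (loss sufficiently low and no longer decreasing) could be checked with a small helper such as the one below. The threshold, the patience window, and the minimum improvement `min_delta` are hypothetical parameters; the text gives no concrete values.

```python
def has_converged(losses, threshold, patience, min_delta=1e-4):
    """Report whether training has converged.

    True when the latest loss is below `threshold` and the last `patience`
    epochs improved on the best earlier loss by less than `min_delta`.
    """
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return losses[-1] < threshold and best_before - best_recent < min_delta


plateaued = has_converged([1.2, 0.6, 0.3, 0.3, 0.3, 0.3], threshold=0.5, patience=3)
still_falling = has_converged([1.2, 0.9, 0.6, 0.4, 0.3, 0.2], threshold=0.5, patience=3)
```

In a training loop, the per-epoch loss would be appended to `losses` and training would stop, and the model be saved, once the helper returns `True`.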

With the sensitive information detection method of this embodiment, tedious and highly repetitive manual detection is replaced by automatic detection through a graph neural network. This reduces cost while protecting user privacy, improves detection accuracy, and reduces the false alarm rate. Because multiple texts to be detected are added to one graph, multiple texts can be detected simultaneously, which greatly improves detection efficiency.

The foregoing has shown and described the basic principles and main features of the present invention and the advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. The sensitive information detection method based on the graph neural network is characterized by comprising the following steps of:

(1) Collecting a training set, wherein the training data is divided into two parts; the first part is non-sensitive information collected on code sharing platforms by crawlers and manually; the collected information is parsed using the BeautifulSoup library and filtered so that only text information remains, and its label is set to (0, 0, 0, 1)^T; the second part is sensitive information provided by the user, whose labels are set to (1, 0, 0, 0)^T, (0, 1, 0, 0)^T, and (0, 0, 1, 0)^T, representing high, medium, and low sensitivity levels, respectively;

(2) Constructing a graph structure according to the training set for training; in the graph structure, a bag-of-words model of each text in the training set is used as a node, and the Jaccard similarity between the texts is used as the weight of the edges between the nodes;

(3) Training a graph neural network, wherein the graph neural network is trained using the graph structure constructed from the training set; the graph neural network is trained on sampled subgraphs until the loss function no longer decreases;

(4) Collecting detection data, wherein data on information sharing platforms are collected by crawlers and manually, and the source and collection time of the collected information are recorded; after parsing with the BeautifulSoup library, only text information is retained; a graph structure for detection is constructed using the detection data together with the training data;

(5) Performing node classification on the graph structure of step (4) using the trained graph neural network;

(6) Judging whether the detected data has sensitive data or not, if no sensitive information exists, not performing any processing, and if any sensitive information of any sensitive level is found, recording the sensitive level of the sensitive information; generating a record of sensitive information based on the source and collection time of the data of step (4).

2. The sensitive information detection method based on the graph neural network according to claim 1, wherein the specific formula of the step (2) is as follows:

wherein D_i and D_j denote the word sets obtained by segmenting two documents, |D_i| denotes the number of words in word set D_i, |D_i ∩ D_j| denotes the number of words in the intersection of D_i and D_j, |D_i ∪ D_j| denotes the number of words in their union, |D_i - D_j| denotes the number of words in their difference set, and α is a hyperparameter that adjusts the size of the penalty introduced by differing document lengths.

3. The sensitive information detection method based on a graph neural network according to claim 1, wherein the sensitive information in step (1) is encrypted using homomorphic encryption technology in order to protect the privacy of the user's sensitive information.

4. The sensitive information detection method based on a graph neural network according to claim 1, wherein the model in the step (3) is trained by using a sampling subgraph; the updating of the nodes occurs in one sub-graph, not the entire graph; the subgraph is composed of neighbor nodes obtained by random sampling of all neighbors of the node.

5. The method for detecting sensitive information based on a graph neural network according to claim 1, wherein the loss function of the step (3) is CrossEntropy Loss; the specific formula is as follows: