CN116244738A - Sensitive information detection method based on graph neural network - Google Patents

Publication number
CN116244738A
Authority
CN
China
Prior art keywords
sensitive information
graph
data
neural network
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211743964.6A
Other languages
Chinese (zh)
Inventor
虞雁群
刘彦伸
吴艳
郭银锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Yu'an Information Technology Co ltd
Original Assignee
Zhejiang Yu'an Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/008 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6263 Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a sensitive information detection method based on a graph neural network. Text data collected from the network and from the user are used to construct a graph structure: the bag-of-words model of each text serves as a node of the graph, and the Jaccard similarity between texts serves as the weight of the edge between the corresponding nodes. In the training stage, the user designates sensitive information, which is homomorphically encrypted to protect the user's privacy, and the graph structure used for model training is built from the user-specified sensitive information and the information collected from the network. In the detection stage, data are obtained by scanning information sharing platforms and hacker websites and are then preprocessed. The collected data are added to the graph structure as nodes to obtain a new graph structure, on which the graph neural network performs detection. By constructing the relations among all documents through the graph structure, the invention reduces labor cost, and by homomorphically encrypting the user data it protects the privacy of the user's sensitive data.

Description

Sensitive information detection method based on graph neural network
Technical Field
The invention relates to the field of information security, specifically to sensitive information monitoring technology, and in particular to a sensitive information detection method based on a graph neural network.
Background
Enterprise sensitive data contains sensitive information about users. Once leaked, it causes huge economic losses to the enterprise and also harms the users. Ensuring the security of enterprise sensitive data has therefore become a key part of enterprise information protection work. The traditional method relies on manual comparison, which has high labor cost and poor flexibility and can expose sensitive information to the inspection personnel. With the development of deep learning, some researchers have used word-vector methods to detect sensitive information. However, such content-based detection ignores the relations between texts, which easily leads to a high false alarm rate. The invention constructs the relations between texts through a graph structure and homomorphically encrypts the data, so that computation is carried out while user privacy is protected. This addresses two problems of existing methods: leakage of sensitive information during detection, and the high false alarm rate caused by ignoring the relations between texts.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a sensitive information detection method based on a graph neural network. The relations among all documents are constructed through a graph structure, which addresses the high missed-detection and false alarm rates of traditional methods and reduces labor cost. In addition, homomorphic encryption of the user data protects the privacy of the user's sensitive data.
To achieve the above object, the invention adopts the following technical scheme: a sensitive information detection method based on a graph neural network, comprising the following steps:
1. Collect the training set. The training data is divided into two parts. The first part is non-sensitive information collected from code sharing platforms by crawlers and manually. The collected information is parsed with the BeautifulSoup library and filtered so that only text remains, and its label is set to (0,0,0,1)^T. The second part is sensitive information provided by the user, labeled (1,0,0,0)^T, (0,1,0,0)^T, or (0,0,1,0)^T to indicate high, medium, and low sensitivity levels, respectively.
2. Construct a graph structure from the training set for training. In the graph structure, the bag-of-words model of each text in the training set is a node, and the Jaccard similarity between texts is the weight of the edge between the corresponding nodes. The specific formula is as follows:
Figure BDA0004031588330000021
where D_i and D_j denote the sets of words obtained by segmenting two documents, |D_i| denotes the number of words in word set D_i, |D_i ∩ D_j| denotes the number of words in the intersection of D_i and D_j, |D_i ∪ D_j| denotes the number of words in their union, |D_i - D_j| denotes the number of words in their difference set, and α is a hyperparameter that adjusts the size of the penalty introduced by differing document lengths.
3. Train the graph neural network on the graph structure constructed from the training set. Training proceeds by sampling subgraphs and continues until the loss function no longer decreases.
4. Collect the detection data. Data on information sharing platforms are collected by crawlers and manually, and the source and collection time of each item are recorded. After parsing with the BeautifulSoup library, only the text is retained. The detection data and the training data are used together to construct a graph structure for detection.
5. Classify the nodes of the graph structure from step 4 using the trained graph neural network.
6. Judge whether the detection data contain sensitive data. If no sensitive information is present, no processing is performed; if sensitive information of any level is found, its sensitivity level is recorded, and a sensitive information record is generated from the source and collection time of the data recorded in step 4.
The specific recording structure is as follows:
Sensitive information number | Sensitive information level | Sensitive information source | Sensitive information collection time
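As a minimal illustration of steps 5 and 6 and the record structure above, the following Python sketch maps each node's predicted class scores to a sensitivity level and emits one record per sensitive document. The names `SensitiveRecord`, `LEVELS`, and `make_records` are hypothetical and do not appear in the original disclosure; the label order (high, medium, low, non-sensitive) follows the one-hot labels defined in step 1.

```python
from dataclasses import dataclass
from datetime import datetime

# Position of the 1 in the one-hot label -> sensitivity level, following the
# label order given in the text: high, medium, low, non-sensitive.
LEVELS = ("high", "medium", "low", "non-sensitive")


@dataclass
class SensitiveRecord:
    number: int             # sensitive information number
    level: str              # sensitive information level
    source: str             # sensitive information source
    collected_at: datetime  # sensitive information collection time


def make_records(predictions, sources, times):
    """Build one record per document predicted as sensitive.

    predictions: per-node class scores (length-4 sequences, assumed format)
    sources, times: collection metadata recorded in step 4
    """
    records = []
    for scores, source, when in zip(predictions, sources, times):
        best = max(range(len(scores)), key=scores.__getitem__)
        level = LEVELS[best]
        if level == "non-sensitive":
            continue  # no processing for non-sensitive documents
        records.append(SensitiveRecord(len(records) + 1, level, source, when))
    return records
```

Records are numbered in the order in which sensitive documents are encountered; any other numbering scheme would serve equally well.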
Preferably, to protect the privacy of the user's sensitive information, the sensitive information of step 1 is encrypted using homomorphic encryption.
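The disclosure does not name a specific homomorphic encryption scheme. Purely as an assumption for illustration, the following Python sketch uses a toy Paillier cryptosystem (which is additively homomorphic) with very small primes to show the kind of property being relied on: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, for example when adding two encrypted word counts. All function names and the key size are hypothetical, and the parameters are far too small for real use.

```python
import math
import random


def keygen(p=61, q=53):
    """Toy key generation; real deployments need large random primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid for generator g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Ciphertext of (m1 + m2) mod n, computed without decrypting."""
    return (c1 * c2) % (n * n)
```

For instance, encrypting 12 and 30, combining the ciphertexts with `add_encrypted`, and decrypting the result recovers 42 without either value being exposed to the party doing the addition.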
Preferably, the model of step 3 is trained using sampled subgraphs. Each node update takes place within one subgraph rather than the entire graph. The subgraph consists of neighbor nodes obtained by randomly sampling from all neighbors of the node.
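The neighbor sampling described above can be sketched as follows. This is a minimal Python illustration, assuming the graph is stored as a dictionary of weighted adjacency maps; `sample_subgraph` and `num_samples` are hypothetical names, and uniform sampling without replacement is an assumption, since the text only states that neighbors are sampled randomly.

```python
import random


def sample_subgraph(adjacency, node, num_samples, rng=random):
    """Return the node with a uniform random sample of its neighbors.

    adjacency: dict mapping node -> {neighbor: edge_weight}
    rng: the random module or a random.Random instance (for reproducibility)
    """
    neighbors = list(adjacency.get(node, {}))
    k = min(num_samples, len(neighbors))  # cannot sample more than exist
    sampled = rng.sample(neighbors, k)
    return {node: {nb: adjacency[node][nb] for nb in sampled}}
```

Because each update only touches `num_samples` neighbors, the cost per node is bounded regardless of how many documents the full graph contains.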
Preferably, the loss function of step 3 is the cross-entropy loss. The specific formula is as follows:
Figure BDA0004031588330000022
where n is the total number of nodes, y_i is the true label of node i, a_i is the predicted label of node i, and σ is the softmax activation function.
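The formula image is not reproduced in this text. One common form consistent with the definitions above is L = -(1/n) Σ_i y_i · log σ(a_i); the averaging over n and the treatment of a_i as raw scores passed through softmax are assumptions. A minimal Python sketch of that form, with hypothetical function names:

```python
import math


def log_softmax(scores):
    """Numerically stable log of the softmax of a score vector."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def cross_entropy(labels, outputs):
    """Mean softmax cross-entropy over all nodes.

    labels: one-hot vectors y_i; outputs: raw score vectors a_i.
    Dividing by the number of nodes is an assumption.
    """
    total = 0.0
    for y, a in zip(labels, outputs):
        log_probs = log_softmax(a)
        total -= sum(yk * lk for yk, lk in zip(y, log_probs))
    return total / len(labels)
```

With four classes, a node whose scores are all equal contributes a loss of log 4, and the loss approaches zero as the score of the true class dominates.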
The invention has the following beneficial effects:
1. Deep learning replaces manual detection, which greatly reduces labor cost.
2. Representing the whole text set as a graph structure captures the relations among texts, which reduces the false alarm rate and improves the detection performance of the model. When the test graph structure is constructed, multiple documents can be added at once, which greatly improves detection efficiency.
3. With homomorphic encryption, the privacy of user data is protected while the user can still specify sensitive information.
Drawings
The invention is described in detail below with reference to the drawings and the detailed description;
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the neural network training of the present invention;
FIG. 3 is an overall architecture diagram of the present invention.
Detailed Description
To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below in connection with specific embodiments.
Referring to FIGS. 1-3, the present embodiment adopts the following technical scheme: a sensitive information detection method based on a graph neural network, comprising the following steps:
Step 101: Information on code hosting platforms, network disks, libraries, hacker forums, and darknets is collected by crawlers and manual collection, and the source and collection time of each item are recorded.
Step 102: The crawled data is usually in HTML format; it is parsed with BeautifulSoup and filtered to obtain the text.
Step 103: A graph structure is built from the processed data, and the data are homomorphically encrypted. The bag-of-words model of each document is a node in the graph structure, and the weight of the edge between two nodes is defined by the Jaccard similarity between the corresponding documents. The specific formula is as follows:
Figure BDA0004031588330000031
Step 104: The graph is input into the graph neural network, which classifies the input graph nodes and outputs the predicted label of each node. The labels (1,0,0,0)^T, (0,1,0,0)^T, (0,0,1,0)^T, and (0,0,0,1)^T represent high-sensitivity, medium-sensitivity, low-sensitivity, and non-sensitive information, respectively. The input of the model is a graph: nodes and edges are added to the graph structure constructed during training to form the graph structure for testing. 20 nodes are added each time, so the model can check 20 documents for sensitive information simultaneously.
Step 105: Judge whether the input documents contain sensitive information. If not, no processing is performed; if sensitive information of any level is found, its sensitivity level is recorded.
Step 106: A sensitive information record is generated from the source and collection time of the sensitive data recorded in step 101. The specific record structure is as follows:
Sensitive information number | Sensitive information level | Sensitive information source | Sensitive information collection time
Step 107: and judging whether the detection needs to be continued, if the graph structure needs to be built again by using the data for detection, otherwise, exiting the program.
Example 1: As shown in FIG. 2, a graph structure is constructed from the user input and the data collected from the network, and the graph neural network is trained by sampling subgraphs until the model converges. The process comprises the following steps:
Step 401A: The user designates the sensitive information to be matched and defines its sensitivity level, from high to low, as (1,0,0,0)^T, (0,1,0,0)^T, or (0,0,1,0)^T using one-hot encoding.
Step 401B: Information on code hosting platforms, network disks, libraries, hacker forums, and darknets is collected. The collected information is ordinary non-sensitive information, and its label is set to (0,0,0,1)^T.
Steps 402A, 402B: The data are cleaned. Most data obtained by the crawler have an HTML structure; they are parsed with BeautifulSoup and filtered to obtain the text.
Step 403A: To protect the user's data privacy, the user data is homomorphically encrypted. Homomorphic encryption is a secure multi-party computation technique: decrypting the result of processing the ciphertext gives the same result as processing the plaintext, so computation can be carried out while the plaintext remains encrypted.
Step 404: The graph structure is constructed. The nodes of the graph are the bag-of-words models of the texts, and the weight of the edge between two nodes is defined by the Jaccard similarity between the corresponding documents. The specific formula is as follows:
Figure BDA0004031588330000051
Step 405: The graph neural network is trained on the constructed graph structure. To avoid excessive memory consumption during training, node features are updated using sampled subgraphs: each node update takes place within a subgraph of the graph structure rather than over the entire graph. This reduces the memory required for training and can improve the accuracy of the model on unseen nodes. The update formula is as follows:
Figure BDA0004031588330000052
Step 406: Training continues until the model converges, and the model is then saved for sensitive information detection. Convergence is defined as the loss being sufficiently low and no longer decreasing. The loss function is as follows:
Figure BDA0004031588330000053
where n is the total number of nodes, y_i is the true label of node i, and a_i is the predicted label of node i.
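The convergence criterion of step 406 (loss sufficiently low and no longer decreasing) can be checked with a small helper such as the following Python sketch. The function name and the parameters `threshold`, `patience`, and `min_delta` are hypothetical; the text does not give concrete values or a specific stopping rule.

```python
def has_converged(losses, threshold, patience=5, min_delta=1e-4):
    """Return True when training can stop.

    losses: loss value recorded after each epoch, oldest first
    threshold: the loss must be below this to count as "sufficiently low"
    patience: number of most recent epochs examined for improvement
    min_delta: smallest decrease that still counts as "falling"
    """
    if len(losses) <= patience:
        return False  # not enough history to judge
    recent_best = min(losses[-patience:])
    earlier_best = min(losses[:-patience])
    low_enough = losses[-1] < threshold
    stalled = earlier_best - recent_best < min_delta
    return low_enough and stalled
```

A training loop would append the epoch loss to `losses` and save the model once `has_converged` returns True.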
With the sensitive information detection method of this embodiment, tedious and highly repetitive manual detection is replaced by automatic detection through the graph neural network, which reduces cost while protecting user privacy, improves detection accuracy, and reduces the false alarm rate. Because multiple texts to be checked are added to a single graph, they can be detected simultaneously, which greatly improves detection efficiency.
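Adding a batch of documents to an existing graph can be sketched in Python as follows. This is an illustration under stated assumptions: it uses the standard Jaccard similarity |A ∩ B| / |A ∪ B| and does not reproduce the length penalty with hyperparameter α from the method's own formula, which is not available in this text; it splits on whitespace instead of using a word segmenter; and `jaccard`, `add_documents`, and `min_weight` are hypothetical names.

```python
def jaccard(a, b):
    """Standard Jaccard similarity of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_documents(graph, word_sets, new_docs, min_weight=0.0):
    """Add new documents as nodes, with Jaccard-weighted edges.

    graph: dict node_id -> {neighbor_id: weight}, undirected
    word_sets: dict node_id -> set of words for nodes already in the graph
    new_docs: dict node_id -> text of the documents to be checked
    """
    for doc_id, text in new_docs.items():
        words = set(text.lower().split())  # whitespace tokenizer (assumption)
        graph.setdefault(doc_id, {})
        for other_id, other_words in word_sets.items():
            weight = jaccard(words, other_words)
            if weight > min_weight:
                graph[doc_id][other_id] = weight
                graph.setdefault(other_id, {})[doc_id] = weight
        word_sets[doc_id] = words
    return graph
```

Passing a dictionary of 20 documents as `new_docs` corresponds to the batch size mentioned in step 104; the trained network would then classify the newly added nodes.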
The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which merely illustrate its principles, and that various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. A sensitive information detection method based on a graph neural network, characterized by comprising the following steps:
(1) Collecting a training set, wherein the training data is divided into two parts; the first part is non-sensitive information collected from code sharing platforms by crawlers and manually; the collected information is parsed with the BeautifulSoup library and filtered so that only text remains, and its label is set to (0,0,0,1)^T; the second part is sensitive information provided by the user, labeled (1,0,0,0)^T, (0,1,0,0)^T, or (0,0,1,0)^T to represent high, medium, and low sensitivity levels, respectively;
(2) Constructing a graph structure from the training set for training; in the graph structure, the bag-of-words model of each text in the training set is a node, and the Jaccard similarity between texts is the weight of the edge between the corresponding nodes;
(3) Training a graph neural network on the graph structure constructed from the training set; the graph neural network is trained by sampling subgraphs until the loss function no longer decreases;
(4) Collecting detection data, wherein data on information sharing platforms are collected by crawlers and manually, and the source and collection time of the collected information are recorded; after parsing with the BeautifulSoup library, only the text is retained; a graph structure for detection is constructed using the detection data together with the training data;
(5) Performing node classification on the graph structure of step (4) using the trained graph neural network;
(6) Judging whether the detection data contain sensitive data; if no sensitive information is present, no processing is performed; if sensitive information of any level is found, its sensitivity level is recorded, and a sensitive information record is generated from the source and collection time of the data of step (4).
2. The sensitive information detection method based on the graph neural network according to claim 1, wherein the specific formula of the step (2) is as follows:
Figure FDA0004031588320000011
where D_i and D_j denote the sets of words obtained by segmenting two documents, |D_i| denotes the number of words in word set D_i, |D_i ∩ D_j| denotes the number of words in the intersection of D_i and D_j, |D_i ∪ D_j| denotes the number of words in their union, |D_i - D_j| denotes the number of words in their difference set, and α is a hyperparameter that adjusts the size of the penalty introduced by differing document lengths.
3. The sensitive information detection method based on a graph neural network according to claim 1, wherein the sensitive information in step (1) is encrypted using homomorphic encryption in order to protect the privacy of the user's sensitive information.
4. The sensitive information detection method based on a graph neural network according to claim 1, wherein the model in the step (3) is trained by using a sampling subgraph; the updating of the nodes occurs in one sub-graph, not the entire graph; the subgraph is composed of neighbor nodes obtained by random sampling of all neighbors of the node.
5. The sensitive information detection method based on a graph neural network according to claim 1, wherein the loss function of step (3) is the cross-entropy loss; the specific formula is as follows:
Figure FDA0004031588320000021
where n is the total number of nodes, y_i is the true label of node i, a_i is the predicted label of node i, and σ is the softmax activation function.
CN202211743964.6A 2022-12-30 2022-12-30 Sensitive information detection method based on graph neural network Pending CN116244738A (en)

Publications (1)

Publication Number Publication Date
CN116244738A true CN116244738A (en) 2023-06-09

Family

ID=86623445

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination