CN116244738A - Sensitive information detection method based on graph neural network - Google Patents
Sensitive information detection method based on graph neural network
- Publication number
- CN116244738A (application number CN202211743964.6A)
- Authority
- CN
- China
- Prior art keywords
- sensitive information
- graph
- data
- neural network
- sensitive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/008—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6263—Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a sensitive information detection method based on a graph neural network. Text data collected from the network and from the user are used to construct a graph structure: the bag-of-words model of each text serves as a node of the graph, and the Jaccard similarity between texts serves as the weight of the edge between the corresponding nodes. In the training stage, the user designates sensitive information, which is homomorphically encrypted to protect the user's privacy; the graph structure used for model training is built from this user-specified sensitive information together with information collected from the network. In the detection stage, data are obtained by scanning information sharing platforms and hacker websites and are then preprocessed. The collected data are added to the graph structure as nodes to obtain a new graph structure, on which the graph neural network performs detection. Because the relations among all documents are captured by the graph structure, labor cost is reduced. At the same time, homomorphic encryption of the user data protects the privacy of the user's sensitive data.
Description
Technical Field
The invention belongs to the field of information security and relates to sensitive information monitoring technology, in particular to a sensitive information detection method based on a graph neural network.
Background
Enterprise sensitive data contains users' sensitive information. Once leaked, it causes heavy economic loss to the enterprise and also harms the users. Ensuring the security of enterprise sensitive data has therefore become a key part of enterprise information protection. The traditional approach relies on manual comparison, which has high labor cost and poor flexibility, and which can itself expose sensitive information to the inspectors. With the development of deep learning, some researchers have used word-vector methods to detect sensitive information. However, such content-based detection ignores the relations between texts, which easily leads to a high false alarm rate. The invention models the relations between texts through a graph structure and homomorphically encrypts the data, so that computation is carried out while user privacy is protected. This addresses two problems of existing methods: leakage of sensitive information during detection, and the excessive false alarm rate caused by the lack of relations between texts.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a sensitive information detection method based on a graph neural network. The relations among all documents are constructed through a graph structure, which addresses the high missed-detection and false alarm rates of traditional methods and reduces labor cost. At the same time, homomorphic encryption of the user data protects the privacy of the user's sensitive data.
In order to achieve the above object, the present invention is realized by the following technical scheme: a sensitive information detection method based on a graph neural network comprises the following steps:
1. Collect the training set. The training data are divided into two parts. The first part is non-sensitive information collected from code sharing platforms by crawlers and manually. The collected information is parsed with the BeautifulSoup library and filtered so that only text information remains, and its label is set to $(0,0,0,1)^T$. The second part is sensitive information provided by the user, whose label is set to $(1,0,0,0)^T$, $(0,1,0,0)^T$ or $(0,0,1,0)^T$, indicating high, medium and low sensitivity levels, respectively.
2. Construct a graph structure from the training set for training. In the graph structure, the bag-of-words model of each text in the training set is a node, and the Jaccard similarity between texts is the weight of the edge between the corresponding nodes. The specific formula is as follows:
where $D_i$ and $D_j$ denote the word sets obtained after word segmentation of two documents, $|D_i|$ denotes the number of words in word set $D_i$, $|D_i \cap D_j|$ the number of words in the intersection of $D_i$ and $D_j$, $|D_i \cup D_j|$ the number of words in their union, $|D_i - D_j|$ the number of words in their difference set, and $\alpha$ is a hyperparameter that adjusts the size of the penalty introduced by differing document lengths.
3. Train the graph neural network using the graph structure constructed from the training set. Training proceeds by sampling subgraphs until the loss function no longer decreases.
4. Collect the detection data. Data on information sharing platforms are collected by crawlers and manually, and the source and collection time of the collected information are recorded. After parsing with the BeautifulSoup library, only text information is retained. The detection data are used together with the training data to construct a graph structure for detection.
5. Classify the nodes of the graph structure from step 4 using the trained graph neural network.
6. Judge whether the detected data contain sensitive data. If there is no sensitive information, no processing is performed; if sensitive information of any sensitivity level is found, its sensitivity level is recorded. A sensitive information record is then generated from the source and collection time of the data from step 4.
The specific record structure is as follows:
Sensitive information number | Sensitive information level | Sensitive information source | Sensitive information collection time |
---|---|---|---|
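Steps 1 and 2 above (bag-of-words nodes and Jaccard edge weights) can be sketched as follows. This is a minimal illustration rather than the patented formula: it uses plain Jaccard similarity and omits the $\alpha$-weighted length penalty, whose exact form is not reproduced in this text. The helper names `tokenize`, `jaccard` and `build_graph`, the whitespace tokenizer and the sample documents are all assumptions made for the example.

```python
from itertools import combinations


def tokenize(text):
    """Split text into a set of lowercase words (stand-in for real word segmentation)."""
    return set(text.lower().split())


def jaccard(d_i, d_j):
    """Plain Jaccard similarity |Di n Dj| / |Di u Dj| of two word sets."""
    union = d_i | d_j
    if not union:
        return 0.0
    return len(d_i & d_j) / len(union)


def build_graph(documents, min_weight=0.0):
    """Return (nodes, edges): nodes are word sets, edges map (i, j) -> weight.

    Only pairs whose similarity exceeds min_weight get an edge
    (the threshold is an assumption, not part of the source).
    """
    nodes = [tokenize(doc) for doc in documents]
    edges = {}
    for i, j in combinations(range(len(nodes)), 2):
        weight = jaccard(nodes[i], nodes[j])
        if weight > min_weight:
            edges[(i, j)] = weight
    return nodes, edges


# Hypothetical sample documents.
docs = [
    "db password admin secret",
    "db password root secret",
    "hello world example",
]
nodes, edges = build_graph(docs)
```

The first two documents share three of five distinct words, so they are linked by an edge; the third shares no words with either and stays isolated.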
Preferably, to protect the privacy of the user's sensitive information, the sensitive information in step 1 is encrypted using homomorphic encryption.
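The source does not name a particular homomorphic encryption scheme. Purely as an illustration of the property being relied on (computing on ciphertexts without decrypting them), the sketch below implements a toy version of the Paillier cryptosystem, which is additively homomorphic. The choice of Paillier, the tiny primes and all function names are assumptions; the key size here is far too small for real use.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt integer m (0 <= m < n) under public key n with generator g = n + 1."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c using the private key (lam, mu)."""
    lam, mu = priv
    n2 = n * n
    l_value = (pow(c, lam, n2) - 1) // n
    return (l_value * mu) % n


def add_encrypted(n, c1, c2):
    """Multiply ciphertexts, which adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


# Hypothetical use: add two encrypted word counts without decrypting either one.
n, priv = keygen()
c_total = add_encrypted(n, encrypt(n, 3), encrypt(n, 4))
total = decrypt(n, priv, c_total)
```

Here `total` equals the sum of the two plaintext counts even though the addition was performed on ciphertexts only.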
Preferably, the model in step 3 is trained by sampling subgraphs. Each node update takes place within a subgraph rather than over the entire graph. The subgraph consists of neighbor nodes obtained by randomly sampling from all neighbors of the node.
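The sampled-subgraph update can be sketched as below. The sampling step follows the description above (random sampling from all neighbors of a node). The aggregation step is an assumption: the source's update formula is not reproduced in this text, so a simple unweighted mean over the node and its sampled neighbors is used in its place. The names `sample_subgraph` and `mean_aggregate` and the toy graph are hypothetical.

```python
import random


def sample_subgraph(adjacency, node, num_samples, rng=random):
    """Randomly sample up to num_samples neighbours of node."""
    neighbours = list(adjacency.get(node, {}))
    if len(neighbours) <= num_samples:
        return neighbours
    return rng.sample(neighbours, num_samples)


def mean_aggregate(features, adjacency, node, num_samples, rng=random):
    """Update a node feature as the mean of its own and its sampled neighbours' features.

    The mean aggregator is an assumed stand-in for the source's update formula.
    """
    sampled = sample_subgraph(adjacency, node, num_samples, rng)
    vectors = [features[node]] + [features[v] for v in sampled]
    dim = len(features[node])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


# Toy graph: adjacency maps node -> {neighbour: edge weight}.
adjacency = {
    0: {1: 0.6, 2: 0.2, 3: 0.1},
    1: {0: 0.6},
    2: {0: 0.2},
    3: {0: 0.1},
}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.0, 0.0]}

rng = random.Random(0)
sampled = sample_subgraph(adjacency, 0, 2, rng)      # two of node 0's three neighbours
updated = mean_aggregate(features, adjacency, 1, 2, rng)  # node 1 has a single neighbour
```

Because each update touches only a bounded number of neighbors, the memory needed per update does not grow with the size of the whole graph.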
Preferably, the loss function in step 3 is the cross-entropy loss (CrossEntropy Loss). The specific formula is as follows:
where $n$ is the total number of nodes, $y_i$ is the true label of node $i$, $a_i$ is the predicted label of node $i$, and $\sigma$ is the softmax activation function.
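A common form of cross-entropy consistent with these definitions can be sketched as follows. It assumes that $a_i$ is the raw network output for node $i$, that $\sigma$ is applied to it before taking the logarithm, and that the loss is averaged over the $n$ nodes; the averaging and the function names are assumptions, since the source's formula is not reproduced in this text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one score vector."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(labels, outputs):
    """Mean cross-entropy over n nodes.

    labels:  one-hot vectors y_i
    outputs: raw score vectors a_i (softmax is applied here)
    """
    n = len(labels)
    loss = 0.0
    for y_i, a_i in zip(labels, outputs):
        probs = softmax(a_i)
        loss -= sum(y * math.log(p) for y, p in zip(y_i, probs))
    return loss / n


# One node labelled "high sensitivity" with uniform scores over the four classes.
uniform_loss = cross_entropy([[1, 0, 0, 0]], [[0.0, 0.0, 0.0, 0.0]])
```

With uniform scores over four classes, each class receives probability 1/4, so the loss equals $\ln 4$.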
The invention has the following beneficial effects:
1. Deep learning replaces manual detection, which greatly reduces labor cost.
2. The graph structure represents the whole text set, so that relations among the texts can be constructed, which reduces the false alarm rate and improves the detection performance of the model. When the graph structure for testing is constructed, multiple documents can be added at the same time, which greatly improves detection efficiency.
3. With homomorphic encryption, the privacy of user data is protected while the user can still specify sensitive information.
Drawings
The invention is described in detail below with reference to the drawings and the detailed description;
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the neural network training of the present invention;
FIG. 3 is an overall architecture diagram of the present invention.
Detailed Description
To make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further described below in connection with specific embodiments.
Referring to fig. 1-3, the present embodiment adopts the following technical scheme: a sensitive information detection method based on a graph neural network comprises the following steps:
Step 101: Information on code hosting platforms, network disks, document libraries, hacker forums and the dark web is collected by crawlers and manually, and the source and collection time of the collected information are recorded.
Step 102: The crawled data are usually in HTML format; they are parsed with BeautifulSoup and filtered to obtain the text information.
Step 103: A graph structure is built from the processed data, and the data are homomorphically encrypted. The bag-of-words model of each document is a node in the graph structure, and the weights of the edges between nodes are defined by the Jaccard similarity between documents. The specific formula is as follows:
Step 104: The graph is input into the graph neural network, which classifies the nodes of the input graph and outputs the predicted label of each node. The labels $(1,0,0,0)^T$, $(0,1,0,0)^T$, $(0,0,1,0)^T$ and $(0,0,0,1)^T$ represent high-sensitivity, medium-sensitivity, low-sensitivity and non-sensitive information, respectively. The input of the model is a graph: nodes and edges are added to the graph structure constructed during training to form the graph structure for testing. 20 nodes are added each time, so that the model can check 20 documents for sensitive information simultaneously.
Step 105: Judge whether the input documents contain sensitive information. If not, no processing is performed; if sensitive information of any sensitivity level is found, its sensitivity level is recorded.
Step 106: A sensitive information record is generated from the source and collection time of the sensitive data obtained in step 101. The specific record structure is as follows:
Sensitive information number | Sensitive information level | Sensitive information source | Sensitive information collection time |
---|---|---|---|
Step 107: Judge whether detection needs to continue. If so, the graph structure is rebuilt with the data to be detected; otherwise, the program exits.
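Steps 105 and 106 can be sketched as follows: each predicted score vector is mapped to a sensitivity level by its largest entry, non-sensitive documents are skipped, and a record with the four fields of the record structure is produced for the rest. The class name `SensitiveRecord`, the function `make_records`, the sequential numbering scheme and the sample sources are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

# Order matches the one-hot labels: high, medium, low, non-sensitive.
LEVELS = ("high", "medium", "low", "non-sensitive")


@dataclass
class SensitiveRecord:
    number: int            # sensitive information number
    level: str             # sensitive information level
    source: str            # sensitive information source
    collected_at: datetime  # sensitive information collection time


def make_records(predictions, metadata):
    """Build records for documents predicted as sensitive.

    predictions: one score vector per document (same order as LEVELS)
    metadata:    (source, collected_at) pairs recorded at collection time
    """
    records = []
    for scores, (source, collected_at) in zip(predictions, metadata):
        level = LEVELS[max(range(len(scores)), key=scores.__getitem__)]
        if level == "non-sensitive":
            continue  # Step 105: no processing for non-sensitive documents
        records.append(SensitiveRecord(len(records) + 1, level, source, collected_at))
    return records


# Hypothetical predictions and collection metadata for two documents.
predictions = [[0.7, 0.1, 0.1, 0.1], [0.0, 0.1, 0.1, 0.8]]
metadata = [
    ("example-forum", datetime(2023, 1, 1, 12, 0)),
    ("example-code-host", datetime(2023, 1, 2, 9, 30)),
]
records = make_records(predictions, metadata)
```

Only the first document is predicted as sensitive, so a single record is produced for it.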
Example 1: As shown in FIG. 2, a graph structure is constructed from the user input and the data collected from the network, and the graph neural network is trained by sampling subgraphs until the model converges. The process comprises the following steps:
Step 404: Construct a graph structure. The nodes of the graph are the bag-of-words models of the texts, and the weights of the edges between nodes are defined by the Jaccard similarity between documents. The specific formula is as follows:
Step 405: Train the graph neural network on the constructed graph structure. To avoid excessive memory consumption during training, node features are updated using sampled subgraphs: each update of a node takes place within a subgraph of the graph structure rather than over the entire graph. This reduces the memory required for training and can also improve the accuracy of the model on unseen nodes. The update formula is as follows:
Step 406: Train until the model converges, then save the model for detecting sensitive information. Model convergence is defined as the loss function being sufficiently low and no longer decreasing. The specific formula of the loss function is as follows:
where $n$ is the total number of nodes, $y_i$ is the true label of node $i$, and $a_i$ is the predicted label of node $i$.
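The convergence criterion of step 406 (loss sufficiently low and no longer decreasing) can be expressed as a simple check on the loss history. The threshold `max_loss`, the window `patience` and the tolerance `min_delta` are hypothetical parameters; the source does not give concrete values.

```python
def has_converged(losses, max_loss=0.1, patience=3, min_delta=1e-4):
    """Return True when the latest loss is low enough and has stopped improving.

    losses:    per-epoch loss values, oldest first
    max_loss:  "sufficiently low" threshold (assumed value)
    patience:  number of recent epochs that must show no improvement
    min_delta: smallest decrease that still counts as an improvement
    """
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    no_improvement = best_before - recent_best < min_delta
    return losses[-1] <= max_loss and no_improvement


flat_history = [1.0, 0.5, 0.05, 0.05, 0.05, 0.05]
falling_history = [1.0, 0.8, 0.6, 0.4]
converged = has_converged(flat_history)
still_training = not has_converged(falling_history)
```

In a training loop, the check would be called after each epoch, and the model saved once it first returns `True`.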
With the sensitive information detection method of this embodiment, tedious and highly repetitive manual detection is replaced by automatic detection through a graph neural network. This reduces cost while protecting user privacy, improves detection accuracy, and reduces the false alarm rate. Because multiple texts to be detected are added to one graph, they can be detected simultaneously, which greatly improves detection efficiency.
The foregoing has shown and described the basic principles, main features and advantages of the present invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.
Claims (5)
1. A sensitive information detection method based on a graph neural network, characterized by comprising the following steps:
(1) Collecting a training set, wherein the training data are divided into two parts; the first part is non-sensitive information collected from code sharing platforms by crawlers and manually; the collected information is parsed with the BeautifulSoup library and filtered so that only text information remains, and its label is set to $(0,0,0,1)^T$; the second part is sensitive information provided by the user, whose label is set to $(1,0,0,0)^T$, $(0,1,0,0)^T$ or $(0,0,1,0)^T$, representing high, medium and low sensitivity levels, respectively;
(2) Constructing a graph structure from the training set for training; in the graph structure, the bag-of-words model of each text in the training set is a node, and the Jaccard similarity between texts is the weight of the edge between the corresponding nodes;
(3) Training a graph neural network using the graph structure constructed from the training set; the graph neural network is trained by sampling subgraphs until the loss function no longer decreases;
(4) Collecting detection data, wherein data on information sharing platforms are collected by crawlers and manually, and the source and collection time of the collected information are recorded; after parsing with the BeautifulSoup library, only text information is retained; the detection data are used together with the training data to construct a graph structure for detection;
(5) Classifying the nodes of the graph structure of step (4) using the trained graph neural network;
(6) Judging whether the detected data contain sensitive data; if there is no sensitive information, no processing is performed; if sensitive information of any sensitivity level is found, its sensitivity level is recorded; and generating a sensitive information record from the source and collection time of the data of step (4).
2. The sensitive information detection method based on a graph neural network according to claim 1, wherein the specific formula of step (2) is as follows:
where $D_i$ and $D_j$ denote the word sets obtained after word segmentation of two documents, $|D_i|$ denotes the number of words in word set $D_i$, $|D_i \cap D_j|$ the number of words in the intersection of $D_i$ and $D_j$, $|D_i \cup D_j|$ the number of words in their union, $|D_i - D_j|$ the number of words in their difference set, and $\alpha$ is a hyperparameter that adjusts the size of the penalty introduced by differing document lengths.
3. The sensitive information detection method based on a graph neural network according to claim 1, wherein, to protect the privacy of the user's sensitive information, the sensitive information in step (1) is encrypted using homomorphic encryption.
4. The sensitive information detection method based on a graph neural network according to claim 1, wherein the model in step (3) is trained by sampling subgraphs; each node update takes place within a subgraph rather than over the entire graph; the subgraph consists of neighbor nodes obtained by randomly sampling from all neighbors of the node.
5. The sensitive information detection method based on a graph neural network according to claim 1, wherein the loss function of step (3) is the cross-entropy loss (CrossEntropy Loss); the specific formula is as follows:
where $n$ is the total number of nodes, $y_i$ is the true label of node $i$, $a_i$ is the predicted label of node $i$, and $\sigma$ is the softmax activation function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211743964.6A CN116244738A (en) | 2022-12-30 | 2022-12-30 | Sensitive information detection method based on graph neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116244738A (en) | 2023-06-09 |
Family
ID=86623445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211743964.6A Pending CN116244738A (en) | 2022-12-30 | 2022-12-30 | Sensitive information detection method based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116244738A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101359372A (en) * | 2008-09-26 | 2009-02-04 | 腾讯科技(深圳)有限公司 | Training method and device of classifier, and method apparatus for recognising sensitization picture |
CN107526785A (en) * | 2017-07-31 | 2017-12-29 | 广州市香港科大霍英东研究院 | File classification method and device |
US20200236120A1 (en) * | 2019-01-17 | 2020-07-23 | International Business Machines Corporation | Detecting and mitigating risk in a transport network |
CN113254620A (en) * | 2021-06-21 | 2021-08-13 | 中国平安人寿保险股份有限公司 | Response method, device and equipment based on graph neural network and storage medium |
CN113378859A (en) * | 2021-06-29 | 2021-09-10 | 中国科学技术大学 | Interpretable image privacy detection method |
CN113946682A (en) * | 2021-12-21 | 2022-01-18 | 北京大学 | Sensitive text detection method and system based on adaptive graph neural network |
US20220335340A1 (en) * | 2021-09-24 | 2022-10-20 | Intel Corporation | Systems, apparatus, articles of manufacture, and methods for data usage monitoring to identify and mitigate ethical divergence |
CN115309931A (en) * | 2022-08-10 | 2022-11-08 | 齐鲁工业大学 | Paper text classification method and system based on graph neural network |
Non-Patent Citations (4)
Title |
---|
TEJ BAHADUR CHANDRA et al.: "Disease Localization and Severity Assessment in Chest X-Ray Images using Multi-Stage Superpixels Classification", 《ELSEVIER SCIENCE》, 9 June 2022 (2022-06-09), pages 1 - 24 * |
孙龙; 李彦: "基于MapReduce并行计算提取文档特征Textrank算法研究", 现代信息科技, no. 10, 18 October 2018 (2018-10-18) * |
李艳秋: "基于集成学习的人脸识别研究", 《万方数据论文库》, 3 December 2018 (2018-12-03), pages 1 - 139 * |
谭建豪; 章兢; 李伟雄: "密度分布函数在聚类算法中的应用", 控制理论与应用, no. 12, 15 December 2011 (2011-12-15) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||