CN113672915A - Machine learning-based data leakage prevention system - Google Patents

Machine learning-based data leakage prevention system

Info

Publication number
CN113672915A
CN113672915A (application CN202111221497.6A)
Authority
CN
China
Prior art keywords
semantic
machine learning
algorithm
similarity
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111221497.6A
Other languages
Chinese (zh)
Inventor
韩旭东
王鹤
吴明
蒋荣
邹建宇
郑海树
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongfu Information Technology Co Ltd
Original Assignee
Nanjing Zhongfu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongfu Information Technology Co Ltd filed Critical Nanjing Zhongfu Information Technology Co Ltd
Priority to CN202111221497.6A
Publication of CN113672915A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/556 Detecting local intrusion or implementing counter-measures involving covert channels, i.e. data leakage between processes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Abstract

The invention discloses a machine learning-based data leakage prevention system in the technical field of network data leakage prevention, comprising a preprocessing module; a semantic secret-related similarity calculation module; and a secret-relatedness judgment module. The invention combines semantic analysis, sememe fuzzy matching, and pattern recognition algorithms with confidential document detection, replacing the keyword-hit method of traditional confidential document detection, and achieves confidential document detection with a wider scope and more detection methods.

Description

Machine learning-based data leakage prevention system
Technical Field
The invention relates to the technical field of network data leakage prevention systems, and in particular to a machine learning-based data leakage prevention system.
Background
The operation process of the existing data leakage prevention system (see the framework diagram in FIG. 1) is as follows: a user's client sends a request to a target server through a proxy server; the proxy server caches the data transmitted over the network to a local file and passes it to a file detection service for policy matching; the data is then handled according to whether a policy is hit and the configured post-hit behavior (block, alarm, release). The file detection service is the core of the data leakage detection method. It adopts keyword regular-expression matching: according to the keywords set by the administrator at the management end, it performs protocol restoration and exact keyword matching on the files transmitted to the local machine, and if a keyword is hit it raises an alarm and reports the leakage behavior.
The data detection technology used by the existing data leakage prevention system follows an exact-matching approach: whether a file is leaked is decided by keyword-hit matching. The administrator issues a policy of relevant keywords at the management end; when those keywords appear in a confidential file, an alarm is raised and the data is blocked. The file detection service module of the existing system therefore only detects leakage when the relevant keywords appear explicitly in the confidential file; if the confidential file rephrases its sentences semantically so that the keywords in the policy are never mentioned, no alarm is raised and a data leakage risk appears.
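For illustration only (the patent contains no code), the prior-art keyword-hit policy amounts to little more than exact regular-expression matching over the recovered file text; the policy structure and all names below are hypothetical:

```python
import re

def keyword_policy_hit(text: str, keywords: list[str]) -> list[str]:
    """Return the policy keywords that appear verbatim in the text (prior-art exact matching)."""
    hits = []
    for kw in keywords:
        # Exact keyword matching: a rephrased sentence that omits the keyword
        # is never detected, which is the weakness the invention targets.
        if re.search(re.escape(kw), text):
            hits.append(kw)
    return hits

# Hypothetical usage: alarm and block when any policy keyword is hit.
policy = ["机密", "内部文件", "confidential"]
if keyword_policy_hit("This internal memo discusses the confidential merger.", policy):
    print("alarm: leakage behavior reported, transfer blocked")
```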
Based on the above, the invention designs a machine learning-based data leakage prevention system to solve the above problems.
Disclosure of Invention
The invention aims to provide a machine learning-based data leakage prevention system to solve the problems raised in the background art.
To achieve this purpose, the invention provides the following technical scheme: a machine learning-based data leakage prevention system comprising
a preprocessing module, which preprocesses sentences, performs word segmentation and vectorization on them with the word2vec algorithm, and splits them into short word-vector clauses;
a semantic secret-related similarity calculation module, which judges and calculates the secret-related similarity of the semantics by combining a fuzzy matching algorithm with a K-nearest neighbor algorithm;
and a secret-relatedness judgment module, which compares the secret-related similarity calculated by the semantic secret-related similarity calculation module with a specified threshold, and alarms on and blocks files whose calculated secret-related similarity exceeds that threshold.
As a further scheme of the present invention, the method by which the semantic secret-related similarity calculation module judges and calculates the secret-related similarity is as follows:
S2-1: construct a word vector space for the sentence, and construct a vector space for the sememe set;
S2-2: match the word vector space from S2-1 using a neural network algorithm, converting it into the corresponding sememe sequence;
S2-3: combining a non-parametric machine learning method, construct a sememe fuzzy matching set from the sememe sequence obtained in S2-2, and apply the fuzzy matching algorithm to the converted sememe sequence to perform secrecy detection and judge the degree of confidentiality.
As a further scheme of the invention, the mapping from word vectors to sememes in S2-2 is realized with a neural network algorithm. The detection module builds a training corpus with reference to the People's Daily annotated corpus and combines it with the HowNet knowledge base: the words in the training corpus serve as input and the HowNet sememe annotations serve as labels for training the neural network, which avoids the storage and traversal problems of a large-scale knowledge base in traditional methods. The neural network maps an input word vector to a sememe vector in the sememe space, completing the conversion from word sequence to sememe sequence. After the text to be detected is input, its word sequence is preprocessed, fed into the neural network, and converted into a sememe sequence, completing the sememe mapping work.
As a further scheme of the present invention, S2-2 integrates multiple long short-term memory (LSTM) network models using the Bagging algorithm, thereby expanding the prediction range of the models, improving their prediction capability, and obtaining more predicted sememes that satisfy the context information.
As a further scheme of the present invention, the LSTM network model integrated by Bagging in S2-2 is specifically a double-layer LSTM network model that shares hidden state information.
As a further scheme of the present invention, S2-3 measures the similarity of two sememe vectors with the Euclidean distance and performs fuzzy matching on the predicted sememes with the K-nearest neighbor algorithm.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention combines semantic analysis, sememe fuzzy matching, and pattern recognition algorithms with confidential document detection, replacing the keyword-hit method of traditional confidential document detection and achieving confidential document detection with a wider scope and more detection methods;
2. the invention combines a training corpus with the HowNet knowledge base, uses the words in the training corpus as input and the HowNet sememe annotations as labels to train the neural network, and samples the training corpus multiple times to obtain sub-datasets with different distributions;
3. the invention trains on these sub-datasets to obtain differentiated double-layer LSTM networks; the different LSTM networks are integrated through the Bagging algorithm to obtain a deep neural network ensemble model capable of making time-series predictions on the sememe matching sequence.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a block diagram of a prior-art data leakage prevention system;
FIG. 2 is a flow chart of the method by which the semantic secret-related similarity calculation module judges and calculates the secret-related similarity according to the present invention;
FIG. 3 is a flow chart of the machine learning-based data leakage prevention system according to the present invention;
FIG. 4 is the double-layer long short-term memory network model sharing hidden state information in the present invention;
FIG. 5 is a block diagram of the implementation flow of the machine learning file detection system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-5, the present invention provides the following technical solution: the machine learning-based data leakage prevention system comprises a preprocessing module, which preprocesses sentences, performs word segmentation and vectorization on them with the word2vec algorithm, and splits them into short word-vector clauses;
a semantic secret-related similarity calculation module, which judges and calculates the secret-related similarity of the semantics by combining a fuzzy matching algorithm with a K-nearest neighbor algorithm;
and a secret-relatedness judgment module, which compares the secret-related similarity calculated by the semantic secret-related similarity calculation module with a specified threshold, and alarms on and blocks files whose calculated secret-related similarity exceeds that threshold.
Aiming at the defect that the existing system needs explicit keywords to judge whether a file is leaked, the novel machine learning-based data leakage prevention system provides a semantic leakage judgment method based on machine learning, neural network, and fuzzy matching technologies. By training a neural network model, confidential data files are judged according to their semantics: even if a file contains no explicit keywords, the system can recognize the semantics of the file, and as long as the semantics of the file's sentences match the defined meanings of the leakage keywords, the file is alarmed on and blocked.
The machine learning file detection system, the key component of the invention, is implemented by the following steps (a minimal end-to-end sketch is given after the list):
preprocess the sentence: perform word segmentation and vectorization with the word2vec algorithm, splitting the sentence into short word-vector clauses;
construct a word vector space for the sentence and a vector space for the sememe set; match the word vector space to the sememe-set vector space with a neural network, integrate multiple LSTM network models with the Bagging algorithm, and use the ensemble model to expand the range of sememe judgment;
construct a fuzzy matching set of sememes by combining the K-nearest neighbor algorithm, improving the precision of sememe judgment;
compare the calculated secret-related similarity with a specified threshold; files exceeding the threshold are alarmed on and blocked.
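A minimal end-to-end sketch of the four steps above follows; the callables, module names, and similarity scale are hypothetical placeholders, since the patent describes the flow but gives no code:

```python
import numpy as np

def detect_leak(sentence: str, preprocess, word_to_sememe_ensemble,
                knn_fuzzy_match, secret_sememe_space: np.ndarray,
                threshold: float) -> bool:
    """Sketch of the detection flow: preprocess -> sememe mapping -> fuzzy match -> threshold."""
    # Step 1: word2vec-based preprocessing into a sequence of word vectors.
    word_vectors = preprocess(sentence)
    # Step 2: the Bagging ensemble of LSTM sub-models maps word vectors to a sememe sequence.
    sememe_vectors = word_to_sememe_ensemble(word_vectors)
    # Step 3: KNN fuzzy matching against the secret-related sememe space,
    # yielding a secret-related similarity score (assumed here to lie in [0, 1]).
    similarity = knn_fuzzy_match(sememe_vectors, secret_sememe_space)
    # Step 4: alarm on and block the file when the similarity exceeds the threshold.
    return similarity > threshold
```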
Key technology analysis:
(1) Preprocessing stage:
In a vector space model for text analysis, a word is described by a vector, the most common form being the one-hot representation.
An obvious disadvantage of the one-hot representation is that it establishes no association between words.
In deep learning, a word is generally described by a distributed representation, often called "word representation" or "word embedding", i.e., what is colloquially called a "word vector".
In the sentence preprocessing stage, the machine learning file detection system segments sentences into word vectors using the word2vec algorithm.
The word2vec algorithm is adopted because it builds the word segmentation and embedding model more simply; compared with a feed-forward neural network language model (FFNNLM), the main difference is that the model is simpler: the hidden layer of the FFNNLM is removed, and the input layer connects directly to the output layer.
When training a language model, the m-th word is predicted from the n words preceding it; when training word vectors, the m-th word is predicted from the n words before and after it, so the context is genuinely used for prediction.
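A minimal preprocessing sketch, assuming the gensim implementation of word2vec and jieba for Chinese word segmentation (neither library is named in the patent, and the toy corpus is invented):

```python
import jieba
from gensim.models import Word2Vec

# Toy corpus; in the described system the corpus would come from the documents
# to be protected and the annotated training corpus.
corpus = ["涉密文件禁止外发", "本文件包含内部研发数据"]
tokenized = [list(jieba.cut(s)) for s in corpus]

# CBOW-style word2vec: the m-th word is predicted from the words around it,
# so the resulting vectors encode context rather than surface form.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)

def preprocess(sentence: str):
    """Split a sentence into word vectors, skipping out-of-vocabulary words."""
    return [w2v.wv[w] for w in jieba.cut(sentence) if w in w2v.wv]
```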
(2) Sememe matching stage:
The basic flow of the sememe matching stage is as follows:
construct a word vector space for the sentence, and construct a vector space for the sememe set;
match the word vector space with a neural network algorithm, converting it into the corresponding sememe sequence;
construct a fuzzy matching set of sememes by combining the K-nearest neighbor algorithm, and apply the fuzzy matching algorithm to the converted sememe sequence to perform secrecy detection and judge the degree of confidentiality.
The mapping from word vectors to sememes is realized with a neural network algorithm. The detection module builds a training corpus with reference to the People's Daily annotated corpus and combines it with the HowNet knowledge base: the words in the training corpus serve as input and the HowNet sememe annotations serve as labels for training the neural network, which avoids the storage and traversal problems of a large-scale knowledge base in traditional methods. The neural network maps an input word vector to a sememe vector in the sememe space, completing the conversion from word sequence to sememe sequence. After the text to be detected is input, its word sequence is preprocessed, fed into the neural network, and converted into a sememe sequence, completing the sememe mapping work.
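A hedged PyTorch sketch of the word-vector-to-sememe mapping network described above; the layer sizes, the size of the sememe label set, and the random stand-in tensors are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class WordToSememe(nn.Module):
    """Maps a sequence of word vectors to per-position sememe labels (HowNet-style)."""
    def __init__(self, word_dim: int = 100, hidden: int = 128, n_sememes: int = 2000):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_sememes)

    def forward(self, word_vecs: torch.Tensor) -> torch.Tensor:
        # word_vecs: (batch, seq_len, word_dim) -> sememe logits per position.
        h, _ = self.lstm(word_vecs)
        return self.out(h)

# Hypothetical training step: corpus words as input, HowNet sememe annotations as labels.
model = WordToSememe()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

word_vecs = torch.randn(8, 20, 100)              # stand-in for preprocessed sentences
sememe_labels = torch.randint(0, 2000, (8, 20))  # stand-in for HowNet sememe annotations
opt.zero_grad()
loss = loss_fn(model(word_vecs).reshape(-1, 2000), sememe_labels.reshape(-1))
loss.backward()
opt.step()
```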
Because classified-text detection is oriented to sememe sequences of complete sentences, the sequence length is not fixed and long sequences occur with high probability, so a model capable of storing long time-series information is needed, and the long short-term memory (LSTM) network meets this requirement relatively well. The LSTM model stores time-series information more completely and learns long sequences more strongly; compared with a traditional knowledge-base-based language model, it learns the temporal evolution of the information rather than the information itself, so the requirement on the knowledge base is greatly reduced. By continuously accumulating time-series information, the LSTM model extracts more contextual semantic features, and by learning these features it can analyze and judge the input sememe sequence according to the context accumulated over the time series. The LSTM model is therefore better suited to building a natural language model over sememe feature sequences. Accordingly, the system integrates multiple LSTM network models with the Bagging algorithm, enlarging the prediction range of the models, improving the prediction capability, and obtaining more predicted sememes that satisfy the context information.
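The Bagging integration can be sketched as follows: each sub-model is trained on a bootstrap sample of the corpus, and their next-sememe predictions are pooled to widen the candidate set; the class and function names are illustrative only:

```python
import random
import torch

def bootstrap(dataset: list, k: int) -> list:
    """Draw k sub-datasets by sampling with replacement (the Bagging resampling step)."""
    n = len(dataset)
    return [[dataset[random.randrange(n)] for _ in range(n)] for _ in range(k)]

class SememeEnsemble:
    """Pools the next-sememe predictions of several LSTM sub-models."""
    def __init__(self, sub_models):
        self.sub_models = sub_models  # e.g. models trained on different bootstrap sub-datasets

    @torch.no_grad()
    def predict_next(self, sememe_seq: torch.Tensor, top_k: int = 3) -> set:
        candidates = set()
        for m in self.sub_models:
            logits = m(sememe_seq)[:, -1, :]  # each sub-model's prediction for the next time step
            candidates |= set(torch.topk(logits, top_k, dim=-1).indices.flatten().tolist())
        return candidates  # a wider set of context-compatible sememes than any single model gives
```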
After the sememe sequence is input into the ensemble model, the ensemble model feeds the sequence to each sub-model, and each sub-model predicts the collocated sememe for the next time step of the input sememes according to its own learning experience and the current context information.
During training, the sub-models learn from different data and fit the data with different tendencies, so this differentiation is exploited to find more collocated sememes that satisfy the current context. Although the LSTM network can reduce the impact of vanishing and exploding gradients to some extent, when processing long-range memory information it cannot avoid an explosive growth of memory caused by excessive redundant information; to cope with this, the system proposes a double-layer LSTM network model that shares hidden state information.
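One possible reading of the "double-layer LSTM sharing hidden state information" is sketched below: the second layer is initialised with the first layer's final hidden and cell states instead of zeros, so both layers work from the same summary of the sequence. This interpretation is an assumption; the patent does not detail the wiring:

```python
import torch
import torch.nn as nn

class SharedStateTwoLayerLSTM(nn.Module):
    """Two stacked LSTM layers in which layer 2 starts from layer 1's final (h, c) state."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.layer1 = nn.LSTM(dim, dim, batch_first=True)
        self.layer2 = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out1, (h1, c1) = self.layer1(x)
        # Share the hidden state: layer 2 is initialised with layer 1's (h, c),
        # so redundant long-range information is not accumulated twice.
        out2, _ = self.layer2(out1, (h1, c1))
        return out2

x = torch.randn(4, 30, 128)
print(SharedStateTwoLayerLSTM()(x).shape)  # torch.Size([4, 30, 128])
```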
(3) Similarity calculation by fuzzy matching algorithm
In stage (2), the system uses a deep learning method to complete the word-to-sememe mapping; this method sacrifices some precision in exchange for better time and space complexity. The precision loss may strongly affect the sememe collocation prediction, so to improve prediction precision, in addition to the multi-model ensemble prediction model, a fuzzy matching method is needed to widen the range of sememe prediction. Fuzzy matching is usually performed through similarity. In the preceding work a sememe vector space was constructed from the HowNet knowledge base; this vector space is an embedded feature space over a high-dimensional sememe space, and every word in the corpus can find its corresponding point in it. In this vector space, the closer two word vectors are, or the higher their similarity, the closer the semantics of the two words. On this basis, the system measures the similarity of two sememe vectors with the Euclidean distance and performs fuzzy matching on the predicted sememes with a non-parametric machine learning method.
the machine learning method comprises a parameter-free method and a parameter-free method, wherein the parameter-free method comprises a K nearest neighbor algorithm (KNN) K-MEANS clustering method, a DBSCAN density clustering method and the like, the parameter-free method has the advantages that pre-training is not needed, raw data are directly processed, the parameter-free method is an important learning method in machine learning, the K nearest neighbor algorithm is adopted for fuzzy matching, and Euclidean distance is used for measuring the similarity between primitive vectors.
The technical innovation of the invention lies in updating the technology for judging whether a file is leaked; applying machine learning and neural network technology to the field of data leakage prevention is a leading implementation in the industry. The new system no longer judges leakage only by simple keyword matching: it adopts a Chinese semantic matching model, trains an automatic sentence-to-sememe conversion and matching model through a neural network, judges the secret-relatedness of the semantics and calculates the secret-related similarity by combining a fuzzy matching algorithm with a K-nearest neighbor algorithm, and finally judges the file to be leaked when the similarity exceeds a threshold.
the novel data leakage-proof system adopts machine learning and neural network technologies, establishes a Chinese semantic processing model with self-learning, self-decomposition semantics and self-judgment, avoids the defects existing in detection only by keywords, enlarges the detection range and the detection precision, can realize secret-related judgment on fuzzy articles without keywords through semantic analysis, and has better effect on secret leakage prevention.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (6)

1. A machine learning-based data leakage prevention system, characterized in that it comprises
a preprocessing module, which preprocesses sentences, performs word segmentation and vectorization on them with the word2vec algorithm, and splits them into short word-vector clauses;
a semantic secret-related similarity calculation module, which judges and calculates the secret-related similarity of the semantics by combining a fuzzy matching algorithm with a K-nearest neighbor algorithm;
and a secret-relatedness judgment module, which compares the secret-related similarity calculated by the semantic secret-related similarity calculation module with a specified threshold, and alarms on and blocks files whose calculated secret-related similarity exceeds that threshold.
2. The machine learning-based data leakage prevention system of claim 1, wherein the method by which the semantic secret-related similarity calculation module judges and calculates the secret-related similarity is as follows:
S2-1: construct a word vector space for the sentence, and construct a vector space for the sememe set;
S2-2: match the word vector space from S2-1 using a neural network algorithm, converting it into the corresponding sememe sequence;
S2-3: combining a non-parametric machine learning method, construct a sememe fuzzy matching set from the sememe sequence obtained in S2-2, and apply the fuzzy matching algorithm to the converted sememe sequence to perform secrecy detection and judge the degree of confidentiality.
3. The machine learning-based data leakage prevention system of claim 2, wherein the mapping from word vectors to sememes in S2-2 is realized with a neural network algorithm; the detection module builds a training corpus with reference to the People's Daily annotated corpus and combines it with the HowNet knowledge base, using the words in the training corpus as input and the HowNet sememe annotations as labels to train the neural network, which avoids the storage and traversal problems of a large-scale knowledge base in traditional methods; the neural network maps an input word vector to a sememe vector in the sememe space, completing the conversion from word sequence to sememe sequence; after the text to be detected is input, its word sequence is preprocessed, fed into the neural network, and converted into a sememe sequence, completing the sememe mapping work.
4. The machine learning-based data leakage prevention system of claim 3, wherein S2-2 integrates multiple long short-term memory (LSTM) network models using the Bagging algorithm, thereby expanding the prediction range of the models, improving their prediction capability, and obtaining more predicted sememes that satisfy the context information.
5. The machine learning-based data leakage prevention system of claim 4, wherein the LSTM network model integrated by Bagging in S2-2 is specifically a double-layer LSTM network model that shares hidden state information.
6. The machine learning-based data leakage prevention system of claim 2, wherein S2-3 measures the similarity of two sememe vectors with the Euclidean distance and performs fuzzy matching on the predicted sememes with the K-nearest neighbor algorithm.
CN202111221497.6A 2021-10-20 2021-10-20 Machine learning-based data leakage prevention system Pending CN113672915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111221497.6A CN113672915A (en) 2021-10-20 2021-10-20 Machine learning-based data leakage prevention system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111221497.6A CN113672915A (en) 2021-10-20 2021-10-20 Machine learning-based data leakage prevention system

Publications (1)

Publication Number Publication Date
CN113672915A true CN113672915A (en) 2021-11-19

Family

ID=78550659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111221497.6A Pending CN113672915A (en) 2021-10-20 2021-10-20 Machine learning-based data leakage prevention system

Country Status (1)

Country Link
CN (1) CN113672915A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819604A (en) * 2012-08-20 2012-12-12 徐亮 Method for retrieving confidential information of file and judging and marking security classification based on content correlation
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN110019640A (en) * 2017-07-25 2019-07-16 杭州盈高科技有限公司 Confidential document inspection method and device
CN110298024A (en) * 2018-03-21 2019-10-01 西北工业大学 Detection method, device and the storage medium of security files
CN110287333A (en) * 2019-06-12 2019-09-27 北京语言大学 A kind of knowledge based library carries out the method and system of paraphrase generation

Similar Documents

Publication Publication Date Title
CN111897970A (en) Text comparison method, device and equipment based on knowledge graph and storage medium
CN107798033B (en) Case text classification method in public security field
CN111694958A (en) Microblog topic clustering method based on word vector and single-pass fusion
CN110728151B (en) Information depth processing method and system based on visual characteristics
CN115081437B (en) Machine-generated text detection method and system based on linguistic feature contrast learning
Gao et al. Automatic image annotation through multi-topic text categorization
CN102426585A (en) Webpage automatic classification method based on Bayesian network
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN108280357A (en) Data leakage prevention method, system based on semantic feature extraction
CN114201583A (en) Chinese financial event automatic extraction method and system based on graph attention network
CN113672915A (en) Machine learning-based data leakage prevention system
Xu et al. A block-level RNN model for resume block classification
CN116842934A (en) Multi-document fusion deep learning title generation method based on continuous learning
CN116628377A (en) Webpage theme relevance judging method
Cai et al. Semantic entity detection by integrating CRF and SVM
CN115546496A (en) Internet of things equipment identification method and device under active detection scene
CN114842301A (en) Semi-supervised training method of image annotation model
CN113987536A (en) Method and device for determining security level of field in data table, electronic equipment and medium
CN114548104A (en) Few-sample entity identification method and model based on feature and category intervention
Yang et al. Web service clustering method based on word vector and biterm topic model
CN112270185A (en) Text representation method based on topic model
CN112784227A (en) Dictionary generating system and method based on password semantic structure
Gao et al. A supervised named entity recognition method based on pattern matching and semantic verification
Ren et al. Learning refined features for open-world text classification with class description and commonsense knowledge
Pan et al. Attentive Feature Focusing for Person Search by Natural Language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211119