CN110826320A - Sensitive data discovery method and system based on text recognition - Google Patents

Sensitive data discovery method and system based on text recognition

Info

Publication number
CN110826320A
CN110826320A (application CN201911195301.3A)
Authority
CN
China
Prior art keywords
model
data
training
text
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911195301.3A
Other languages
Chinese (zh)
Other versions
CN110826320B (en)
Inventor
殷钱安
梁淑云
刘胜
马影
陶景龙
王启凡
魏国富
徐�明
余贤喆
周晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd filed Critical Information and Data Security Solutions Co Ltd
Priority to CN201911195301.3A priority Critical patent/CN110826320B/en
Publication of CN110826320A publication Critical patent/CN110826320A/en
Application granted granted Critical
Publication of CN110826320B publication Critical patent/CN110826320B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention relates to a sensitive data discovery method based on text recognition, which comprises the following steps: S01, sample data extraction; S02, training sample construction: collect a text data set and construct training samples; S03, sample labeling model training: train a text labeling model on the training samples obtained in S02; S04, data feature construction; S05, training set construction: label the data set obtained in S04 to form a training set for building a classification judgment model; S06, classification judgment model construction: form a variable prediction model from the training set obtained in S05; and S07, model testing. Through the recognition of data variables, the method judges and recognizes sensitive data accurately and efficiently even when the data dictionary and matching rules are incomplete, and ensures the consistency of recognition and classification results.

Description

Sensitive data discovery method and system based on text recognition
Technical Field
The invention relates to the technical field of data security, in particular to a sensitive data discovery method and system based on text recognition.
Background
Data is a supporting foundation of enterprise operation and a core part of the enterprise information system; once a problem occurs in a data-related management or application system, the image and development of the enterprise are seriously affected, so data security has always been a subject of great concern for enterprises. At present, data protection schemes in practical application mainly comprise data isolation, permission settings, data desensitization and the like. The protection of sensitive data is particularly important, and the core of a sensitive data protection scheme is to select the sensitive data from massive data, that is, to identify it accurately.
At present, the identification of sensitive data mainly depends on a dictionary matching method and a manual identification method.
For example, application cn201910600215.x discloses a data center data checking system in which pattern matching formulas for sensitive data are defined manually and data is matched one by one; when a piece of data satisfies a formula, it is defined as sensitive data. The matching target can be metadata or data content. The manual identification method, in contrast, relies mainly on the personal experience of risk assessors and a predefined sensitive data dictionary. Based on predefined data models, such as database design models or file system organizational structures, risk assessors determine empirically which definitions in a model belong to sensitive data, and then discover and identify sensitive data from data samples.
The sensitive data dictionary matching method has the following defects: 1. low recognition accuracy: dictionary matching adopts patterned matching, so the completeness of the data dictionary determines the recognition accuracy of sensitive data, and when the dictionary is incomplete or wrongly built, accuracy drops; 2. interference with classification results: because the same data item can match several data dictionaries, and traditional data dictionaries apply no weighting, classification results are disturbed and become inaccurate.
The manual identification method has the following defects: 1. low identification speed: with manual processing, combing through a large amount of data takes far longer than machine recognition, and it places high demands on the professional quality of the personnel; 2. in text log data the similarity between texts is very high, so the dictionary matching method has low accuracy and weak recognition ability, and the matching rules require continuous optimization by related personnel as the data changes.
Disclosure of Invention
The invention aims to solve the technical problems of low speed and low precision of sensitive data identification in the prior art.
The invention solves the technical problems through the following technical means:
a sensitive data discovery method based on text recognition comprises the following steps:
s01, sample data extraction, extracting a standardized service data table within the specified time as original sample data;
s02, constructing training samples, collecting a text data set, labeling the keywords in the text data set by using a text labeling tool, and constructing a large number of training samples;
s03, training a sample labeling model, based on the training samples obtained in S02, training the text labeling model by using a bidirectional long short-term memory network and a conditional random field;
s04, data feature construction, namely constructing feature variables for describing data features by combining the text annotation model obtained in S03 on the basis of original sample data of S01;
s05, constructing a training set, defining a sensitive data label, and depicting the label of the data set obtained in the S04 to form the training set for constructing a classification judgment model;
s06, constructing a classification judgment model, and performing variable classification model training by using a catboost algorithm according to the training set obtained in the S05 to form a variable prediction model;
and S07, testing a model, namely describing and constructing the characteristic of unidentified data in the production environment based on the characteristic construction mode of S04 to form a prediction set, and judging whether the variable is a sensitive field and the type of the sensitive field by utilizing the classification judgment model obtained in S06.
According to the method, the sensitive data can be accurately and efficiently judged and recognized under the condition that the data dictionary and the matching rule are incomplete through the recognition of the data variable, and the consistency of recognition and classification results is ensured.
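As a rough orientation, the steps S01 to S07 above can be sketched as a small pipeline. The sketch below is illustrative only: every function name is a hypothetical stand-in, the features are a tiny subset of those described in S04, and the classifier is a trivial rule standing in for the CatBoost model of S06.

```python
# Hypothetical sketch of the S01-S07 pipeline; names and logic are
# illustrative stand-ins, not the patent's actual implementation.

def extract_samples(tables):
    """S01: flatten standardized business-data tables into raw sample values."""
    return [value for table in tables for value in table]

def build_features(value):
    """S04: a couple of simple per-value features (the patent uses many more)."""
    return {"length": len(value), "is_numeric": value.isdigit()}

def train_classifier(features, labels):
    """S05-S06: trivial learned rule standing in for the CatBoost classifier."""
    numeric_is_sensitive = any(f["is_numeric"] and y for f, y in zip(features, labels))
    return lambda f: f["is_numeric"] and numeric_is_sensitive

def discover_sensitive(tables, labelled):
    """S07: score unidentified production data with the trained model."""
    feats = [build_features(v) for v, _ in labelled]
    labels = [y for _, y in labelled]
    model = train_classifier(feats, labels)
    return [v for v in extract_samples(tables) if model(build_features(v))]
```

With labelled samples where the all-digit values (e.g. phone numbers) are marked sensitive, discover_sensitive flags the all-digit values found in the production tables.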
Preferably, in S02, the key words in the text data set are labeled by using a BIO labeling method: labeling each element as "B-X", "I-X", or "O"; wherein "X" indicates that the annotation element belongs to the type, "B-X" indicates that the fragment in which the element is located belongs to the type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to the type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type.
Preferably, in S03, the text annotation model includes a representation layer, a Bi-LSTM layer, and a CRF layer, wherein
Presentation layer: each unit in the sentence represents a vector formed by embedding characters or words;
Bi-LSTM layer: using the word embedding or word embedding vector obtained by the presentation layer as the input of each time step of the bidirectional LSTM; outputting respective scores of all labels of each word of the sentence through bidirectional LSTM layer training;
CRF layer: the layer uses the output of the Bi-LSTM layer, namely the emission probability matrix of all labels of each word, and the transition probability matrix calculated based on the text as the parameters of the original CRF model, finally obtains the probability of the label sequence, and selects the label class with the maximum probability value, namely the category of each word.
Preferably, in S06, the variable classification model is trained by using the CatBoost algorithm, specifically:
The gradient boosting tree model is represented as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ its parameters, and $M$ the number of trees;

firstly, a forward stagewise algorithm is adopted to determine the initial boosting tree $f_0(x) = 0$, and the model at step $m$ is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

the parameters of the next tree are determined by empirical risk minimization:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$
GBDT handles categorical variables with a greedy target-based statistics method, replacing each categorical value with the average of the target labels of the records sharing that value:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}$$

where $n$ denotes the number of samples, $x_{i,k}$ the value of the $k$-th feature of the $i$-th record, and $Y_i$ the target label of the $i$-th record.
Preferably, CatBoost improves greedy TBS by adding a prior distribution term to reduce the influence of noise and low-frequency data on the data distribution:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j + a \cdot P}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] + a}$$

where $P$ is the added prior term and $a$ is a weight coefficient, usually greater than 0.
The invention also provides a sensitive data discovery system based on text recognition, which comprises
The sample data extraction module extracts a standardized service data table within specified time as original sample data;
a training sample module is constructed, a text data set is collected, keywords in the text data set are labeled by a text labeling tool, and a large number of training samples are constructed;
the training sample labeling model module is used for training a text labeling model by utilizing a bidirectional long and short memory network and a conditional random field based on the obtained training sample;
the data characteristic construction module is used for constructing characteristic variables for describing data characteristics by combining the obtained text labeling model on the basis of original sample data;
the training set construction module is used for defining a sensitive data label, and performing label portrayal on the obtained data set to form a training set for constructing a classification judgment model;
constructing a classification judgment model module, and performing variable classification model training by using a catboost algorithm according to the obtained training set to form a variable prediction model;
and the model testing module is used for describing and constructing the characteristics of the unidentified data based on the characteristic construction mode of the data characteristic construction module to form a prediction set, and judging whether the variable is the sensitive field and the type of the sensitive field by using the obtained classification judgment model.
Preferably, in the building of the training sample module, a BIO labeling method is adopted to label the keywords in the text data set: labeling each element as "B-X", "I-X", or "O"; wherein "X" indicates that the annotation element belongs to the type, "B-X" indicates that the fragment in which the element is located belongs to the type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to the type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type.
Preferably, in the training sample labeling model module, the text labeling model comprises a representation layer, a Bi-LSTM layer and a CRF layer, wherein
Presentation layer: each unit in the sentence represents a vector formed by embedding characters or words;
Bi-LSTM layer: using the word embedding or word embedding vector obtained by the presentation layer as the input of each time step of the bidirectional LSTM; outputting respective scores of all labels of each word of the sentence through bidirectional LSTM layer training;
CRF layer: the layer uses the output of the Bi-LSTM layer, namely the emission probability matrix of all labels of each word, and the transition probability matrix calculated based on the text as the parameters of the original CRF model, finally obtains the probability of the label sequence, and selects the label class with the maximum probability value, namely the category of each word.
Preferably, in the classification judgment model construction module, the variable classification model is trained by using the CatBoost algorithm, specifically:
The gradient boosting tree model is represented as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ its parameters, and $M$ the number of trees;

firstly, a forward stagewise algorithm is adopted to determine the initial boosting tree $f_0(x) = 0$, and the model at step $m$ is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

the parameters of the next tree are determined by empirical risk minimization:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$
GBDT handles categorical variables with a greedy target-based statistics method, replacing each categorical value with the average of the target labels of the records sharing that value:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}$$

where $n$ denotes the number of samples, $x_{i,k}$ the value of the $k$-th feature of the $i$-th record, and $Y_i$ the target label of the $i$-th record.
Preferably, CatBoost improves greedy TBS by adding a prior distribution term:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j + a \cdot P}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] + a}$$

where $P$ is the added prior term and $a$ is a weight coefficient greater than 0.
The invention has the advantage that, under conditions where the data dictionary and matching rules are incomplete and text fields are difficult to distinguish by rules alone, sensitive data can still be judged and recognized accurately through the recognition of data variables, avoiding the interference among classification results that makes rule-based classification inaccurate.
Drawings
FIG. 1 is a block diagram showing the flow of a method in embodiment 1 of the present invention;
fig. 2 is a flow chart of a process executed by the method in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
The embodiment provides a sensitive data discovery method based on text recognition, which specifically includes the following steps, as shown in fig. 1:
s01: sample data extraction
A standardized service data table within a specified time period (day/month) is extracted as the original sample data.
S02: text labeling process
A large amount of text corpora is collected, and keywords in the corpora are manually labeled with the BIO labeling method using a text labeling tool, constructing a large number of training samples.
BIO labeling: each element is labeled "B-X", "I-X", or "O". Wherein "X" indicates that the annotation element belongs to the type, "B-X" indicates that the fragment in which the element is located belongs to the type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to the type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type.
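For illustration, BIO tags can be turned back into labelled spans with a few lines of Python; the tag set ("PER", "LOC") here is an arbitrary example, not the patent's actual label inventory.

```python
# Recover (entity_type, text) spans from a BIO-tagged token sequence.
# Illustrative only; the "PER"/"LOC" types are example labels.

def bio_spans(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # "B-X": a fragment of type X begins
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)       # "I-X": continue the current fragment
        else:                              # "O" (or an inconsistent tag) ends it
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(t, "".join(parts)) for t, parts in spans]
```

For example, the tokens of "张三在北京" tagged as B-PER, I-PER, O, B-LOC, I-LOC yield the person span "张三" and the location span "北京".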
S03: training text label model
According to the labeled text data obtained in S02, model training is performed using a bidirectional Long Short-Term Memory (Bi-LSTM) network and a Conditional Random Field (CRF). The overall model structure is divided into a representation layer, a Bi-LSTM layer and a CRF layer.
First layer (representation layer): each unit in a sentence is represented as a vector of character embeddings or word embeddings, where the character embeddings are initialized randomly and the word embeddings are obtained by training on the data. All embeddings are adjusted toward their optimum during training.
Second layer (Bi-LSTM layer): the character or word embedding vectors from the first layer are used as the input of each time step of the bidirectional LSTM. Through Bi-LSTM layer training, a score for every label of each word in the sentence is output.
Third layer (CRF layer): the layer uses the respective scores of all labels of each word output by the Bi-LSTM layer, namely an emission probability matrix and a transition probability matrix calculated based on the text as parameters of an original CRF model, finally obtains the probability of a label sequence, and selects the label class with the maximum probability value, namely the category of each word.
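The CRF layer's decoding step can be illustrated with a minimal Viterbi search: the per-token label scores play the role of the Bi-LSTM emission matrix, and the transition matrix scores label-to-label moves. This is a bare sketch with toy scores, not the patent's trained model.

```python
# Minimal Viterbi decoding over emission and transition scores, as the
# CRF layer does to select the highest-probability label sequence.
# Scores are illustrative; a real CRF works with learned parameters.

def viterbi(emissions, transitions):
    """emissions: list of per-label scores per token (the Bi-LSTM output);
    transitions[i][j]: score of moving from label i to label j.
    Returns the best-scoring label index sequence."""
    n_labels = len(emissions[0])
    score = list(emissions[0])          # best score ending in each label
    back = []                           # backpointers per step
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score, back = new_score, back + [pointers]
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(back):     # walk backpointers to recover the path
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

With two labels, emission scores favoring label 0 then label 1, and neutral transitions, the decoder returns the sequence [0, 1].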
S04: data feature construction
Based on the original data obtained in S01, feature variables for describing features of the data are constructed, which mainly include two aspects of features:
the method comprises the following steps of firstly, analyzing type characteristics summarized based on the characteristics of data, wherein the type characteristics comprise data length, whether the data are numerical types, whether special characters are contained, non-numerical character proportion and the like;
secondly, text labeling is carried out on text type variable contents by using an S03 text labeling model, and corresponding variable characteristics are constructed according to the labeled text types, wherein the corresponding variable characteristics comprise the number of text participles, the number of labeling types of words contained in the text, the proportion of labeling types of words contained in the text (such as names of people, places, names of countries and organizations), the number of words with various lengths and the like.
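The first group of analysis-type features might be computed as below; the exact feature set and the definition of "special character" are assumptions for illustration (the second, annotation-derived group would additionally require the S03 tagger).

```python
import re

# Illustrative analysis-type features from S04: length, numeric flag,
# special-character flag, non-numeric ratio. The special-character class
# (anything outside digits, Latin letters, CJK) is an assumption.

def analysis_features(value: str) -> dict:
    non_numeric = sum(1 for c in value if not c.isdigit())
    return {
        "length": len(value),
        "is_numeric": value.isdigit(),
        "has_special": bool(re.search(r"[^0-9A-Za-z\u4e00-\u9fff]", value)),
        "non_numeric_ratio": non_numeric / len(value) if value else 0.0,
    }
```

For a value like "ID:123", this yields length 6, a special character (the colon), and a non-numeric ratio of 0.5.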
S05: training set sensitive class label extraction
Based on business experience and the sensitive data labels defined for the corresponding industry, the data set obtained in S04 is labeled to form a training set for constructing the classification model.
S06: constructing a classification judgment model
Variable classification model training is performed with the CatBoost algorithm on the training set constructed in S05 to form a variable prediction model, which is saved for convenient real-time invocation. The advantage of the CatBoost algorithm is that it optimizes the handling of categorical variables on the basis of the gradient boosting decision tree (GBDT).
The gradient boosting tree model is represented as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ its parameters, and $M$ the number of trees.

First, a forward stagewise algorithm is adopted to determine the initial boosting tree $f_0(x) = 0$; the model at step $m$ is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

The parameters of the next tree are determined by empirical risk minimization:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$
GBDT handles categorical variables with a greedy target-based statistics method (greedy TBS), replacing each categorical value with the average of the corresponding target labels:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}$$
The greedy TBS method has an obvious drawback: the features generally contain more information than the labels, and if label averages are forced to represent feature values, a conditional shift arises when the structure or distribution of the training and test sets differ, causing overfitting and a poor classification effect.
CatBoost improves greedy TBS by adding a prior distribution term to reduce the influence of noise and low-frequency data on the data distribution:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j + a \cdot P}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] + a}$$

where $n$ denotes the number of samples, $x_{i,k}$ the value of the $k$-th feature of the $i$-th record, $Y_i$ the target label of the $i$-th record, $P$ the added prior term, and $a$ a weight coefficient, usually greater than 0. For features with few categories, adding the prior term also reduces noisy data.
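The prior-smoothed greedy TBS substitution described above can be reproduced in a few lines. This is an illustrative batch version of the formula; CatBoost itself applies ordered variants of these target statistics.

```python
from collections import defaultdict

# Prior-smoothed greedy target-based statistics:
#   x_hat = (sum of labels over matching rows + a * P) / (match count + a)
# Illustrative batch implementation of the formula above.

def greedy_tbs(values, targets, prior=0.5, a=1.0):
    total, count = defaultdict(float), defaultdict(int)
    for v, y in zip(values, targets):
        total[v] += y
        count[v] += 1
    return [(total[v] + a * prior) / (count[v] + a) for v in values]
```

With values ["a", "a", "b"], targets [1, 0, 1], prior P = 0.5 and a = 1, the category "a" encodes to (1 + 0.5) / (2 + 1) = 0.5 and "b" to (1 + 0.5) / (1 + 1) = 0.75.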
S07: identifying data variables based on model
The features of unidentified data in the production environment are constructed in the feature construction manner of S04 to form a prediction set, and the classification model obtained in S06 is used to judge whether each variable is a sensitive field and, if so, the type of sensitive field.
Example 2
Correspondingly, the embodiment also provides a sensitive data discovery system based on text recognition, which is characterized in that: comprises that
The sample data extraction module extracts a standardized service data table within specified time as original sample data;
a training sample module is constructed, a text data set is collected, keywords in the text data set are labeled by a text labeling tool, and a large number of training samples are constructed; BIO labeling: each element is labeled "B-X", "I-X", or "O". Wherein "X" indicates that the annotation element belongs to the type, "B-X" indicates that the fragment in which the element is located belongs to the type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to the type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type.
The training sample labeling model module is used for training a text labeling model by utilizing a bidirectional long short-term memory network and a conditional random field based on the obtained training sample. According to the labeled text data obtained by the training sample construction module, model training is performed using a bidirectional Long Short-Term Memory (Bi-LSTM) network and a Conditional Random Field (CRF). The overall model structure is divided into a representation layer, a Bi-LSTM layer and a CRF layer.
First layer (representation layer): each unit in a sentence is represented as a vector of character embeddings or word embeddings, where the character embeddings are initialized randomly and the word embeddings are obtained by training on the data. All embeddings are adjusted toward their optimum during training.
Second layer (Bi-LSTM layer): the character or word embedding vectors from the first layer are used as the input of each time step of the bidirectional LSTM. Through Bi-LSTM layer training, a score for every label of each word in the sentence is output.
Third layer (CRF layer): the layer uses the respective scores of all labels of each word output by the Bi-LSTM layer, namely an emission probability matrix and a transition probability matrix calculated based on the text as parameters of an original CRF model, finally obtains the probability of a label sequence, and selects the label class with the maximum probability value, namely the category of each word.
The data characteristic construction module is used for constructing characteristic variables for describing data characteristics by combining the obtained text labeling model on the basis of original sample data; in particular to
First, analysis-type features summarized from the characteristics of the data, including data length, whether the data is numeric, whether special characters are contained, the proportion of non-numeric characters, and the like;
Second, text labeling is performed on text-type variable contents with the text labeling model, and corresponding variable features are constructed according to the labeled text types, including the number of text tokens, the number of annotation types among the words in the text, the proportion of each annotation type (such as person names, place names, country names and organization names), the number of words of various lengths, and the like.
The training set construction module is used for defining a sensitive data label, and performing label portrayal on the obtained data set to form a training set for constructing a classification judgment model;
The classification judgment model construction module performs variable classification model training with the CatBoost algorithm on the obtained training set to form a variable prediction model, which is saved for real-time invocation. The advantage of the CatBoost algorithm is that it optimizes the handling of categorical variables on the basis of the gradient boosting decision tree (GBDT).
The gradient boosting tree model is represented as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ its parameters, and $M$ the number of trees.

First, a forward stagewise algorithm is adopted to determine the initial boosting tree $f_0(x) = 0$; the model at step $m$ is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

The parameters of the next tree are determined by empirical risk minimization:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$
GBDT handles categorical variables with a greedy target-based statistics method (greedy TBS), replacing each categorical value with the average of the corresponding target labels:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}$$
The greedy TBS method has an obvious drawback: the features generally contain more information than the labels, and if label averages are forced to represent feature values, a conditional shift arises when the structure or distribution of the training and test sets differ, causing overfitting and a poor classification effect.
CatBoost improves greedy TBS by adding a prior distribution term to reduce the influence of noise and low-frequency data on the data distribution:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j + a \cdot P}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] + a}$$

where $n$ denotes the number of samples, $x_{i,k}$ the value of the $k$-th feature of the $i$-th record, $Y_i$ the target label of the $i$-th record, $P$ the added prior term, and $a$ a weight coefficient, usually greater than 0. For features with few categories, adding the prior term also reduces noise data.
And the model testing module is used for describing and constructing the characteristics of the unidentified data based on the characteristic construction mode of the data characteristic construction module to form a prediction set, and judging whether the variable is the sensitive field and the type of the sensitive field by using the obtained classification judgment model.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A sensitive data discovery method based on text recognition is characterized by comprising the following steps: the method comprises the following steps:
s01, sample data extraction, extracting a standardized service data table within the specified time as original sample data;
s02, constructing a training sample, collecting a text data set, and marking the keywords in the text data set by using a text marking tool to construct the training sample;
s03, training a sample labeling model, based on the training samples obtained in S02, training the text labeling model by using a bidirectional long short-term memory network and a conditional random field;
s04, data feature construction, namely constructing feature variables for describing data features by combining the text annotation model obtained in S03 on the basis of original sample data of S01;
s05, constructing a training set, defining a sensitive data label, and depicting the label of the data set obtained in the S04 to form the training set for constructing a classification judgment model;
s06, constructing a classification judgment model, and performing variable classification model training by using a catboost algorithm according to the training set obtained in the S05 to form a variable prediction model;
and S07, testing a model, namely describing and constructing the characteristic of unidentified data in the production environment based on the characteristic construction mode of S04 to form a prediction set, and judging whether the variable is a sensitive field and the type of the sensitive field by utilizing the classification judgment model obtained in S06.
2. The sensitive data discovery method based on text recognition according to claim 1, wherein: in S02, the keywords in the text data set are labeled with the BIO scheme, each element being labeled "B-X", "I-X", or "O", where X denotes the type of the labeled element; "B-X" indicates that the fragment containing the element is of type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment containing the element is of type X and the element is inside the fragment, and "O" indicates that the element belongs to no type.
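As an illustration, the BIO scheme described in this claim can be decoded back into typed spans with a few lines of code. The sketch below is not part of the patent; the function name `bio_decode` and the sensitive-field type "PHONE" are hypothetical examples:

```python
def bio_decode(tokens, tags):
    """Group BIO-tagged tokens into (type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])      # start a new span of type tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)          # continue the open span
        else:                                 # "O" or an inconsistent I- tag closes the span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(t, "".join(parts)) for t, parts in spans]

tokens = ["1", "3", "8", "x", "a"]
tags   = ["B-PHONE", "I-PHONE", "I-PHONE", "O", "O"]
print(bio_decode(tokens, tags))  # [('PHONE', '138')]
```

Decoding the labels back into spans like this is what turns per-element tags into the keywords the later steps treat as candidate sensitive fields.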
3. The sensitive data discovery method based on text recognition according to claim 2, wherein: in S03, the text labeling model comprises a representation layer, a Bi-LSTM layer, and a CRF layer, wherein
representation layer: each unit of the sentence is represented by a character- or word-embedding vector;
Bi-LSTM layer: the character or word embedding vectors produced by the representation layer serve as the input of each time step of the bidirectional LSTM, and training of the bidirectional LSTM layer outputs a score for every label of every word in the sentence;
CRF layer: this layer takes the output of the Bi-LSTM layer, i.e. the emission probability matrix over all labels of each word, together with a transition probability matrix computed from the text, as the parameters of the original CRF model; it finally obtains the probability of each label sequence and selects the label sequence with the maximum probability, which gives the category of each word.
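Selecting the maximum-probability label sequence from emission and transition scores, as the CRF layer does, is typically done with Viterbi decoding. The following is a minimal sketch only; the emission and transition values are toy numbers, not the patent's trained probability matrices:

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-step label scores; transitions: (K, K) score for
    moving from label i to label j. Returns the best-scoring label sequence."""
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each label so far
    back = np.zeros((T, K), dtype=int)     # backpointers to the previous label
    for t in range(1, T):
        # cand[i, j] = score of being in label i at t-1 then label j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # follow backpointers to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 3 time steps, 2 labels.
em = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]])
tr = np.array([[0.5, -1.0], [-1.0, 0.5]])
print(viterbi(em, tr))  # [0, 0, 0]
```

In a real Bi-LSTM-CRF, `emissions` would be the Bi-LSTM's per-word label scores and `transitions` the learned CRF transition matrix.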
4. The sensitive data discovery method based on text recognition according to claim 1, wherein: in S06, the variable classification model is trained with the CatBoost algorithm; specifically:
the gradient boosting tree model is represented as an additive model over decision trees:
f_M(x) = Σ_{m=1}^{M} T(x; θ_m)
where T(x; θ_m) denotes a decision tree, θ_m its parameters, and M the number of trees;
a forward stagewise algorithm is adopted: the initial boosting tree is f_0(x) = 0, and the model at step m is
f_m(x) = f_{m-1}(x) + T(x; θ_m)
the parameters of the next tree are determined by empirical risk minimization:
θ̂_m = argmin_{θ_m} Σ_{i=1}^{n} L(y_i, f_{m-1}(x_i) + T(x_i; θ_m))
GBDT processes categorical variables with greedy target-based statistics, replacing each categorical value with the mean of the labels of the records taking that value:
x̂_{i,k} = ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] · Y_j ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] )
where n denotes the number of samples, x_{i,k} the value of the k-th feature of the i-th record, and Y_i the target label of the i-th record.
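The greedy target-based substitution above can be sketched in a few lines. This is an illustration under toy data, and `greedy_ts` is a hypothetical helper name, not part of the patent or of the CatBoost library:

```python
def greedy_ts(values, targets):
    """Replace each categorical value with the mean label over all records
    that share the value (greedy target-based statistics, no prior)."""
    sums, counts = {}, {}
    for v, y in zip(values, targets):
        sums[v] = sums.get(v, 0.0) + y
        counts[v] = counts.get(v, 0) + 1
    return [sums[v] / counts[v] for v in values]

# Toy categorical feature with a binary target.
x = ["red", "blue", "red", "red"]
y = [1, 0, 0, 1]
print(greedy_ts(x, y))  # "red" -> 2/3, "blue" -> 0.0
```

Each category is thus mapped to a single numeric statistic, which is what lets a gradient boosting tree split on an originally categorical column.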
5. The sensitive data discovery method based on text recognition according to claim 4, wherein: CatBoost improves Greedy TBS by adding a prior distribution term, with the expression:
x̂_{i,k} = ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] · Y_j + a·P ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] + a )
where P is the added prior term and a is a weighting factor greater than 0.
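The prior term changes the substitution to (sum + a·P) / (count + a), which keeps statistics for rare category values from collapsing to noisy raw label means. A minimal sketch with toy data; `greedy_ts_prior` and the default values of `p` and `a` are hypothetical:

```python
def greedy_ts_prior(values, targets, p=0.5, a=1.0):
    """Greedy target statistics with a prior: (sum + a*p) / (count + a)."""
    sums, counts = {}, {}
    for v, y in zip(values, targets):
        sums[v] = sums.get(v, 0.0) + y
        counts[v] = counts.get(v, 0) + 1
    return [(sums[v] + a * p) / (counts[v] + a) for v in values]

x = ["red", "blue", "red", "red"]
y = [1, 0, 0, 1]
print(greedy_ts_prior(x, y))  # "red" -> (2 + 0.5)/4 = 0.625, "blue" -> 0.5/2 = 0.25
```

Compared with the unsmoothed statistic, "blue" (seen once, label 0) is pulled from 0.0 toward the prior 0.5, which is the stated purpose of the weighting factor a.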
6. A sensitive data discovery system based on text recognition, characterized by comprising:
a sample data extraction module, which extracts a standardized service data table within a specified time window as original sample data;
a training sample construction module, which collects a text data set and labels the keywords in the text data set with a text labeling tool to construct a large number of training samples;
a sample labeling model training module, which trains a text labeling model with a bidirectional long short-term memory network and a conditional random field based on the obtained training samples;
a data feature construction module, which constructs feature variables describing the data features on the basis of the original sample data with the obtained text labeling model;
a training set construction module, which defines sensitive data labels and labels the obtained data set to form the training set for building the classification judgment model;
a classification judgment model construction module, which trains a variable classification model on the obtained training set with the CatBoost algorithm to form a variable prediction model;
and a model testing module, which constructs features for unidentified data following the feature construction scheme of the data feature construction module to form a prediction set, and uses the obtained classification judgment model to judge whether each variable is a sensitive field and, if so, its sensitive-field type.
7. The sensitive data discovery system based on text recognition of claim 6, wherein: in the training sample construction module, the keywords in the text data set are labeled with the BIO scheme, each element being labeled "B-X", "I-X", or "O", where X denotes the type of the labeled element; "B-X" indicates that the fragment containing the element is of type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment containing the element is of type X and the element is inside the fragment, and "O" indicates that the element belongs to no type.
8. The sensitive data discovery system based on text recognition of claim 7, wherein: in the sample labeling model training module, the text labeling model comprises a representation layer, a Bi-LSTM layer, and a CRF layer, wherein
representation layer: each unit of the sentence is represented by a character- or word-embedding vector; the character embeddings are initialized randomly and the word embeddings are obtained through data training, and all embeddings are adjusted toward an optimum during the training process;
Bi-LSTM layer: the character or word embedding vectors produced by the representation layer serve as the input of each time step of the bidirectional LSTM, and training of the bidirectional LSTM layer outputs a score for every label of every word in the sentence;
CRF layer: this layer takes the output of the Bi-LSTM layer, i.e. the emission probability matrix over all labels of each word, together with a transition probability matrix computed from the text, as the parameters of the original CRF model; it finally obtains the probability of each label sequence and selects the label sequence with the maximum probability, which gives the category of each word.
9. The sensitive data discovery system based on text recognition of claim 6, wherein: in the classification judgment model construction module, the variable classification model is trained with the CatBoost algorithm; specifically:
the gradient boosting tree model is represented as an additive model over decision trees:
f_M(x) = Σ_{m=1}^{M} T(x; θ_m)
where T(x; θ_m) denotes a decision tree, θ_m its parameters, and M the number of trees;
a forward stagewise algorithm is adopted: the initial boosting tree is f_0(x) = 0, and the model at step m is
f_m(x) = f_{m-1}(x) + T(x; θ_m)
the parameters of the next tree are determined by empirical risk minimization:
θ̂_m = argmin_{θ_m} Σ_{i=1}^{n} L(y_i, f_{m-1}(x_i) + T(x_i; θ_m))
GBDT processes categorical variables with greedy target-based statistics, replacing each categorical value with the mean of the labels of the records taking that value:
x̂_{i,k} = ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] · Y_j ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] )
where n denotes the number of samples, x_{i,k} the value of the k-th feature of the i-th record, and Y_i the target label of the i-th record.
10. The sensitive data discovery system based on text recognition according to claim 9, wherein: CatBoost improves Greedy TBS by adding a prior distribution term, with the expression:
x̂_{i,k} = ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] · Y_j + a·P ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] + a )
where P is the added prior term and a is a weighting factor greater than 0.
CN201911195301.3A 2019-11-28 2019-11-28 Sensitive data discovery method and system based on text recognition Active CN110826320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911195301.3A CN110826320B (en) 2019-11-28 2019-11-28 Sensitive data discovery method and system based on text recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911195301.3A CN110826320B (en) 2019-11-28 2019-11-28 Sensitive data discovery method and system based on text recognition

Publications (2)

Publication Number Publication Date
CN110826320A true CN110826320A (en) 2020-02-21
CN110826320B CN110826320B (en) 2023-10-13

Family

ID=69543062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911195301.3A Active CN110826320B (en) 2019-11-28 2019-11-28 Sensitive data discovery method and system based on text recognition

Country Status (1)

Country Link
CN (1) CN110826320B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368527A (en) * 2020-02-28 2020-07-03 上海汇航捷讯网络科技有限公司 Key value matching method
CN111522946A (en) * 2020-04-22 2020-08-11 成都中科云集信息技术有限公司 Paper quality evaluation method based on attention long-short term memory recurrent neural network
CN111582825A (en) * 2020-05-09 2020-08-25 焦点科技股份有限公司 Product information auditing method and system based on deep learning
CN111611312A (en) * 2020-05-19 2020-09-01 四川万网鑫成信息科技有限公司 Data desensitization method based on rule engine and block chain technology
CN111666414A (en) * 2020-06-12 2020-09-15 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111753547A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN111752729A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Method for constructing three-layer association relation model and three-layer relation identification method
CN112115264A (en) * 2020-09-14 2020-12-22 中国科学院计算技术研究所苏州智能计算产业技术研究院 Text classification model adjusting method facing data distribution change
CN112232073A (en) * 2020-11-06 2021-01-15 山西三友和智慧信息技术股份有限公司 Bi-LSTM neural network-based text normative detection system and detection method
CN112507376A (en) * 2020-12-01 2021-03-16 浙商银行股份有限公司 Sensitive data detection method and device based on machine learning
CN113378156A (en) * 2021-07-01 2021-09-10 上海观安信息技术股份有限公司 Malicious file detection method and system based on API
CN115757823A (en) * 2022-11-10 2023-03-07 魔方医药科技(苏州)有限公司 Data processing method and device, electronic equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1647970A1 (en) * 2004-10-15 2006-04-19 Microsoft Corporation Hidden conditional random field models for phonetic classification and speech recognition
CN104933443A (en) * 2015-06-26 2015-09-23 北京途美科技有限公司 Automatic identifying and classifying method for sensitive data
CN106886516A (en) * 2017-02-27 2017-06-23 竹间智能科技(上海)有限公司 The method and device of automatic identification statement relationship and entity
CN107818077A (en) * 2016-09-13 2018-03-20 北京金山云网络技术有限公司 A kind of sensitive content recognition methods and device
WO2018178162A1 (en) * 2017-03-28 2018-10-04 Koninklijke Philips N.V. Method and apparatus for intra- and inter-platform information transformation and reuse in predictive analytics and pattern recognition
CN108845988A (en) * 2018-06-07 2018-11-20 苏州大学 A kind of entity recognition method, device, equipment and computer readable storage medium
CN108920460A (en) * 2018-06-26 2018-11-30 武大吉奥信息技术有限公司 A kind of training method and device of the multitask deep learning model of polymorphic type Entity recognition
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
US20190180195A1 (en) * 2015-01-23 2019-06-13 Conversica, Inc. Systems and methods for training machine learning models using active learning
WO2019129775A1 (en) * 2017-12-25 2019-07-04 Koninklijke Philips N.V. A hierarchical entity recognition and semantic modeling framework for information extraction
CN110263166A (en) * 2019-06-18 2019-09-20 北京海致星图科技有限公司 Public sentiment file classification method based on deep learning
WO2019184124A1 (en) * 2018-03-30 2019-10-03 平安科技(深圳)有限公司 Risk-control model training method, risk identification method and apparatus, and device and medium


Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
IEEE: "Topic modelling enriched LSTM models for the detection of novel and emerging named entities from social media", 2017 IEEE International Conference on Big Data (Big Data), 15 January 2018
Zhang Xiaohai et al.: "Named entity recognition for operational documents based on BI-LSTM-CRF", Journal of Information Engineering University, no. 04, 15 August 2019
Zhang Shujing et al.: "Implementation of a quality control system for meteorological warning information based on the Bi-LSTM-CRF algorithm", Computer and Modernization, no. 06, 14 June 2019
Jin Chen et al.: "Chinese word segmentation based on a bidirectional LSTM neural network model", Journal of Chinese Information Processing, no. 02, 15 February 2018
Chen Shimei et al.: "Recognition of Chinese negation information based on the BiLSTM-CRF model", Journal of Chinese Information Processing, no. 11, 15 November 2018

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368527A (en) * 2020-02-28 2020-07-03 上海汇航捷讯网络科技有限公司 Key value matching method
CN111368527B (en) * 2020-02-28 2023-06-20 上海汇航捷讯网络科技有限公司 Key value matching method
CN111522946A (en) * 2020-04-22 2020-08-11 成都中科云集信息技术有限公司 Paper quality evaluation method based on attention long-short term memory recurrent neural network
CN111582825B (en) * 2020-05-09 2021-02-12 焦点科技股份有限公司 Product information auditing method and system based on deep learning
CN111582825A (en) * 2020-05-09 2020-08-25 焦点科技股份有限公司 Product information auditing method and system based on deep learning
CN111611312A (en) * 2020-05-19 2020-09-01 四川万网鑫成信息科技有限公司 Data desensitization method based on rule engine and block chain technology
CN111666414A (en) * 2020-06-12 2020-09-15 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111666414B (en) * 2020-06-12 2023-10-17 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111753547A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN111752729B (en) * 2020-06-30 2023-06-27 上海观安信息技术股份有限公司 Method for constructing three-layer association relation model and three-layer relation identification method
CN111752729A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Method for constructing three-layer association relation model and three-layer relation identification method
CN111753547B (en) * 2020-06-30 2024-02-27 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN112115264A (en) * 2020-09-14 2020-12-22 中国科学院计算技术研究所苏州智能计算产业技术研究院 Text classification model adjusting method facing data distribution change
CN112115264B (en) * 2020-09-14 2024-03-22 中科苏州智能计算技术研究院 Text classification model adjustment method for data distribution change
CN112232073A (en) * 2020-11-06 2021-01-15 山西三友和智慧信息技术股份有限公司 Bi-LSTM neural network-based text normative detection system and detection method
CN112507376A (en) * 2020-12-01 2021-03-16 浙商银行股份有限公司 Sensitive data detection method and device based on machine learning
CN112507376B (en) * 2020-12-01 2024-01-05 浙商银行股份有限公司 Sensitive data detection method and device based on machine learning
CN113378156A (en) * 2021-07-01 2021-09-10 上海观安信息技术股份有限公司 Malicious file detection method and system based on API
CN115757823A (en) * 2022-11-10 2023-03-07 魔方医药科技(苏州)有限公司 Data processing method and device, electronic equipment and storage medium
CN115757823B (en) * 2022-11-10 2024-03-05 魔方医药科技(苏州)有限公司 Data processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110826320B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110826320A (en) Sensitive data discovery method and system based on text recognition
CN112613501A (en) Information auditing classification model construction method and information auditing method
CN109558541B (en) Information processing method and device and computer storage medium
CN109800354B (en) Resume modification intention identification method and system based on block chain storage
CN113535963B (en) Long text event extraction method and device, computer equipment and storage medium
CN107545038B (en) Text classification method and equipment
CN112016313B (en) Spoken language element recognition method and device and warning analysis system
CN110750978A (en) Emotional tendency analysis method and device, electronic equipment and storage medium
CN108363691A (en) A kind of field term identifying system and method for 95598 work order of electric power
CN111462752A Client intention identification method based on attention mechanism, feature embedding and BI-LSTM
CN110910175A (en) Tourist ticket product portrait generation method
CN116645129A (en) Manufacturing resource recommendation method based on knowledge graph
CN109101487A (en) Conversational character differentiating method, device, terminal device and storage medium
CN110532374B (en) Insurance information processing method and device
CN111966640A (en) Document file identification method and system
CN115482075A (en) Financial data anomaly analysis method and device, electronic equipment and storage medium
CN109993381B (en) Demand management application method, device, equipment and medium based on knowledge graph
CN115952770A (en) Data standardization processing method and device, electronic equipment and storage medium
CN113657437B (en) Power grid overhaul alarm confirmation method and system
CN114037154A (en) Method and system for predicting scientific and technological achievement number and theme based on attention characteristics
CN111798217A (en) Data analysis system and method
CN114338058A (en) Information processing method, device and storage medium
CN115080732A (en) Complaint work order processing method and device, electronic equipment and storage medium
CN114065934A (en) Method and system for constructing semantic knowledge base in environmental impact evaluation field
CN110569435A (en) Intelligent dual-ended recommendation engine system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant