CN110826320A - Sensitive data discovery method and system based on text recognition - Google Patents

Sensitive data discovery method and system based on text recognition

Info

Publication number
CN110826320A
CN110826320A (application CN201911195301.3A)
Authority
CN
China
Prior art keywords
model
data
training
text
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911195301.3A
Other languages
Chinese (zh)
Other versions
CN110826320B (en)
Inventor
殷钱安
梁淑云
刘胜
马影
陶景龙
王启凡
魏国富
徐�明
余贤喆
周晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd filed Critical Information and Data Security Solutions Co Ltd
Priority to CN201911195301.3A priority Critical patent/CN110826320B/en
Publication of CN110826320A publication Critical patent/CN110826320A/en
Application granted granted Critical
Publication of CN110826320B publication Critical patent/CN110826320B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention relates to a sensitive data discovery method based on text recognition, which comprises the following steps: S01, sample data extraction; S02, training sample construction: collect a text data set and construct training samples; S03, sample labeling model training: train a text labeling model on the training samples obtained in S02; S04, data feature construction; S05, training set construction: label the data set obtained in S04 to form a training set for building a classification judgment model; S06, classification judgment model construction: form a variable prediction model from the training set obtained in S05; and S07, model testing. Through the recognition of data variables, the method judges and recognizes sensitive data accurately and efficiently even when the data dictionary and matching rules are incomplete, and ensures the consistency of recognition and classification results.

Description

Sensitive data discovery method and system based on text recognition
Technical Field
The invention relates to the technical field of data security, in particular to a sensitive data discovery method and system based on text recognition.
Background
Data is a supporting foundation of enterprise operation and a core part of the enterprise information system; once a problem occurs in a data-related management or application system, the image and development of the enterprise are seriously affected, so data security has always been a subject of great concern for enterprises. At present, data protection schemes in practical application mainly comprise data isolation, permission settings, data desensitization and the like. The protection of sensitive data is particularly important, and the core of a sensitive data protection scheme is to select the sensitive data from massive data, that is, to identify it accurately.
At present, the identification of sensitive data mainly depends on a dictionary matching method and a manual identification method.
For example, application cn201910600215.x discloses a data center data checking system in which pattern matching formulas for sensitive data are defined manually and data is matched one by one; when a piece of data satisfies a formula, it is defined as sensitive data. The matching target can be metadata or data content. The manual identification method, in contrast, relies mainly on the personal experience of risk assessors and a predefined sensitive data dictionary. Based on predefined data models, such as database design models or file system organizational structures, risk assessors determine empirically which definitions in a model belong to sensitive data, and then discover and identify sensitive data from data samples.
The sensitive data dictionary matching method has the following defects: 1. low recognition accuracy: dictionary matching adopts patterned matching, so the completeness of the data dictionary determines the recognition accuracy of sensitive data, and when the dictionary is incomplete or wrongly built, accuracy drops; 2. interference with classification results: because the same data item can match several data dictionaries, and traditional data dictionaries apply no weighting, classification results are disturbed and become inaccurate.
The manual identification method has the following defects: 1. low identification speed: with manual processing, combing through a large amount of data takes far longer than machine recognition, and it places high demands on the professional quality of the personnel; 2. in text log data the similarity between texts is very high, so the dictionary matching method has low accuracy and weak recognition ability, and the matching rules require continuous optimization by related personnel as the data changes.
Disclosure of Invention
The invention aims to solve the technical problems of low speed and low precision of sensitive data identification in the prior art.
The invention solves the technical problems through the following technical means:
a sensitive data discovery method based on text recognition comprises the following steps:
s01, sample data extraction, extracting a standardized service data table within the specified time as original sample data;
s02, constructing training samples, collecting a text data set, labeling the keywords in the text data set by using a text labeling tool, and constructing a large number of training samples;
s03, training a sample labeling model, based on the training samples obtained in S02, training the text labeling model by using a bidirectional long short-term memory network and a conditional random field;
s04, data feature construction, namely constructing feature variables for describing data features by combining the text annotation model obtained in S03 on the basis of original sample data of S01;
s05, constructing a training set, defining a sensitive data label, and depicting the label of the data set obtained in the S04 to form the training set for constructing a classification judgment model;
s06, constructing a classification judgment model, and performing variable classification model training by using a catboost algorithm according to the training set obtained in the S05 to form a variable prediction model;
and S07, testing a model, namely describing and constructing the characteristic of unidentified data in the production environment based on the characteristic construction mode of S04 to form a prediction set, and judging whether the variable is a sensitive field and the type of the sensitive field by utilizing the classification judgment model obtained in S06.
According to the method, the sensitive data can be accurately and efficiently judged and recognized under the condition that the data dictionary and the matching rule are incomplete through the recognition of the data variable, and the consistency of recognition and classification results is ensured.
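As a rough orientation, the steps S01 to S07 above can be sketched as a small pipeline. The sketch below is illustrative only: every function name is a hypothetical stand-in, the features are a tiny subset of those described in S04, and the classifier is a trivial rule standing in for the CatBoost model of S06.

```python
# Hypothetical sketch of the S01-S07 pipeline; names and logic are
# illustrative stand-ins, not the patent's actual implementation.

def extract_samples(tables):
    """S01: flatten standardized business-data tables into raw sample values."""
    return [value for table in tables for value in table]

def build_features(value):
    """S04: a couple of simple per-value features (the patent uses many more)."""
    return {"length": len(value), "is_numeric": value.isdigit()}

def train_classifier(features, labels):
    """S05-S06: trivial learned rule standing in for the CatBoost classifier."""
    numeric_is_sensitive = any(f["is_numeric"] and y for f, y in zip(features, labels))
    return lambda f: f["is_numeric"] and numeric_is_sensitive

def discover_sensitive(tables, labelled):
    """S07: score unidentified production data with the trained model."""
    feats = [build_features(v) for v, _ in labelled]
    labels = [y for _, y in labelled]
    model = train_classifier(feats, labels)
    return [v for v in extract_samples(tables) if model(build_features(v))]
```

With labelled samples where the all-digit values (e.g. phone numbers) are marked sensitive, discover_sensitive flags the all-digit values found in the production tables.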
Preferably, in S02, the key words in the text data set are labeled by using a BIO labeling method: labeling each element as "B-X", "I-X", or "O"; wherein "X" indicates that the annotation element belongs to the type, "B-X" indicates that the fragment in which the element is located belongs to the type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to the type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type.
Preferably, in S03, the text annotation model includes a representation layer, a Bi-LSTM layer, and a CRF layer, wherein
Presentation layer: each unit in the sentence represents a vector formed by embedding characters or words;
Bi-LSTM layer: using the word embedding or word embedding vector obtained by the presentation layer as the input of each time step of the bidirectional LSTM; outputting respective scores of all labels of each word of the sentence through bidirectional LSTM layer training;
CRF layer: the layer uses the output of the Bi-LSTM layer, namely the emission probability matrix of all labels of each word, and the transition probability matrix calculated based on the text as the parameters of the original CRF model, finally obtains the probability of the label sequence, and selects the label class with the maximum probability value, namely the category of each word.
Preferably, in S06, the variable classification model is trained by using the CatBoost algorithm, specifically:
The gradient boosting tree model is represented as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ its parameters, and $M$ the number of trees;

firstly, a forward stagewise algorithm is adopted to determine the initial boosting tree $f_0(x) = 0$, and the model at step $m$ is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

the parameters of the next tree are determined by empirical risk minimization:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$
GBDT handles categorical variables with a greedy target-based statistics method, replacing each categorical value with the average of the target labels of the records sharing that value:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}$$

where $n$ denotes the number of samples, $x_{i,k}$ the value of the $k$-th feature of the $i$-th record, and $Y_i$ the target label of the $i$-th record.
Preferably, CatBoost improves greedy TBS by adding a prior distribution term to reduce the influence of noise and low-frequency data on the data distribution:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j + a \cdot P}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] + a}$$

where $P$ is the added prior term and $a$ is a weight coefficient, usually greater than 0.
The invention also provides a sensitive data discovery system based on text recognition, which comprises
The sample data extraction module extracts a standardized service data table within specified time as original sample data;
a training sample module is constructed, a text data set is collected, keywords in the text data set are labeled by a text labeling tool, and a large number of training samples are constructed;
the training sample labeling model module is used for training a text labeling model by utilizing a bidirectional long and short memory network and a conditional random field based on the obtained training sample;
the data characteristic construction module is used for constructing characteristic variables for describing data characteristics by combining the obtained text labeling model on the basis of original sample data;
the training set construction module is used for defining a sensitive data label, and performing label portrayal on the obtained data set to form a training set for constructing a classification judgment model;
constructing a classification judgment model module, and performing variable classification model training by using a catboost algorithm according to the obtained training set to form a variable prediction model;
and the model testing module is used for describing and constructing the characteristics of the unidentified data based on the characteristic construction mode of the data characteristic construction module to form a prediction set, and judging whether the variable is the sensitive field and the type of the sensitive field by using the obtained classification judgment model.
Preferably, in the building of the training sample module, a BIO labeling method is adopted to label the keywords in the text data set: labeling each element as "B-X", "I-X", or "O"; wherein "X" indicates that the annotation element belongs to the type, "B-X" indicates that the fragment in which the element is located belongs to the type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to the type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type.
Preferably, in the training sample labeling model module, the text labeling model comprises a representation layer, a Bi-LSTM layer and a CRF layer, wherein
Presentation layer: each unit in the sentence represents a vector formed by embedding characters or words;
Bi-LSTM layer: using the word embedding or word embedding vector obtained by the presentation layer as the input of each time step of the bidirectional LSTM; outputting respective scores of all labels of each word of the sentence through bidirectional LSTM layer training;
CRF layer: the layer uses the output of the Bi-LSTM layer, namely the emission probability matrix of all labels of each word, and the transition probability matrix calculated based on the text as the parameters of the original CRF model, finally obtains the probability of the label sequence, and selects the label class with the maximum probability value, namely the category of each word.
Preferably, in the classification judgment model construction module, the variable classification model is trained by using the CatBoost algorithm, specifically:
The gradient boosting tree model is represented as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ its parameters, and $M$ the number of trees;

firstly, a forward stagewise algorithm is adopted to determine the initial boosting tree $f_0(x) = 0$, and the model at step $m$ is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

the parameters of the next tree are determined by empirical risk minimization:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$
GBDT handles categorical variables with a greedy target-based statistics method, replacing each categorical value with the average of the target labels of the records sharing that value:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}$$

where $n$ denotes the number of samples, $x_{i,k}$ the value of the $k$-th feature of the $i$-th record, and $Y_i$ the target label of the $i$-th record.
Preferably, CatBoost improves greedy TBS by adding a prior distribution term:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j + a \cdot P}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] + a}$$

where $P$ is the added prior term and $a$ is a weight coefficient greater than 0.
The invention has the advantage that, under conditions where the data dictionary and matching rules are incomplete and text fields are difficult to distinguish by rules alone, sensitive data can still be judged and recognized accurately through the recognition of data variables, avoiding the interference among classification results that makes rule-based classification inaccurate.
Drawings
FIG. 1 is a block diagram showing the flow of a method in embodiment 1 of the present invention;
fig. 2 is a flow chart of a process executed by the method in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
The embodiment provides a sensitive data discovery method based on text recognition, which specifically includes the following steps, as shown in fig. 1:
s01: sample data extraction
A standardized service data table within a specified time period (day/month) is extracted as the original sample data.
S02: text labeling process
A large amount of text corpora is collected, and keywords in the corpora are manually labeled with the BIO labeling method using a text labeling tool, constructing a large number of training samples.
BIO labeling: each element is labeled "B-X", "I-X", or "O". Wherein "X" indicates that the annotation element belongs to the type, "B-X" indicates that the fragment in which the element is located belongs to the type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to the type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type.
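For illustration, BIO tags can be turned back into labelled spans with a few lines of Python; the tag set ("PER", "LOC") here is an arbitrary example, not the patent's actual label inventory.

```python
# Recover (entity_type, text) spans from a BIO-tagged token sequence.
# Illustrative only; the "PER"/"LOC" types are example labels.

def bio_spans(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # "B-X": a fragment of type X begins
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)       # "I-X": continue the current fragment
        else:                              # "O" (or an inconsistent tag) ends it
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(t, "".join(parts)) for t, parts in spans]
```

For example, the tokens of "张三在北京" tagged as B-PER, I-PER, O, B-LOC, I-LOC yield the person span "张三" and the location span "北京".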
S03: training text label model
According to the labeled text data obtained in S02, model training is performed using a bidirectional Long Short-Term Memory (Bi-LSTM) network and a Conditional Random Field (CRF). The overall model structure is divided into a representation layer, a Bi-LSTM layer and a CRF layer.
First layer (representation layer): each unit in a sentence is represented as a vector of character embeddings or word embeddings, where the character embeddings are initialized randomly and the word embeddings are obtained by training on the data. All embeddings are adjusted toward their optimum during training.
Second layer (Bi-LSTM layer): the character or word embedding vectors from the first layer are used as the input of each time step of the bidirectional LSTM. Through Bi-LSTM layer training, a score for every label of each word in the sentence is output.
Third layer (CRF layer): the layer uses the respective scores of all labels of each word output by the Bi-LSTM layer, namely an emission probability matrix and a transition probability matrix calculated based on the text as parameters of an original CRF model, finally obtains the probability of a label sequence, and selects the label class with the maximum probability value, namely the category of each word.
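The CRF layer's decoding step can be illustrated with a minimal Viterbi search: the per-token label scores play the role of the Bi-LSTM emission matrix, and the transition matrix scores label-to-label moves. This is a bare sketch with toy scores, not the patent's trained model.

```python
# Minimal Viterbi decoding over emission and transition scores, as the
# CRF layer does to select the highest-probability label sequence.
# Scores are illustrative; a real CRF works with learned parameters.

def viterbi(emissions, transitions):
    """emissions: list of per-label scores per token (the Bi-LSTM output);
    transitions[i][j]: score of moving from label i to label j.
    Returns the best-scoring label index sequence."""
    n_labels = len(emissions[0])
    score = list(emissions[0])          # best score ending in each label
    back = []                           # backpointers per step
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score, back = new_score, back + [pointers]
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(back):     # walk backpointers to recover the path
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

With two labels, emission scores favoring label 0 then label 1, and neutral transitions, the decoder returns the sequence [0, 1].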
S04: data feature construction
Based on the original data obtained in S01, feature variables for describing features of the data are constructed, which mainly include two aspects of features:
the method comprises the following steps of firstly, analyzing type characteristics summarized based on the characteristics of data, wherein the type characteristics comprise data length, whether the data are numerical types, whether special characters are contained, non-numerical character proportion and the like;
secondly, text labeling is carried out on text type variable contents by using an S03 text labeling model, and corresponding variable characteristics are constructed according to the labeled text types, wherein the corresponding variable characteristics comprise the number of text participles, the number of labeling types of words contained in the text, the proportion of labeling types of words contained in the text (such as names of people, places, names of countries and organizations), the number of words with various lengths and the like.
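The first group of analysis-type features might be computed as below; the exact feature set and the definition of "special character" are assumptions for illustration (the second, annotation-derived group would additionally require the S03 tagger).

```python
import re

# Illustrative analysis-type features from S04: length, numeric flag,
# special-character flag, non-numeric ratio. The special-character class
# (anything outside digits, Latin letters, CJK) is an assumption.

def analysis_features(value: str) -> dict:
    non_numeric = sum(1 for c in value if not c.isdigit())
    return {
        "length": len(value),
        "is_numeric": value.isdigit(),
        "has_special": bool(re.search(r"[^0-9A-Za-z\u4e00-\u9fff]", value)),
        "non_numeric_ratio": non_numeric / len(value) if value else 0.0,
    }
```

For a value like "ID:123", this yields length 6, a special character (the colon), and a non-numeric ratio of 0.5.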
S05: training set sensitive class label extraction
Based on business experience and the sensitive data labels defined for the corresponding industry, the data set obtained in S04 is labeled to form a training set for constructing the classification model.
S06: constructing a classification judgment model
Variable classification model training is performed with the CatBoost algorithm on the training set constructed in S05 to form a variable prediction model, which is saved for convenient real-time invocation. The advantage of the CatBoost algorithm is that it optimizes the handling of categorical variables on the basis of the gradient boosting decision tree (GBDT).
The gradient boosting tree model is represented as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ its parameters, and $M$ the number of trees.

First, a forward stagewise algorithm is adopted to determine the initial boosting tree $f_0(x) = 0$; the model at step $m$ is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

The parameters of the next tree are determined by empirical risk minimization:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$
GBDT handles categorical variables with a greedy target-based statistics method (greedy TBS), replacing each categorical value with the average of the corresponding target labels:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}$$
The greedy TBS method has an obvious drawback: the features generally contain more information than the labels, and if label averages are forced to represent feature values, a conditional shift arises when the structure or distribution of the training and test sets differ, causing overfitting and a poor classification effect.
CatBoost improves greedy TBS by adding a prior distribution term to reduce the influence of noise and low-frequency data on the data distribution:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j + a \cdot P}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] + a}$$

where $n$ denotes the number of samples, $x_{i,k}$ the value of the $k$-th feature of the $i$-th record, $Y_i$ the target label of the $i$-th record, $P$ the added prior term, and $a$ a weight coefficient, usually greater than 0. For features with few categories, adding the prior term also reduces noisy data.
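The prior-smoothed greedy TBS substitution described above can be reproduced in a few lines. This is an illustrative batch version of the formula; CatBoost itself applies ordered variants of these target statistics.

```python
from collections import defaultdict

# Prior-smoothed greedy target-based statistics:
#   x_hat = (sum of labels over matching rows + a * P) / (match count + a)
# Illustrative batch implementation of the formula above.

def greedy_tbs(values, targets, prior=0.5, a=1.0):
    total, count = defaultdict(float), defaultdict(int)
    for v, y in zip(values, targets):
        total[v] += y
        count[v] += 1
    return [(total[v] + a * prior) / (count[v] + a) for v in values]
```

With values ["a", "a", "b"], targets [1, 0, 1], prior P = 0.5 and a = 1, the category "a" encodes to (1 + 0.5) / (2 + 1) = 0.5 and "b" to (1 + 0.5) / (1 + 1) = 0.75.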
S07: identifying data variables based on model
The features of unidentified data in the production environment are constructed in the feature construction manner of S04 to form a prediction set, and the classification model obtained in S06 is used to judge whether each variable is a sensitive field and, if so, the type of sensitive field.
Example 2
Correspondingly, the embodiment also provides a sensitive data discovery system based on text recognition, which is characterized in that: comprises that
The sample data extraction module extracts a standardized service data table within specified time as original sample data;
a training sample module is constructed, a text data set is collected, keywords in the text data set are labeled by a text labeling tool, and a large number of training samples are constructed; BIO labeling: each element is labeled "B-X", "I-X", or "O". Wherein "X" indicates that the annotation element belongs to the type, "B-X" indicates that the fragment in which the element is located belongs to the type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to the type X and the element is in the middle position of the fragment, and "O" indicates that the element does not belong to any type.
The training sample labeling model module is used for training a text labeling model by utilizing a bidirectional long short-term memory network and a conditional random field based on the obtained training sample. According to the labeled text data obtained by the training sample construction module, model training is performed using a bidirectional Long Short-Term Memory (Bi-LSTM) network and a Conditional Random Field (CRF). The overall model structure is divided into a representation layer, a Bi-LSTM layer and a CRF layer.
First layer (representation layer): each unit in a sentence is represented as a vector of character embeddings or word embeddings, where the character embeddings are initialized randomly and the word embeddings are obtained by training on the data. All embeddings are adjusted toward their optimum during training.
Second layer (Bi-LSTM layer): the character or word embedding vectors from the first layer are used as the input of each time step of the bidirectional LSTM. Through Bi-LSTM layer training, a score for every label of each word in the sentence is output.
Third layer (CRF layer): the layer uses the respective scores of all labels of each word output by the Bi-LSTM layer, namely an emission probability matrix and a transition probability matrix calculated based on the text as parameters of an original CRF model, finally obtains the probability of a label sequence, and selects the label class with the maximum probability value, namely the category of each word.
The data characteristic construction module is used for constructing characteristic variables for describing data characteristics by combining the obtained text labeling model on the basis of original sample data; in particular to
First, analysis-type features summarized from the characteristics of the data, including data length, whether the data is numeric, whether special characters are contained, the proportion of non-numeric characters, and the like;
Second, text labeling is performed on text-type variable contents with the text labeling model, and corresponding variable features are constructed according to the labeled text types, including the number of text tokens, the number of annotation types among the words in the text, the proportion of each annotation type (such as person names, place names, country names and organization names), the number of words of various lengths, and the like.
The training set construction module is used for defining a sensitive data label, and performing label portrayal on the obtained data set to form a training set for constructing a classification judgment model;
The classification judgment model construction module performs variable classification model training with the CatBoost algorithm on the obtained training set to form a variable prediction model, which is saved for real-time invocation. The advantage of the CatBoost algorithm is that it optimizes the handling of categorical variables on the basis of the gradient boosting decision tree (GBDT).
The gradient boosting tree model is represented as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ its parameters, and $M$ the number of trees.

First, a forward stagewise algorithm is adopted to determine the initial boosting tree $f_0(x) = 0$; the model at step $m$ is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

The parameters of the next tree are determined by empirical risk minimization:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$
GBDT handles categorical variables with a greedy target-based statistics method (greedy TBS), replacing each categorical value with the average of the corresponding target labels:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}$$
The greedy TBS method has an obvious drawback: the features generally contain more information than the labels, and if label averages are forced to represent feature values, a conditional shift arises when the structure or distribution of the training and test sets differ, causing overfitting and a poor classification effect.
CatBoost improves greedy TBS by adding a prior distribution term to reduce the influence of noise and low-frequency data on the data distribution:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j + a \cdot P}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] + a}$$

where $n$ denotes the number of samples, $x_{i,k}$ the value of the $k$-th feature of the $i$-th record, $Y_i$ the target label of the $i$-th record, $P$ the added prior term, and $a$ a weight coefficient, usually greater than 0. For features with few categories, adding the prior term also reduces noise data.
And the model testing module is used for describing and constructing the characteristics of the unidentified data based on the characteristic construction mode of the data characteristic construction module to form a prediction set, and judging whether the variable is the sensitive field and the type of the sensitive field by using the obtained classification judgment model.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A sensitive data discovery method based on text recognition is characterized by comprising the following steps: the method comprises the following steps:
s01, sample data extraction, extracting a standardized service data table within the specified time as original sample data;
s02, constructing a training sample, collecting a text data set, and marking the keywords in the text data set by using a text marking tool to construct the training sample;
s03, training a sample labeling model, based on the training samples obtained in S02, training the text labeling model by using a bidirectional long short-term memory network and a conditional random field;
s04, data feature construction, namely constructing feature variables for describing data features by combining the text annotation model obtained in S03 on the basis of original sample data of S01;
s05, constructing a training set, defining a sensitive data label, and depicting the label of the data set obtained in the S04 to form the training set for constructing a classification judgment model;
s06, constructing a classification judgment model, and performing variable classification model training by using a catboost algorithm according to the training set obtained in the S05 to form a variable prediction model;
and S07, testing a model, namely describing and constructing the characteristic of unidentified data in the production environment based on the characteristic construction mode of S04 to form a prediction set, and judging whether the variable is a sensitive field and the type of the sensitive field by utilizing the classification judgment model obtained in S06.
2. The sensitive data discovery method based on text recognition according to claim 1, wherein: in S02, the keywords in the text data set are labeled with the BIO scheme, each element being labeled "B-X", "I-X", or "O", where X denotes the type of the labeled element; "B-X" indicates that the fragment containing the element is of type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment containing the element is of type X and the element is inside the fragment, and "O" indicates that the element belongs to no type.
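As an illustration, the BIO scheme described in this claim can be decoded back into typed spans with a few lines of code. The sketch below is not part of the patent; the function name `bio_decode` and the sensitive-field type "PHONE" are hypothetical examples:

```python
def bio_decode(tokens, tags):
    """Group BIO-tagged tokens into (type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])      # start a new span of type tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)          # continue the open span
        else:                                 # "O" or an inconsistent I- tag closes the span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(t, "".join(parts)) for t, parts in spans]

tokens = ["1", "3", "8", "x", "a"]
tags   = ["B-PHONE", "I-PHONE", "I-PHONE", "O", "O"]
print(bio_decode(tokens, tags))  # [('PHONE', '138')]
```

Decoding the labels back into spans like this is what turns per-element tags into the keywords the later steps treat as candidate sensitive fields.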
3. The sensitive data discovery method based on text recognition according to claim 2, wherein: in S03, the text labeling model comprises a representation layer, a Bi-LSTM layer, and a CRF layer, wherein
representation layer: each unit of the sentence is represented by a character- or word-embedding vector;
Bi-LSTM layer: the character or word embedding vectors produced by the representation layer serve as the input of each time step of the bidirectional LSTM, and training of the bidirectional LSTM layer outputs a score for every label of every word in the sentence;
CRF layer: this layer takes the output of the Bi-LSTM layer, i.e. the emission probability matrix over all labels of each word, together with a transition probability matrix computed from the text, as the parameters of the original CRF model; it finally obtains the probability of each label sequence and selects the label sequence with the maximum probability, which gives the category of each word.
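Selecting the maximum-probability label sequence from emission and transition scores, as the CRF layer does, is typically done with Viterbi decoding. The following is a minimal sketch only; the emission and transition values are toy numbers, not the patent's trained probability matrices:

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-step label scores; transitions: (K, K) score for
    moving from label i to label j. Returns the best-scoring label sequence."""
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each label so far
    back = np.zeros((T, K), dtype=int)     # backpointers to the previous label
    for t in range(1, T):
        # cand[i, j] = score of being in label i at t-1 then label j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # follow backpointers to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 3 time steps, 2 labels.
em = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]])
tr = np.array([[0.5, -1.0], [-1.0, 0.5]])
print(viterbi(em, tr))  # [0, 0, 0]
```

In a real Bi-LSTM-CRF, `emissions` would be the Bi-LSTM's per-word label scores and `transitions` the learned CRF transition matrix.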
4. The sensitive data discovery method based on text recognition according to claim 1, wherein: in S06, the variable classification model is trained with the CatBoost algorithm; specifically:
the gradient boosting tree model is represented as an additive model over decision trees:
f_M(x) = Σ_{m=1}^{M} T(x; θ_m)
where T(x; θ_m) denotes a decision tree, θ_m its parameters, and M the number of trees;
a forward stagewise algorithm is adopted: the initial boosting tree is f_0(x) = 0, and the model at step m is
f_m(x) = f_{m-1}(x) + T(x; θ_m)
the parameters of the next tree are determined by empirical risk minimization:
θ̂_m = argmin_{θ_m} Σ_{i=1}^{n} L(y_i, f_{m-1}(x_i) + T(x_i; θ_m))
GBDT processes categorical variables with greedy target-based statistics, replacing each categorical value with the mean of the labels of the records taking that value:
x̂_{i,k} = ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] · Y_j ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] )
where n denotes the number of samples, x_{i,k} the value of the k-th feature of the i-th record, and Y_i the target label of the i-th record.
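The greedy target-based substitution above can be sketched in a few lines. This is an illustration under toy data, and `greedy_ts` is a hypothetical helper name, not part of the patent or of the CatBoost library:

```python
def greedy_ts(values, targets):
    """Replace each categorical value with the mean label over all records
    that share the value (greedy target-based statistics, no prior)."""
    sums, counts = {}, {}
    for v, y in zip(values, targets):
        sums[v] = sums.get(v, 0.0) + y
        counts[v] = counts.get(v, 0) + 1
    return [sums[v] / counts[v] for v in values]

# Toy categorical feature with a binary target.
x = ["red", "blue", "red", "red"]
y = [1, 0, 0, 1]
print(greedy_ts(x, y))  # "red" -> 2/3, "blue" -> 0.0
```

Each category is thus mapped to a single numeric statistic, which is what lets a gradient boosting tree split on an originally categorical column.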
5. The sensitive data discovery method based on text recognition according to claim 4, wherein: CatBoost improves Greedy TBS by adding a prior distribution term, with the expression:
x̂_{i,k} = ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] · Y_j + a·P ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] + a )
where P is the added prior term and a is a weighting factor greater than 0.
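The prior term changes the substitution to (sum + a·P) / (count + a), which keeps statistics for rare category values from collapsing to noisy raw label means. A minimal sketch with toy data; `greedy_ts_prior` and the default values of `p` and `a` are hypothetical:

```python
def greedy_ts_prior(values, targets, p=0.5, a=1.0):
    """Greedy target statistics with a prior: (sum + a*p) / (count + a)."""
    sums, counts = {}, {}
    for v, y in zip(values, targets):
        sums[v] = sums.get(v, 0.0) + y
        counts[v] = counts.get(v, 0) + 1
    return [(sums[v] + a * p) / (counts[v] + a) for v in values]

x = ["red", "blue", "red", "red"]
y = [1, 0, 0, 1]
print(greedy_ts_prior(x, y))  # "red" -> (2 + 0.5)/4 = 0.625, "blue" -> 0.5/2 = 0.25
```

Compared with the unsmoothed statistic, "blue" (seen once, label 0) is pulled from 0.0 toward the prior 0.5, which is the stated purpose of the weighting factor a.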
6. A sensitive data discovery system based on text recognition, characterized by comprising:
a sample data extraction module, which extracts a standardized service data table within a specified time window as original sample data;
a training sample construction module, which collects a text data set and labels the keywords in the text data set with a text labeling tool to construct a large number of training samples;
a sample labeling model training module, which trains a text labeling model with a bidirectional long short-term memory network and a conditional random field based on the obtained training samples;
a data feature construction module, which constructs feature variables describing the data features on the basis of the original sample data with the obtained text labeling model;
a training set construction module, which defines sensitive data labels and labels the obtained data set to form the training set for building the classification judgment model;
a classification judgment model construction module, which trains a variable classification model on the obtained training set with the CatBoost algorithm to form a variable prediction model;
and a model testing module, which constructs features for unidentified data following the feature construction scheme of the data feature construction module to form a prediction set, and uses the obtained classification judgment model to judge whether each variable is a sensitive field and, if so, its sensitive-field type.
7. The sensitive data discovery system based on text recognition of claim 6, wherein: in the training sample construction module, the keywords in the text data set are labeled with the BIO scheme, each element being labeled "B-X", "I-X", or "O", where X denotes the type of the labeled element; "B-X" indicates that the fragment containing the element is of type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment containing the element is of type X and the element is inside the fragment, and "O" indicates that the element belongs to no type.
8. The sensitive data discovery system based on text recognition of claim 7, wherein: in the sample labeling model training module, the text labeling model comprises a representation layer, a Bi-LSTM layer, and a CRF layer, wherein
representation layer: each unit of the sentence is represented by a character- or word-embedding vector; the character embeddings are initialized randomly and the word embeddings are obtained through data training, and all embeddings are adjusted toward an optimum during the training process;
Bi-LSTM layer: the character or word embedding vectors produced by the representation layer serve as the input of each time step of the bidirectional LSTM, and training of the bidirectional LSTM layer outputs a score for every label of every word in the sentence;
CRF layer: this layer takes the output of the Bi-LSTM layer, i.e. the emission probability matrix over all labels of each word, together with a transition probability matrix computed from the text, as the parameters of the original CRF model; it finally obtains the probability of each label sequence and selects the label sequence with the maximum probability, which gives the category of each word.
9. The sensitive data discovery system based on text recognition of claim 6, wherein: in the classification judgment model construction module, the variable classification model is trained with the CatBoost algorithm; specifically:
the gradient boosting tree model is represented as an additive model over decision trees:
f_M(x) = Σ_{m=1}^{M} T(x; θ_m)
where T(x; θ_m) denotes a decision tree, θ_m its parameters, and M the number of trees;
a forward stagewise algorithm is adopted: the initial boosting tree is f_0(x) = 0, and the model at step m is
f_m(x) = f_{m-1}(x) + T(x; θ_m)
the parameters of the next tree are determined by empirical risk minimization:
θ̂_m = argmin_{θ_m} Σ_{i=1}^{n} L(y_i, f_{m-1}(x_i) + T(x_i; θ_m))
GBDT processes categorical variables with greedy target-based statistics, replacing each categorical value with the mean of the labels of the records taking that value:
x̂_{i,k} = ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] · Y_j ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] )
where n denotes the number of samples, x_{i,k} the value of the k-th feature of the i-th record, and Y_i the target label of the i-th record.
10. The sensitive data discovery system based on text recognition according to claim 9, wherein: CatBoost improves Greedy TBS by adding a prior distribution term, with the expression:
x̂_{i,k} = ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] · Y_j + a·P ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{i,k}] + a )
where P is the added prior term and a is a weighting factor greater than 0.
CN201911195301.3A 2019-11-28 2019-11-28 Sensitive data discovery method and system based on text recognition Active CN110826320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911195301.3A CN110826320B (en) 2019-11-28 2019-11-28 Sensitive data discovery method and system based on text recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911195301.3A CN110826320B (en) 2019-11-28 2019-11-28 Sensitive data discovery method and system based on text recognition

Publications (2)

Publication Number Publication Date
CN110826320A true CN110826320A (en) 2020-02-21
CN110826320B CN110826320B (en) 2023-10-13

Family

ID=69543062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911195301.3A Active CN110826320B (en) 2019-11-28 2019-11-28 Sensitive data discovery method and system based on text recognition

Country Status (1)

Country Link
CN (1) CN110826320B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368527A (en) * 2020-02-28 2020-07-03 上海汇航捷讯网络科技有限公司 Key value matching method
CN111522946A (en) * 2020-04-22 2020-08-11 成都中科云集信息技术有限公司 Paper quality evaluation method based on attention long-short term memory recurrent neural network
CN111582825A (en) * 2020-05-09 2020-08-25 焦点科技股份有限公司 Product information auditing method and system based on deep learning
CN111611312A (en) * 2020-05-19 2020-09-01 四川万网鑫成信息科技有限公司 Data desensitization method based on rule engine and block chain technology
CN111666414A (en) * 2020-06-12 2020-09-15 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111753547A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN111752729A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Method for constructing three-layer association relation model and three-layer relation identification method
CN112115264A (en) * 2020-09-14 2020-12-22 中国科学院计算技术研究所苏州智能计算产业技术研究院 Text classification model adjusting method facing data distribution change
CN112232073A (en) * 2020-11-06 2021-01-15 山西三友和智慧信息技术股份有限公司 Bi-LSTM neural network-based text normative detection system and detection method
CN112507376A (en) * 2020-12-01 2021-03-16 浙商银行股份有限公司 Sensitive data detection method and device based on machine learning
CN113378156A (en) * 2021-07-01 2021-09-10 上海观安信息技术股份有限公司 Malicious file detection method and system based on API
CN115757823A (en) * 2022-11-10 2023-03-07 魔方医药科技(苏州)有限公司 Data processing method and device, electronic equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1647970A1 (en) * 2004-10-15 2006-04-19 Microsoft Corporation Hidden conditional random field models for phonetic classification and speech recognition
CN104933443A (en) * 2015-06-26 2015-09-23 北京途美科技有限公司 Automatic identifying and classifying method for sensitive data
CN106886516A (en) * 2017-02-27 2017-06-23 竹间智能科技(上海)有限公司 The method and device of automatic identification statement relationship and entity
CN107818077A (en) * 2016-09-13 2018-03-20 北京金山云网络技术有限公司 A kind of sensitive content recognition methods and device
WO2018178162A1 (en) * 2017-03-28 2018-10-04 Koninklijke Philips N.V. Method and apparatus for intra- and inter-platform information transformation and reuse in predictive analytics and pattern recognition
CN108845988A (en) * 2018-06-07 2018-11-20 苏州大学 A kind of entity recognition method, device, equipment and computer readable storage medium
CN108920460A (en) * 2018-06-26 2018-11-30 武大吉奥信息技术有限公司 A kind of training method and device of the multitask deep learning model of polymorphic type Entity recognition
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
US20190180195A1 (en) * 2015-01-23 2019-06-13 Conversica, Inc. Systems and methods for training machine learning models using active learning
WO2019129775A1 (en) * 2017-12-25 2019-07-04 Koninklijke Philips N.V. A hierarchical entity recognition and semantic modeling framework for information extraction
CN110263166A (en) * 2019-06-18 2019-09-20 北京海致星图科技有限公司 Public sentiment file classification method based on deep learning
WO2019184124A1 (en) * 2018-03-30 2019-10-03 平安科技(深圳)有限公司 Risk-control model training method, risk identification method and apparatus, and device and medium


Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
IEEE: "Topic modelling enriched LSTM models for the detection of novel and emerging named entities from social media", 2017 IEEE International Conference on Big Data (Big Data), 15 January 2018
Zhang Xiaohai et al.: "Named entity recognition for operational documents based on BI-LSTM-CRF", Journal of Information Engineering University, no. 04, 15 August 2019
Zhang Shujing et al.: "Implementation of a quality control system for meteorological warning information based on the Bi-LSTM-CRF algorithm", Computer and Modernization, no. 06, 14 June 2019
Jin Chen et al.: "Chinese word segmentation based on a bidirectional LSTM neural network model", Journal of Chinese Information Processing, no. 02, 15 February 2018
Chen Shimei et al.: "Recognition of Chinese negation information based on the BiLSTM-CRF model", Journal of Chinese Information Processing, no. 11, 15 November 2018

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368527A (en) * 2020-02-28 2020-07-03 上海汇航捷讯网络科技有限公司 Key value matching method
CN111368527B (en) * 2020-02-28 2023-06-20 上海汇航捷讯网络科技有限公司 Key value matching method
CN111522946A (en) * 2020-04-22 2020-08-11 成都中科云集信息技术有限公司 Paper quality evaluation method based on attention long-short term memory recurrent neural network
CN111582825B (en) * 2020-05-09 2021-02-12 焦点科技股份有限公司 Product information auditing method and system based on deep learning
CN111582825A (en) * 2020-05-09 2020-08-25 焦点科技股份有限公司 Product information auditing method and system based on deep learning
CN111611312A (en) * 2020-05-19 2020-09-01 四川万网鑫成信息科技有限公司 Data desensitization method based on rule engine and block chain technology
CN111666414A (en) * 2020-06-12 2020-09-15 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111666414B (en) * 2020-06-12 2023-10-17 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111753547A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN111752729B (en) * 2020-06-30 2023-06-27 上海观安信息技术股份有限公司 Method for constructing three-layer association relation model and three-layer relation identification method
CN111752729A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Method for constructing three-layer association relation model and three-layer relation identification method
CN111753547B (en) * 2020-06-30 2024-02-27 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN112115264A (en) * 2020-09-14 2020-12-22 中国科学院计算技术研究所苏州智能计算产业技术研究院 Text classification model adjusting method facing data distribution change
CN112115264B (en) * 2020-09-14 2024-03-22 中科苏州智能计算技术研究院 Text classification model adjustment method for data distribution change
CN112232073A (en) * 2020-11-06 2021-01-15 山西三友和智慧信息技术股份有限公司 Bi-LSTM neural network-based text normative detection system and detection method
CN112507376A (en) * 2020-12-01 2021-03-16 浙商银行股份有限公司 Sensitive data detection method and device based on machine learning
CN112507376B (en) * 2020-12-01 2024-01-05 浙商银行股份有限公司 Sensitive data detection method and device based on machine learning
CN113378156A (en) * 2021-07-01 2021-09-10 上海观安信息技术股份有限公司 Malicious file detection method and system based on API
CN115757823A (en) * 2022-11-10 2023-03-07 魔方医药科技(苏州)有限公司 Data processing method and device, electronic equipment and storage medium
CN115757823B (en) * 2022-11-10 2024-03-05 魔方医药科技(苏州)有限公司 Data processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110826320B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110826320A (en) Sensitive data discovery method and system based on text recognition
CN112613501A (en) Information auditing classification model construction method and information auditing method
CN109558541B (en) Information processing method and device and computer storage medium
CN109800354B (en) Resume modification intention identification method and system based on block chain storage
CN113535963B (en) Long text event extraction method and device, computer equipment and storage medium
CN107545038B (en) Text classification method and equipment
CN112016313B (en) Spoken language element recognition method and device and warning analysis system
CN110750978A (en) Emotional tendency analysis method and device, electronic equipment and storage medium
CN108363691A (en) A kind of field term identifying system and method for 95598 work order of electric power
CN111462752A Client intention identification method based on attention mechanism, feature embedding and BI-LSTM
CN110910175A (en) Tourist ticket product portrait generation method
CN116645129A (en) Manufacturing resource recommendation method based on knowledge graph
CN109101487A (en) Conversational character differentiating method, device, terminal device and storage medium
CN110532374B (en) Insurance information processing method and device
CN111966640A (en) Document file identification method and system
CN115482075A (en) Financial data anomaly analysis method and device, electronic equipment and storage medium
CN109993381B (en) Demand management application method, device, equipment and medium based on knowledge graph
CN115952770A (en) Data standardization processing method and device, electronic equipment and storage medium
CN113657437B (en) Power grid overhaul alarm confirmation method and system
CN114037154A (en) Method and system for predicting scientific and technological achievement number and theme based on attention characteristics
CN111798217A (en) Data analysis system and method
CN114338058A (en) Information processing method, device and storage medium
CN115080732A (en) Complaint work order processing method and device, electronic equipment and storage medium
CN114065934A (en) Method and system for constructing semantic knowledge base in environmental impact evaluation field
CN110569435A (en) Intelligent dual-ended recommendation engine system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant