CN113642030B - Sensitive data multi-layer identification method - Google Patents

Sensitive data multi-layer identification method

Info

Publication number
CN113642030B
Authority
CN
China
Prior art keywords
sensitive
data
rule
identification
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111194834.7A
Other languages
Chinese (zh)
Other versions
CN113642030A (en
Inventor
吕丹
洪俊鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Hongshu Technology Co ltd
Original Assignee
Guangdong Hongshu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Hongshu Technology Co ltd filed Critical Guangdong Hongshu Technology Co ltd
Priority to CN202111194834.7A priority Critical patent/CN113642030B/en
Publication of CN113642030A publication Critical patent/CN113642030A/en
Application granted granted Critical
Publication of CN113642030B publication Critical patent/CN113642030B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data

Abstract

The invention relates to a multi-layer identification method for sensitive data. A data unit sample of a field to be detected is acquired, and the sensitive classification rule matching each data unit is determined. Matching calculation is then performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined to obtain a first sensitive rule identification degree. The field to be detected is further loaded into a sensitive data identification model to obtain a second sensitive rule identification degree output by the model. Finally, the field to be detected is judged to belong to the type corresponding to the larger of the first and second sensitive rule identification degrees. In this way, the sensitive data identification model improves the identification accuracy for irregular fields to be detected.

Description

Sensitive data multi-layer identification method
Technical Field
The invention relates to the technical field of data security, in particular to a sensitive data multilayer identification method.
Background
With the rapid development of the internet, data security has attracted widespread public attention, and security incidents involving the leakage of personal and sensitive information may lead to serious cybercrime. Conventional sensitive data discovery techniques have a particularly low recognition rate for non-standard sensitive data, so omissions easily create a risk of sensitive data leakage.
Conventional sensitive data discovery techniques identify and locate sensitive data by technical means such as regular expression matching, keyword code table mapping, data type definition discrimination, and data feature calculation. These means can discover sensitive data accurately only when the data quality is high. In some enterprises, however, the data acquisition process is not standardized and the data quality is poor. For example, a customer address field may contain special characters, lack key identifying information such as province, city, and district, or hold non-address data. In such cases the conventional means have low identification accuracy and cannot meet the enterprise's requirements for accurate sensitive data discovery. Excessive manual intervention raises the enterprise's production cost, and omissions in visual inspection can indirectly leak users' private data.
It can be seen that the above drawbacks exist with conventional sensitive data discovery techniques.
Disclosure of Invention
Based on this, it is necessary to provide a sensitive data multi-layer identification method that addresses the defects of the conventional sensitive data discovery techniques.
A sensitive data multi-layer identification method, comprising the steps of:
acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit samples comprise a plurality of data units;
performing matching calculation on the data units according to the sensitive classification rule;
determining the ratio of the number of matched data units to the total number of data unit samples according to matching calculation to obtain the identification degree of the first sensitive rule;
and when the identification degree of the first sensitive rule is greater than a preset sensitive threshold value, judging that the field to be detected belongs to the corresponding type of the sensitive classification rule.
According to the above sensitive data multi-layer identification method, after a data unit sample of the field to be detected is acquired and the sensitive classification rule matching each data unit is determined, matching calculation is performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined to obtain the first sensitive rule identification degree. When the first sensitive rule identification degree is greater than a preset sensitive threshold, the field to be detected is judged to belong to the type corresponding to the sensitive classification rule. In this way, the sensitive data type of the field to be detected is identified automatically through rule matching of all data units in the field. The preset sensitive threshold also improves the accuracy of taking the type corresponding to the sensitive classification rule as the identification result.
In one embodiment, the process of determining the sensitive classification rule matching each data unit includes the steps of:
performing regular expression calculation on the data unit to determine the sensitive rule matching the data unit.
In one embodiment, the process of performing matching computation on data units according to sensitive classification rules includes the steps of:
and when the sensitive classification rule has a corresponding sensitive data characteristic code table, performing matching calculation on the character string characteristic identification of the data unit according to the sensitive data characteristic code table.
In one embodiment, the process of performing matching computation on data units according to sensitive classification rules includes the steps of:
and when the sensitive classification rule needs to be strongly checked, performing matching calculation on the data unit according to the data rule corresponding to the strong check.
A sensitive data multi-layer identification method, comprising the steps of:
acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit samples comprise a plurality of data units;
performing matching calculation on the data units according to the sensitive classification rule;
determining the ratio of the number of matched data units to the total number of data unit samples according to matching calculation to obtain the identification degree of the first sensitive rule;
loading the field to be detected into the sensitive data identification model to obtain a second sensitive rule identification degree output by the sensitive data identification model;
and judging that the field to be detected belongs to the corresponding type of the larger one of the first sensitive rule identification degree and the second sensitive rule identification degree.
According to the above sensitive data multi-layer identification method, after a data unit sample of the field to be detected is acquired and the sensitive classification rule matching each data unit is determined, matching calculation is performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined to obtain the first sensitive rule identification degree. The field to be detected is further loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model, and the field to be detected is finally judged to belong to the type corresponding to the larger of the first and second sensitive rule identification degrees. In this way, the sensitive data identification model improves the identification accuracy for irregular fields to be detected.
in one embodiment, the training process of the sensitive data recognition model comprises the following steps:
obtaining corpus data and stop words;
preprocessing the corpus data and stop words to obtain a preprocessing result;
and performing multiple times of model training according to the preprocessing result to obtain a sensitive data recognition model.
In one embodiment, the process of judging that the field to be detected belongs to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree includes the step of:
when the first sensitive rule identification degree is greater than a preset sensitive threshold and the second sensitive rule identification degree is greater than the preset sensitive threshold, judging that the field to be detected belongs to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree.
A sensitive data multi-layer identification method, comprising the steps of:
acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit samples comprise a plurality of data units;
when the sensitive classification rule matches the data dictionary, computing a third sensitive rule identification degree through the data dictionary; wherein the data dictionary is bound to each sensitive rule in advance;
when the sensitive classification rule is not matched with the data dictionary, performing matching calculation on the data unit according to the sensitive classification rule;
determining the ratio of the number of matched data units to the total number of data unit samples according to matching calculation to obtain the identification degree of the first sensitive rule;
loading the field to be detected into the sensitive data identification model to obtain a second sensitive rule identification degree output by the sensitive data identification model;
and judging the corresponding type of the field to be detected according to the greater one of the first sensitive rule identification degree and the second sensitive rule identification degree and the third sensitive rule identification degree.
According to the above sensitive data multi-layer identification method, after a data unit sample of the field to be detected is acquired, the sensitive classification rule matching each data unit is determined. When the sensitive classification rule matches the data dictionary, the third sensitive rule identification degree is computed through the data dictionary. The ratio of the number of matched data units to the total number of data unit samples is determined according to the matching calculation to obtain the first sensitive rule identification degree, and the field to be detected is loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model. The type of the field to be detected is then judged according to the greater of the first and second sensitive rule identification degrees, together with the third sensitive rule identification degree. In this way, the preset attributes of the data dictionary make it convenient to identify the fields to be detected of various users, overcome the limitations of AI identification by the sensitive data identification model, and improve the identification range and accuracy.
Drawings
FIG. 1 is a flowchart of a sensitive data multi-layer identification method according to one embodiment;
FIG. 2 is a flowchart of a sensitive data multi-layer identification method according to another embodiment;
FIG. 3 is a flowchart of another sensitive data multi-layer identification method according to one embodiment;
FIG. 4 is a flowchart of a sensitive data identification model training method according to one embodiment;
FIG. 5 is a flowchart of a sensitive data multi-layer identification method according to an embodiment;
FIG. 6 is a block diagram of a sensitive data multi-layer identification device according to one embodiment;
FIG. 7 is a schematic diagram of the internal structure of a computer according to another embodiment.
Detailed Description
For better understanding of the objects, technical solutions and effects of the present invention, the present invention will be further explained with reference to the accompanying drawings and examples. Meanwhile, the following described examples are only for explaining the present invention, and are not intended to limit the present invention.
The embodiment of the invention provides a sensitive data multilayer identification method.
FIG. 1 is a flowchart of a sensitive data multi-layer identification method according to one embodiment. As shown in FIG. 1, the method of this embodiment includes steps S100 to S103:
S100, acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit samples comprise a plurality of data units;
S101, performing matching calculation on the data units according to the sensitive classification rule;
S102, determining the ratio of the number of matched data units to the total number of data unit samples according to the matching calculation, to obtain a first sensitive rule identification degree;
S103, when the first sensitive rule identification degree is greater than a preset sensitive threshold, judging that the field to be detected belongs to the type corresponding to the sensitive classification rule.
The field to be detected comprises a plurality of data units, and the data units form the data unit sample. A data unit is one value of a table field; a plurality of such values make up the table field, which is the field to be detected.
The sensitive classification rule is a classification rule for the customized sensitive information supported by a Data Security Center (DSC), that is, one of the sensitive data identification rules. The data security center uses the sensitive data identification rules to identify sensitive data in files or tables and raise alarms, and a user can define and manage the sensitive identification rules according to business needs to determine the various sensitive rules. For example, the sensitive rules include a name sensitive rule and a telephone number sensitive rule: the name sensitive rule comprises a regular expression and common-name identification; the telephone number sensitive rule comprises a regular expression, identification of the special service number formed by the first three digits, and the like.
Therefore, the sensitive data identification rule matching a data unit, that is, the sensitive classification rule, can be determined from the rules customized in the DSC in advance. The definition of a sensitive data identification rule, and its matching against data units, can be implemented with keywords, regular expressions, or built-in algorithms. For example, when sensitive data identification rules for types such as mobile phone number and identification number have been configured in the DSC in advance, matching the data units may determine that the corresponding sensitive classification rule includes the sensitive data identification rules for those types.
Classifying the data of the field to be detected based on the sensitive data identification rules includes sensitivity level grading and rule classification of the field. In the DSC system, the sensitivity levels include an unknown sensitivity level, a low sensitivity level, a medium sensitivity level, a high sensitivity level, and the like; the rule classification includes personal sensitive information, device sensitive information, key sensitive information, sensitive picture information, enterprise sensitive information, location sensitive information, general sensitive information, and the like.
For example, when a sensitive data identification rule is defined, its content is defined according to the rule classification so as to match the data units of the field to be detected.
When the sensitive classification rule is identified by keyword matching, the data unit is compared against the sensitive data keywords entered in advance in the keyword text box to determine the sensitive classification rule matching the data unit.
In one embodiment, FIG. 2 is a flowchart of a sensitive data multi-layer identification method according to another embodiment. As shown in FIG. 2, the process of determining the sensitive classification rule matching each data unit in step S100 includes step S200:
S200, performing regular expression calculation on the data units to determine the sensitive rules matching the data units.
The sensitive rule matching a data unit, that is, the sensitive classification rule, is determined through the regular expression. For example, when the sensitive data identification rules are preset, regular expressions of the corresponding types are entered in advance in the rule definition text box. During the matching calculation, the data unit is evaluated against the regular expression to determine the sensitive classification rule.
In one embodiment, the regular expression calculation is performed by regular expression matching: the system matches the data unit against the regular expression configured for the sensitive rule, and if the match succeeds, the data unit can be preliminarily judged to satisfy the matched sensitive rule.
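As an illustrative sketch only, the regular expression matching of step S200 might look as follows. The rule names and patterns in `SENSITIVE_RULES` are hypothetical simplifications introduced here for illustration; the actual rules are those configured in advance in the rule definition text box.

```python
import re

# Hypothetical rule table: rule name -> regular expression configured for that rule.
# The patterns are simplified illustrations, not the rules used by the invention.
SENSITIVE_RULES = {
    "mobile_phone": re.compile(r"1[3-9]\d{9}"),
    "id_card": re.compile(r"\d{17}[\dXx]"),
}


def match_sensitive_rules(data_unit: str) -> list[str]:
    """Return the names of all rules whose regular expression fully matches the data unit."""
    value = data_unit.strip()
    return [name for name, pattern in SENSITIVE_RULES.items() if pattern.fullmatch(value)]
```

A successful match here is only a preliminary judgment; the later layers (feature code table and strong check) refine it.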
In this way, the data format of the data units in the field to be detected is determined through the matching of the sensitive classification rules, which completes the first-layer identification of the sensitive data.
After step S100 has matched the sensitive classification rule, matching calculation is further performed on the data units according to the sensitive classification rule to determine the data features of the data units, which completes the second-layer identification of the sensitive data once the data format has been determined.
The data features of a data unit include character strings, data conversion algorithms, sensitive word fusion, and the like. Performing matching calculation on the data units according to the sensitive classification rule in step S101 includes performing feature matching calculation according to the sensitive classification rule to determine the data features corresponding to the data units.
In one embodiment, as shown in FIG. 2, the process of performing matching calculation on the data units according to the sensitive classification rule in step S101 includes step S201:
S201, when the sensitive classification rule has a corresponding sensitive data feature code table, performing matching calculation on the character string feature identification of the data unit according to the sensitive data feature code table.
When the sensitive data identification rules are preset, the data features are configured in advance and the sensitive data feature code table is determined. The sensitive data feature code table corresponds to the sensitive classification rule. Matching calculation is performed on the character string feature identification of the data unit against the character strings stored in advance in the sensitive data feature code table, and the data features of the data unit are determined by character matching.
For example, the sensitive data feature code table includes a common name code table, a special service number table of the first three digits of the telephone number, a regional code table of the first six digits of the identity card, an address area code table, and the like, and the system performs matching verification on the data through the feature code table.
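A minimal sketch of the feature code table check of step S201 is given below for the telephone number case. The table contents in `PHONE_PREFIX_TABLE` and the 11-digit length test are assumptions made here for illustration; a real code table would be maintained by the system.

```python
# Hypothetical feature code table for the telephone number rule:
# the allowed first-three-digit prefixes. Entries are illustrative only.
PHONE_PREFIX_TABLE = {"130", "131", "138", "139", "186"}


def matches_phone_prefix(data_unit: str) -> bool:
    """Check the string feature (first three digits) of a data unit against the code table."""
    value = data_unit.strip()
    # Assumed format: exactly 11 digits, with the first three found in the code table.
    return len(value) == 11 and value.isdigit() and value[:3] in PHONE_PREFIX_TABLE
```

The same lookup pattern would apply to the other tables mentioned, such as the first six digits of an identity card against a regional code table.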
According to the data feature matching result, the proportion of data units in the field to be detected that satisfy the matching calculation is computed, that is, the ratio of the number of matched data units to the total number of data unit samples, and the first sensitive rule identification degree is determined from this ratio. In one embodiment, the first sensitive rule identification degree is the product of the ratio and a correction coefficient.
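The ratio described above can be sketched as follows. The function name, the `is_match` callback, and the default correction coefficient of 1.0 are hypothetical choices made for illustration; the text does not specify how the correction coefficient is chosen.

```python
from typing import Callable, Sequence


def rule_identification_degree(
    data_units: Sequence[str],
    is_match: Callable[[str], bool],
    correction: float = 1.0,
) -> float:
    """Ratio of matched data units to all sampled data units, times a correction coefficient."""
    if not data_units:
        # Assumption: an empty sample yields an identification degree of zero.
        return 0.0
    matched = sum(1 for unit in data_units if is_match(unit))
    return matched / len(data_units) * correction
```

For instance, with a sample of four data units of which two match, the identification degree is 0.5 before correction; the field would then be assigned the rule's type only if this value exceeds the preset sensitive threshold.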
In one embodiment, as shown in FIG. 2, the process of performing matching calculation on the data units according to the sensitive classification rule in step S101 further includes step S202:
S202, when the sensitive classification rule requires a strong check, performing matching calculation on the data unit according to the data rule corresponding to the strong check.
When the sensitive data identification rules are preset, the data rule of each sensitive data identification rule can be determined. When the sensitive classification rule requires a strong check, matching calculation is performed on the data unit according to the data rule corresponding to the strong check.
The following explains the data rule corresponding to the strong check with a specific example:
1. Identification number: the last digit is a check code, which must be calculated according to the national standard check rule.
2. Organization code: the national organization code consists of an eight-character body code and a one-character check code (a digit or a capital Latin letter). The check formula is as follows:
[Formula image not reproduced in the extracted text]
3. Bank card number: verified using the Luhn algorithm.
In one embodiment, the check code of the data unit is subjected to matching calculation through the data rule to complete strong check.
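Two of the strong checks listed above can be sketched as follows. The Luhn routine is the standard algorithm. For the identification number, the weights and check characters follow the commonly published rule of the national standard GB 11643-1999; the text itself does not list them, so they are stated here as an assumption, as are the function names.

```python
def luhn_valid(card_number: str) -> bool:
    """Luhn check: double every second digit from the right, sum all digits, test modulo 10."""
    if not card_number.isdigit():
        return False
    total = 0
    for index, char in enumerate(reversed(card_number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the doubled value
        total += digit
    return total % 10 == 0


# Weights and check characters as commonly published for GB 11643-1999 (assumption).
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def id_check_code_valid(id_number: str) -> bool:
    """Verify the 18th character of an identification number against the weighted sum mod 11."""
    if len(id_number) != 18 or not id_number[:17].isdigit():
        return False
    remainder = sum(int(d) * w for d, w in zip(id_number[:17], ID_WEIGHTS)) % 11
    return ID_CHECK_CHARS[remainder] == id_number[17].upper()
```

A data unit that passes the regular expression but fails such a check would not be counted as matched when the rule requires a strong check.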
After the field to be detected is judged to belong to the type corresponding to the sensitive classification rule, the field is classified and graded according to the sensitive data identification rule to which the sensitive classification rule belongs.
In one embodiment, the preset sensitive threshold ranges from 0 to 100%. The threshold can be set according to data quality: above 50% when the data quality is high, and below 50% when the data quality is low. As a preferred embodiment, the preset sensitive threshold is 50%.
According to the above sensitive data multi-layer identification method, after a data unit sample of the field to be detected is acquired and the sensitive classification rule matching each data unit is determined, matching calculation is performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined to obtain the first sensitive rule identification degree. When the first sensitive rule identification degree is greater than a preset sensitive threshold, the field to be detected is judged to belong to the type corresponding to the sensitive classification rule. In this way, the sensitive data type of the field to be detected is identified automatically through rule matching of all data units in the field. The preset sensitive threshold also improves the accuracy of taking the type corresponding to the sensitive classification rule as the identification result.
Further, the above sensitive data multi-layer identification method identifies sensitive data based on the data format, the data features, and the strong check of data rules. However, when the data of the field to be detected is disordered and has no obvious format or features, the identification accuracy of the method is low. Based on this, an embodiment of the invention further provides another sensitive data multi-layer identification method.
FIG. 3 is a flowchart of another sensitive data multi-layer identification method according to one embodiment. As shown in FIG. 3, the other sensitive data multi-layer identification method of this embodiment includes steps S300 to S304:
S300, acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit samples comprise a plurality of data units;
S301, performing matching calculation on the data units according to the sensitive classification rule;
S302, determining the ratio of the number of matched data units to the total number of data unit samples according to the matching calculation, to obtain a first sensitive rule identification degree;
S303, loading the field to be detected into the sensitive data identification model to obtain a second sensitive rule identification degree output by the sensitive data identification model;
S304, judging that the field to be detected belongs to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree.
The sensitive data identification model is trained in advance. FIG. 4 is a flowchart of a sensitive data identification model training method according to one embodiment. As shown in FIG. 4, the training process of the sensitive data identification model includes steps S400 to S402:
S400, obtaining corpus data and stop words;
S401, preprocessing the corpus data and stop words to obtain a preprocessing result;
S402, performing model training multiple times according to the preprocessing result to obtain the sensitive data identification model.
The corpus data and the stop words are obtained from the database, and basic data are provided for model training. In one embodiment, the corpus data includes extracted table field data.
In one embodiment, stop words include special characters, other character strings, etc. that are not relevant to the corpus data.
Through preprocessing, the corpus data and stop words are converted into parameter data adapted to the sensitive data identification model, for example vectorized data.
In one embodiment, the process of preprocessing the corpus data and stop words in step S401 to obtain a preprocessing result includes the step of: encapsulating the corpus data and stop words into parameters, and taking the parameters as the preprocessing result.
That is, the corpus data and stop words are encapsulated into parameters acceptable to the sensitive data identification model as the preprocessing result.
In one embodiment, the process of encapsulating the corpus data and stop words into parameters as the preprocessing result includes the steps of:
performing word segmentation processing on the corpus data to obtain a word segmentation list;
removing stop words of the word segmentation list to obtain a targeted word segmentation list;
and packaging the targeted word segmentation list into vectorized parameters as a preprocessing result.
Removing the stop words from the word segmentation list makes the subsequent model training more targeted.
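The three preprocessing steps above can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, `STOP_WORDS` is a hypothetical stop-word set (the method obtains stop words from a database), and bag-of-words counts over a fixed vocabulary stand in for whatever vectorized parameters the actual model accepts.

```python
from collections import Counter

# Hypothetical stop-word set; illustrative entries only.
STOP_WORDS = {",", ".", "#", "the", "of"}


def tokenize(text: str) -> list[str]:
    """Whitespace tokenization as a stand-in for a real word segmenter."""
    return text.lower().split()


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop stop words to obtain the targeted word segmentation list."""
    return [token for token in tokens if token not in STOP_WORDS]


def vectorize(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Bag-of-words counts over a fixed vocabulary (a simple substitute for model vectors)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def preprocess(text: str, vocabulary: list[str]) -> list[int]:
    """Segment, remove stop words, and encapsulate the result as a vector."""
    return vectorize(remove_stop_words(tokenize(text)), vocabulary)
```

With the vocabulary `["guangzhou", "tianhe", "district", "city"]`, the text "The city of Guangzhou # Tianhe district Guangzhou" is reduced to the tokens city, guangzhou, tianhe, district, guangzhou before being counted.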
The model training is performed over multiple generations. In one embodiment, the sensitive data identification model comprises a text classification model, such as a Doc2vec model or a word2vec model. As a preferred embodiment, the sensitive data identification model is a Doc2vec model.
In one embodiment, 10 to 20 rounds of model training are performed, generating 10 to 20 generations of the sensitive data identification model.
In one embodiment, the process of performing multiple model training according to the preprocessing result in step S402 includes the steps of:
performing more than 10 rounds of model training according to the preprocessing result.
More than 10 rounds of model training are performed on the preprocessing result to obtain a 10-generation sensitive data identification model.
After the sensitive data identification model is determined, the field to be detected is used as the sample data to be identified by the sensitive data identification model.
The sample data comprises a fixed quantity of data extracted from a given table field.
The sample data is preprocessed to obtain a sample processing result:
the sample data is encapsulated into parameters acceptable to the sensitive data identification model as the sample processing result.
In one embodiment, the process of preprocessing sample data to obtain a sample processing result includes the steps of:
encapsulating the sample data into parameters, and taking the parameters as the sample processing result.
Specifically, word segmentation is performed on the sample data to generate a word segmentation list, and the word segmentation list is encapsulated into vector parameters acceptable to the sensitive data identification model.
The sample processing result is loaded into the sensitive data identification model to obtain the recognition rate output by the model as the identification result.
In one embodiment, each sensitive rule corresponds to one sensitive data identification model. The system loads all the models, identifies the data unit samples, and outputs the matched model, from which the corresponding sensitive rule is obtained.
In one embodiment, the process of loading the sample processing result into the sensitive data identification model and obtaining the recognition rate output by the model as the identification result comprises the step of:
loading the sample processing result into the sensitive data identification model; when the recognition rate output by the model is greater than a preset sensitivity threshold, outputting the recognition rate as the identification result; otherwise, controlling the model to repeat the model operation.
The identification result is the second sensitive rule identification degree.
The recognition rate is the number of identified samples divided by the total number of samples, and the preset sensitivity threshold ranges from 70% to 90%. In a preferred embodiment, the preset sensitivity threshold is 80%: when the recognition rate output by the sensitive data identification model is greater than 80%, the recognition rate is output as the identification result; otherwise, the model is controlled to repeat the model operation.
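The threshold check can be sketched as a small loop. The function name `recognize` and the `max_runs` cap are assumptions: the text says the model operation is repeated when the rate does not exceed the threshold, but gives no stopping condition, so the cap is there only to keep the sketch from looping forever.

```python
from typing import Callable, List, Optional

def recognize(model: Callable[[List[List[str]]], float],
              sample: List[List[str]],
              threshold: float = 0.8,
              max_runs: int = 3) -> Optional[float]:
    """Return the recognition rate once it is strictly greater than the threshold.

    The rate is assumed to be identified samples / total samples.
    Returns None if no run exceeds the threshold within max_runs (an assumed cap).
    """
    for _ in range(max_runs):
        rate = model(sample)
        if rate > threshold:
            return rate
    return None
```

With the preferred threshold of 80%, a rate of exactly 0.8 does not pass, because the comparison in the text is "greater than".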
In one embodiment, the process in step S304 of determining that the field to be detected belongs to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree comprises the step of:
when the first sensitive rule identification degree and the second sensitive rule identification degree are both greater than the preset sensitivity threshold, determining that the field to be detected belongs to the type corresponding to the greater of the two.
That is, the comparison is performed when both identification degrees exceed the preset sensitivity threshold.
The field to be detected is determined to belong to the type corresponding to the greater of the first and second sensitive rule identification degrees, that is, to the corresponding sensitive data identification rule, so that the data can be classified and graded.
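The comparison above can be sketched as a single function. The name `classify_two_layer`, the type labels, and the tie-breaking choice are assumptions. This passage does not say what happens when only one degree exceeds the threshold, so the sketch returns `None` in that case.

```python
from typing import Optional

def classify_two_layer(first_degree: float, first_type: str,
                       second_degree: float, second_type: str,
                       threshold: float = 0.8) -> Optional[str]:
    """Pick the type of the larger degree when both degrees exceed the threshold."""
    if first_degree > threshold and second_degree > threshold:
        # Assumption: a tie is resolved in favour of the rule-based (first) result.
        return first_type if first_degree >= second_degree else second_type
    return None  # case not covered by this passage
```

For example, a rule-based degree of 0.85 for one type and a model-based degree of 0.92 for another would select the model-based type.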
In this further sensitive data multi-layer identification method, after the data unit samples of the field to be detected are obtained and the sensitive classification rule matched with each data unit is determined, matching calculation is performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree. The field to be detected is then loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model, and the field is finally determined to belong to the type corresponding to the greater of the first and second sensitive rule identification degrees. On this basis, the sensitive data identification model improves identification accuracy for fields to be detected that follow no regular pattern.
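The first sensitive rule identification degree is a plain ratio, which a short sketch makes concrete. The regular expression below is a hypothetical rule (an 11-digit mobile number beginning with 1) chosen for illustration; the patent does not give its rule patterns.

```python
import re
from typing import List

def first_recognition_degree(units: List[str], pattern: str) -> float:
    """Ratio of data units that fully match the rule's pattern to all sampled units."""
    if not units:
        return 0.0
    rule = re.compile(pattern)
    matched = sum(1 for u in units if rule.fullmatch(u.strip()))
    return matched / len(units)

# Hypothetical rule pattern, not taken from the patent.
MOBILE = r"1\d{10}"
degree = first_recognition_degree(
    ["13800138000", "13912345678", "n/a", "13700001111"], MOBILE
)
```

Here three of the four sampled units match, so the degree is 0.75, which would fall below a preset sensitivity threshold of 80%.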
Further, the embodiment of the invention also provides another sensitive data multilayer identification method.
Fig. 5 is a flowchart of a sensitive data multi-layer identification method according to yet another embodiment. As shown in Fig. 5, the method of this embodiment comprises steps S500 to S505:
S500, acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit samples comprise a plurality of data units;
S501, when the sensitive classification rule is matched with the data dictionary, counting a third sensitive rule identification degree through the data dictionary; the data dictionary is bound with each sensitive rule in advance;
S502, when the sensitive classification rule is not matched with the data dictionary, performing matching calculation on the data units according to the sensitive classification rule;
S503, determining the ratio of the number of matched data units to the total number of data unit samples according to the matching calculation, and obtaining a first sensitive rule identification degree;
S504, loading the field to be detected into the sensitive data identification model, and obtaining a second sensitive rule identification degree output by the sensitive data identification model;
S505, determining the type to which the field to be detected belongs according to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree, together with the third sensitive rule identification degree.
The matching calculation for the third sensitive rule identification degree is performed through a pre-established data dictionary. The data dictionary stores user-defined codes, which are used to calculate the third sensitive rule identification degree. On this basis, the data dictionary is bound to each sensitive rule in advance, and the various sensitive rules, including sensitive data identification rules, are determined through the data dictionary.
When the sensitive classification rule can be matched with the data dictionary, the third sensitive rule identification degree is counted through the data dictionary. In one embodiment, whether the sensitive classification rule matches the data dictionary is determined by data format matching, that is, against the data format records of the data dictionary.
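One way to picture the data dictionary is as a mapping from each sensitive rule to a data format record plus a set of user-defined codes. Everything in this sketch is an assumption made for illustration: the dictionary layout, the example rules and codes, and the choice to compute the third identification degree as the share of sampled units found among the codes.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical data dictionary: rule name -> (data format record, user-defined codes).
DATA_DICTIONARY: Dict[str, Tuple[str, Set[str]]] = {
    "gender_code": (r"[A-Z]", {"M", "F", "U"}),
    "currency_code": (r"[A-Z]{3}", {"CNY", "USD", "EUR"}),
}

def dictionary_rule_for(units: List[str]) -> Optional[str]:
    """Return the first rule whose format record matches every sampled unit."""
    for rule, (fmt, _codes) in DATA_DICTIONARY.items():
        if units and all(re.fullmatch(fmt, u) for u in units):
            return rule
    return None

def third_recognition_degree(units: List[str], rule: str) -> float:
    """Share of sampled units found among the user-defined codes bound to the rule."""
    if not units:
        return 0.0
    codes = DATA_DICTIONARY[rule][1]
    return sum(1 for u in units if u in codes) / len(units)
```

Separating the format check from the code lookup mirrors the text: format matching decides whether the dictionary applies at all, and the codes then drive the identification degree.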
In one embodiment, the process in step S505 of determining the type to which the field to be detected belongs according to the greater of the first and second sensitive rule identification degrees, together with the third sensitive rule identification degree, comprises the step of:
determining that the field to be detected belongs to the type corresponding to the larger of said greater one and the third sensitive rule identification degree.
That is, the first sensitive rule identification degree is compared with the second, and the greater of the two is then compared with the third sensitive rule identification degree to determine the maximum sensitive rule identification degree. The type of the corresponding sensitive rule is then looked up, the sensitive data identification rule is determined, and the data is classified and graded.
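The two-stage comparison can be written out directly. The function name, the layer labels, and the example types are hypothetical; on a tie, the earlier layer wins because Python's `max` returns the first maximal item, which is an assumption rather than something the text specifies.

```python
from typing import Dict, Tuple

def classify_three_layer(degrees: Dict[str, Tuple[float, str]]) -> Tuple[str, str, float]:
    """Pick the layer with the highest identification degree.

    degrees maps a layer name ("first", "second", "third") to
    (identification degree, sensitive type). Returns (layer, type, degree).
    """
    # Stage 1: the greater of the first and second identification degrees.
    greater = max(("first", "second"), key=lambda k: degrees[k][0])
    # Stage 2: compare that result with the third identification degree.
    winner = max((greater, "third"), key=lambda k: degrees[k][0])
    degree, sensitive_type = degrees[winner]
    return winner, sensitive_type, degree
```

The result is the same as taking the maximum of all three degrees; the two stages are kept only to follow the order of comparison given in the text.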
In the above further sensitive data multi-layer identification method, data unit samples of the field to be detected are obtained and the sensitive classification rule matched with each data unit is determined. When the sensitive classification rule matches the data dictionary, the third sensitive rule identification degree is counted through the data dictionary; the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree; and the field to be detected is loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model. The type to which the field to be detected belongs is then determined according to the greater of the first and second sensitive rule identification degrees, together with the third sensitive rule identification degree. On this basis, the preset attributes of the data dictionary make it convenient to identify the fields to be detected of various users, overcome the limitations of AI identification by the sensitive data identification model, and improve both the range and the accuracy of identification.
The embodiment of the invention also provides a sensitive data multilayer identification device.
Fig. 6 is a block diagram of a sensitive data multi-layer identification device according to an embodiment. As shown in Fig. 6, the device comprises:
the sensitive classification rule matching module 100, configured to obtain a data unit sample of a field to be detected and determine a sensitive classification rule matched with each data unit;
the first matching calculation module 101, configured to perform matching calculation on the data units according to the sensitive classification rule; or, when the sensitive classification rule is not matched with the data dictionary, to perform matching calculation on the data units according to the sensitive classification rule;
the first identification degree calculation module 102, configured to determine the ratio of the number of matched data units to the total number of data unit samples according to the matching calculation, and obtain a first sensitive rule identification degree;
the first determination module 103, configured to determine that the field to be detected belongs to the type corresponding to the sensitive classification rule when the first sensitive rule identification degree is greater than a preset sensitivity threshold;
the second identification degree calculation module 104, configured to load the field to be detected into the sensitive data identification model and obtain a second sensitive rule identification degree output by the sensitive data identification model;
the second determination module 105, configured to determine that the field to be detected belongs to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree;
the third identification degree calculation module 106, configured to count a third sensitive rule identification degree through the data dictionary when the sensitive classification rule is matched with the data dictionary; the data dictionary is bound with each sensitive rule in advance;
and the third determination module 107, configured to determine the type to which the field to be detected belongs according to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree, together with the third sensitive rule identification degree.
In the above sensitive data multi-layer identification device, data unit samples of the field to be detected are obtained and the sensitive classification rule matched with each data unit is determined. When the sensitive classification rule matches the data dictionary, the third sensitive rule identification degree is counted through the data dictionary; the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree; and the field to be detected is loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model. The type to which the field to be detected belongs is then determined according to the greater of the first and second sensitive rule identification degrees, together with the third sensitive rule identification degree. On this basis, the preset attributes of the data dictionary make it convenient to identify the fields to be detected of various users, overcome the limitations of AI identification by the sensitive data identification model, and improve both the range and the accuracy of identification.
The embodiment of the invention also provides a computer storage medium, on which computer instructions are stored, and when the instructions are executed by a processor, the sensitive data multi-layer identification method of any one of the above embodiments is realized.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a RAM, a ROM, a magnetic or optical disk, or various other media that can store program code.
Corresponding to the computer storage medium, in one embodiment, a computer device is further provided, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the multi-layer identification method for sensitive data in any of the above embodiments is implemented.
The computer device may be a terminal, and its internal structure diagram may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a sensitive data multi-tier identification method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
The computer device obtains data unit samples of the field to be detected and determines the sensitive classification rule matched with each data unit. When the sensitive classification rule matches the data dictionary, it counts the third sensitive rule identification degree through the data dictionary; it determines the ratio of the number of matched data units to the total number of data unit samples from the matching calculation to obtain the first sensitive rule identification degree; and it loads the field to be detected into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model. The type to which the field to be detected belongs is then determined according to the greater of the first and second sensitive rule identification degrees, together with the third sensitive rule identification degree. On this basis, the preset attributes of the data dictionary make it convenient to identify the fields to be detected of various users, overcome the limitations of AI identification by the sensitive data identification model, and improve both the range and the accuracy of identification.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of this specification.
The above embodiments represent only some implementations of the present invention. Their description is specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (6)

1. A multi-layer identification method for sensitive data, comprising the steps of:
acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit samples comprise a plurality of data units;
when the sensitive classification rule is matched with a data dictionary, counting the recognition degree of a third sensitive rule through the data dictionary; the data dictionary is bound with each sensitive rule in advance; the data dictionary stores user-defined codes; the user-defined code is used for calculating the recognition degree of the third sensitive rule;
when the sensitive classification rule is not matched with the data dictionary, performing matching calculation on the data unit according to the sensitive classification rule;
determining the ratio of the number of the matched data units to the total number of the data unit samples according to the matching calculation to obtain a first sensitive rule identification degree;
loading the field to be detected into a sensitive data identification model to obtain a second sensitive rule identification degree output by the sensitive data identification model;
and determining the type to which the field to be detected belongs according to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree, together with the third sensitive rule identification degree.
2. The sensitive data multi-layer identification method according to claim 1, wherein the process of determining the type to which the field to be detected belongs according to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree, together with the third sensitive rule identification degree, comprises the step of:
determining that the field to be detected belongs to the type corresponding to the larger of said greater one and the third sensitive rule identification degree.
3. The sensitive data multi-layer identification method according to claim 1, wherein the process of determining the sensitive classification rule matched with each data unit comprises the steps of:
and evaluating regular expressions against the data unit to determine the sensitive rule matched with the data unit.
4. The multi-layer identification method for sensitive data according to claim 1, wherein the process of performing matching calculation on the data unit according to the sensitive classification rule comprises the following steps:
and when the sensitive classification rule has a corresponding sensitive data feature code table, performing matching calculation on the character string feature identification of the data unit according to the sensitive data feature code table.
5. The multi-layer identification method for sensitive data according to claim 1, wherein the process of performing matching calculation on the data unit according to the sensitive classification rule comprises the following steps:
and when the sensitive classification rule needs to be strongly checked, performing matching calculation on the data unit according to the data rule corresponding to the strong check.
6. The sensitive data multi-layer identification method according to claim 1, wherein the training process of the sensitive data identification model comprises the steps of:
obtaining corpus data and stop words;
preprocessing the corpus data and the stop words to obtain a preprocessing result;
and performing multiple times of model training according to the preprocessing result to obtain a sensitive data recognition model.
CN202111194834.7A 2021-10-14 2021-10-14 Sensitive data multi-layer identification method Active CN113642030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111194834.7A CN113642030B (en) 2021-10-14 2021-10-14 Sensitive data multi-layer identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111194834.7A CN113642030B (en) 2021-10-14 2021-10-14 Sensitive data multi-layer identification method

Publications (2)

Publication Number Publication Date
CN113642030A CN113642030A (en) 2021-11-12
CN113642030B true CN113642030B (en) 2022-02-15

Family

ID=78426786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111194834.7A Active CN113642030B (en) 2021-10-14 2021-10-14 Sensitive data multi-layer identification method

Country Status (1)

Country Link
CN (1) CN113642030B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115174140A (en) * 2022-05-26 2022-10-11 中国电信股份有限公司 Data identification method and device, electronic equipment and nonvolatile storage medium
CN115310514A (en) * 2022-07-05 2022-11-08 上海淇毓信息科技有限公司 Method and device for identifying target type data in mass data
CN116090006B (en) * 2023-02-01 2023-09-08 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487447A (en) * 2020-11-25 2021-03-12 平安信托有限责任公司 Data security processing method, device, equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9785795B2 (en) * 2014-05-10 2017-10-10 Informatica, LLC Identifying and securing sensitive data at its source
CN105825138B (en) * 2015-01-04 2019-02-15 北京神州泰岳软件股份有限公司 A kind of method and apparatus of sensitive data identification
CN104933443A (en) * 2015-06-26 2015-09-23 北京途美科技有限公司 Automatic identifying and classifying method for sensitive data
US11216491B2 (en) * 2016-03-31 2022-01-04 Splunk Inc. Field extraction rules from clustered data samples
US10810317B2 (en) * 2017-02-13 2020-10-20 Protegrity Corporation Sensitive data classification
CN109766485A (en) * 2018-12-07 2019-05-17 中国电力科学研究院有限公司 A kind of sensitive information inspection method and system
US11704494B2 (en) * 2019-05-31 2023-07-18 Ab Initio Technology Llc Discovering a semantic meaning of data fields from profile data of the data fields
US11010287B1 (en) * 2019-07-01 2021-05-18 Intuit Inc. Field property extraction and field value validation using a validated dataset
CN110727880B (en) * 2019-10-18 2022-06-17 西安电子科技大学 Sensitive corpus detection method based on word bank and word vector model
US11886399B2 (en) * 2020-02-26 2024-01-30 Ab Initio Technology Llc Generating rules for data processing values of data fields from semantic labels of the data fields
CN113177233A (en) * 2021-05-31 2021-07-27 上海英方软件股份有限公司 Sensitive data identification method and device

Also Published As

Publication number Publication date
CN113642030A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
CN113642030B (en) Sensitive data multi-layer identification method
WO2019218699A1 (en) Fraud transaction determining method and apparatus, computer device, and storage medium
WO2020215571A1 (en) Sensitive data identification method and device, storage medium, and computer apparatus
WO2020000688A1 (en) Financial risk verification processing method and apparatus, computer device, and storage medium
CN108763952B (en) Data classification method and device and electronic equipment
US11461298B1 (en) Scoring parameter generation for identity resolution
US10803057B1 (en) Utilizing regular expression embeddings for named entity recognition systems
US11232182B2 (en) Open data biometric identity validation
WO2021164205A1 (en) Identity identification-based data auditing method and apparatus, and computer device
WO2020048056A1 (en) Risk decision method and apparatus
US11321486B2 (en) Method, apparatus, device, and readable medium for identifying private data
CN112948823A (en) Data leakage risk assessment method
CN113052577A (en) Method and system for estimating category of virtual address of block chain digital currency
US11295125B2 (en) Document fingerprint for fraud detection
US20210012026A1 (en) Tokenization system for customer data in audio or video
CN113268567A (en) Multi-attribute text matching method, device, equipment and storage medium
CN112990989A (en) Value prediction model input data generation method, device, equipment and medium
US20230039039A1 (en) Process for determining a degree of data exposure
US11886467B2 (en) Method, apparatus, and computer-readable medium for efficiently classifying a data object of unknown type
US11314897B2 (en) Data identification method, apparatus, device, and readable medium
CN114170000A (en) Credit card user risk category identification method, device, computer equipment and medium
CN114330533A (en) Equipment screen aging two-classification model training method and equipment screen aging detection method
EA039466B1 (en) Method and system for classifying data in order to detect confidential information in a text
CN116232760B (en) Fraud website identification early warning method, device, equipment and storage medium
CN114244558B (en) Injection attack detection method, injection attack detection device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant