CN113642030B

CN113642030B - Sensitive data multi-layer identification method

Info

Publication number: CN113642030B
Application number: CN202111194834.7A
Authority: CN
Inventors: 吕丹; 洪俊鑫
Original assignee: Guangdong Hongshu Technology Co ltd
Current assignee: Guangdong Hongshu Technology Co ltd
Priority date: 2021-10-14
Filing date: 2021-10-14
Publication date: 2022-02-15
Anticipated expiration: 2041-10-14
Also published as: CN113642030A

Abstract

The invention relates to a sensitive data multi-layer identification method. The method comprises: obtaining a data unit sample of a field to be detected; determining a sensitive classification rule matched with each data unit; performing matching calculation on the data units according to the sensitive classification rule; and determining, from the matching calculation, the ratio of the number of matched data units to the total number of data unit samples to obtain a first sensitive rule identification degree. Further, the field to be detected is loaded into a sensitive data identification model to obtain a second sensitive rule identification degree output by the model, and the field to be detected is finally judged to belong to the type corresponding to the larger of the first sensitive rule identification degree and the second sensitive rule identification degree. In this way, the sensitive data identification model improves the identification accuracy for irregular fields to be detected.

Description

Sensitive data multi-layer identification method

Technical Field

The invention relates to the technical field of data security, in particular to a sensitive data multilayer identification method.

Background

With the rapid development of the internet, data security has attracted widespread public attention, and security incidents in which personal or sensitive information is leaked may lead to serious cybercrime. Traditional sensitive data discovery technology has a particularly low recognition rate for non-standard sensitive data, so sensitive data is easily missed and exposed to leakage risk.

Traditional sensitive data discovery technology identifies and locates sensitive data by technical means such as regular expression matching, keyword code table mapping, data type definition discrimination, and data characteristic calculation. These means can discover sensitive data accurately only when the data quality is high. In some enterprises, however, non-standard data acquisition processes lead to poor data quality: for example, a customer address field may contain special characters, may lack key identifying information such as province, city, and district, or may contain non-address data. For such data the traditional means have low identification accuracy and cannot meet the enterprise's requirements for accurate sensitive data discovery. Excessive manual intervention raises the enterprise's production cost, and omissions during visual inspection may indirectly leak users' private data.

It can be seen that conventional sensitive data discovery techniques suffer from the above drawbacks.

Disclosure of Invention

In view of this, it is necessary to provide a sensitive data multi-layer identification method that addresses the defects of conventional sensitive data discovery technology.

A sensitive data multi-layer identification method, comprising the steps of:

acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit samples comprise a plurality of data units;

performing matching calculation on the data units according to the sensitive classification rule;

determining the ratio of the number of matched data units to the total number of data unit samples according to matching calculation to obtain the identification degree of the first sensitive rule;

and when the identification degree of the first sensitive rule is greater than a preset sensitive threshold value, judging that the field to be detected belongs to the corresponding type of the sensitive classification rule.

In the above sensitive data multi-layer identification method, after the data unit sample of the field to be detected is obtained and the sensitive classification rule matched with each data unit is determined, matching calculation is performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree. When the first sensitive rule identification degree is greater than the preset sensitive threshold, the field to be detected is judged to belong to the type corresponding to the sensitive classification rule. In this way, the sensitive data type of the field to be detected is identified automatically through rule matching of the data units in the field. Meanwhile, the preset sensitive threshold improves the accuracy of taking the type corresponding to the sensitive classification rule as the identification result.

In one embodiment, the process of determining the sensitive classification rule matching each data unit includes the steps of:

performing regular expression calculation on the data unit to calculate the sensitive rule matched with the data unit.

In one embodiment, the process of performing matching computation on data units according to sensitive classification rules includes the steps of:

when the sensitive classification rule has a corresponding sensitive data characteristic code table, performing matching calculation on the character string characteristic identification of the data unit according to the sensitive data characteristic code table;

and when the sensitive classification rule requires a strong check, performing matching calculation on the data unit according to the data rule corresponding to the strong check.

A sensitive data multi-layer identification method, comprising the steps of:

loading the field to be detected into the sensitive data identification model to obtain a second sensitive rule identification degree output by the sensitive data identification model;

and judging that the field to be detected belongs to the corresponding type of the larger one of the first sensitive rule identification degree and the second sensitive rule identification degree.

In the above sensitive data multi-layer identification method, after the data unit sample of the field to be detected is obtained and the sensitive classification rule matched with each data unit is determined, matching calculation is performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree. Further, the field to be detected is loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model, and the field to be detected is finally judged to belong to the type corresponding to the larger of the first sensitive rule identification degree and the second sensitive rule identification degree. In this way, the sensitive data identification model improves the identification accuracy for irregular fields to be detected.

In one embodiment, the training process of the sensitive data recognition model comprises the following steps:

obtaining corpus data and stop words;

preprocessing the corpus data and stop words to obtain a preprocessing result;

and performing multiple times of model training according to the preprocessing result to obtain a sensitive data recognition model.

In one embodiment, the process of judging that the field to be detected belongs to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree includes the step of:

when the first sensitive rule identification degree is greater than a preset sensitive threshold and the second sensitive rule identification degree is greater than the preset sensitive threshold, judging that the field to be detected belongs to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree.

A sensitive data multi-layer identification method, comprising the steps of:

when the sensitive classification rule is matched with the data dictionary, counting the recognition degree of the third sensitive rule through the data dictionary; the data dictionary is bound with each sensitive rule in advance;

when the sensitive classification rule is not matched with the data dictionary, performing matching calculation on the data unit according to the sensitive classification rule;

and judging the type corresponding to the field to be detected according to the third sensitive rule identification degree together with the greater of the first sensitive rule identification degree and the second sensitive rule identification degree.

In the above sensitive data multi-layer identification method, after the data unit sample of the field to be detected is obtained, the sensitive classification rule matched with each data unit is determined. When the sensitive classification rule matches the data dictionary, the third sensitive rule identification degree is counted through the data dictionary. The ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree, and the field to be detected is loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model. Further, the type corresponding to the field to be detected is judged according to the third sensitive rule identification degree together with the greater of the first sensitive rule identification degree and the second sensitive rule identification degree. In this way, the preset attributes of the data dictionary make it convenient to identify the fields to be detected of various users, overcome the limitations of AI identification by the sensitive data identification model, and improve the identification range and accuracy.

Drawings

FIG. 1 is a flow diagram of one embodiment of a method for multi-level identification of sensitive data;

FIG. 2 is a flow chart of another embodiment of a method for multi-level identification of sensitive data;

FIG. 3 is a flow diagram of one embodiment of another sensitive data multi-tier identification method;

FIG. 4 is a flow diagram of a sensitive data recognition model training method according to an embodiment;

FIG. 5 is a flow chart of an embodiment of a multi-tier sensitive data identification method;

FIG. 6 is a block diagram of an embodiment of a sensitive data multiple identification device;

FIG. 7 is a schematic diagram of the internal structure of a computer according to another embodiment.

Detailed Description

For better understanding of the objects, technical solutions and effects of the present invention, the present invention will be further explained with reference to the accompanying drawings and examples. Meanwhile, the following described examples are only for explaining the present invention, and are not intended to limit the present invention.

The embodiment of the invention provides a sensitive data multilayer identification method.

Fig. 1 is a flowchart of an embodiment of a sensitive data multi-layer identification method, and as shown in fig. 1, an embodiment of a sensitive data multi-layer identification method includes steps S100 to S103:

S100, acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit sample comprises a plurality of data units;

S101, performing matching calculation on the data units according to the sensitive classification rule;

S102, determining the ratio of the number of matched data units to the total number of data unit samples according to the matching calculation, and obtaining a first sensitive rule identification degree;

S103, when the first sensitive rule identification degree is greater than a preset sensitive threshold, judging that the field to be detected belongs to the type corresponding to the sensitive classification rule.

The field to be detected comprises a plurality of data units, which together form the data unit sample. Each data unit represents one value of a table field; the values together make up the table field, that is, the field to be detected.
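Steps S101 to S103 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the mobile-number pattern, and the sample values are all assumptions introduced here, and the 0.5 default simply mirrors the preferred threshold mentioned later in the description.

```python
import re
from typing import Callable, Sequence


def first_recognition_degree(units: Sequence[str],
                             matches: Callable[[str], bool]) -> float:
    """S101-S102: ratio of matched data units to all sampled data units."""
    if not units:
        return 0.0
    matched = sum(1 for unit in units if matches(unit))
    return matched / len(units)


def field_is_sensitive(units: Sequence[str],
                       matches: Callable[[str], bool],
                       threshold: float = 0.5) -> bool:
    """S103: the field belongs to the rule's type if the degree exceeds the threshold."""
    return first_recognition_degree(units, matches) > threshold


# Hypothetical rule for illustration: an 11-digit mobile number starting with 1.
MOBILE = re.compile(r"1[3-9]\d{9}")


def is_mobile(unit: str) -> bool:
    return MOBILE.fullmatch(unit.strip()) is not None


sample = ["13812345678", "15900001111", "n/a", "18600002222"]
degree = first_recognition_degree(sample, is_mobile)  # 3 of 4 units match
```

Because the decision is made on the ratio rather than on every value, a field with a few dirty entries (such as the "n/a" above) can still be classified.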

The sensitive classification rule is a classification rule for custom sensitive information supported by a Data Security Center (DSC), that is, one of the sensitive data identification rules. The data security center identifies sensitive data in files or tables and raises alarms through the sensitive data identification rules, and a user can define and manage the sensitive identification rules according to business needs to determine various sensitive rules. For example, the sensitive rules include a name sensitive rule and a telephone number sensitive rule: the name sensitive rule comprises a regular expression and common-name identification; the telephone number sensitive rule comprises a regular expression, identification of the special service number formed by the first three digits, and the like.

Therefore, the sensitive data identification rule matched with a data unit, namely the sensitive classification rule, can be determined according to rules customized in the DSC in advance. The definition of a sensitive data identification rule and its matching against data units can be implemented with keywords, regular expressions, or built-in algorithms. For example, when the DSC has been configured in advance with sensitive data identification rules for types such as mobile phone number and identification number, matching the data units can determine that the corresponding sensitive classification rules include the sensitive data identification rules for those types.

Based on the sensitive data identification rules, classifying the data of the field to be detected comprises sensitivity level classification and rule classification of the field to be detected. In the DSC system, the sensitivity levels comprise an unknown sensitivity level, a low sensitivity level, a medium sensitivity level, a high sensitivity level and the like; the rule classification comprises personal sensitive information, equipment sensitive information, key sensitive information, sensitive picture information, enterprise sensitive information, position sensitive information, general sensitive information and the like.

For example, when a sensitive data identification rule is defined, its content is defined according to the rule classification so as to match the data units of the field to be detected.

When the sensitive classification rule is identified by keyword matching, the sensitive classification rule matched with the data unit is determined by matching the data unit against the sensitive data keywords entered in advance in the keyword text box.

In one embodiment, fig. 2 is a flowchart of another implementation of a sensitive data multi-layer identification method, as shown in fig. 2, a process of determining a sensitive classification rule matching each data unit in step S100 includes step S200:

S200, performing regular expression calculation on the data units to calculate the sensitive rules matched with the data units.

The sensitive rule matched with the data unit, namely the sensitive classification rule, is calculated through the regular expression. For example, when the sensitive data identification rule is preset, a regular expression of the corresponding type is entered in advance in the rule definition text box. During matching calculation, the data unit is evaluated against the regular expression to determine the sensitive classification rule.

In one embodiment, the regular expression calculation is performed by regular expression matching: the system matches the data unit against the regular expression configured for the sensitive rule, and if the match succeeds, the data unit can be preliminarily judged to satisfy that sensitive rule.
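As an illustration of step S200, the sketch below matches each data unit against a small set of configured regular expressions and picks the rule that matches most often. The rule names and patterns are hypothetical stand-ins for whatever a user would enter in the rule definition text box, and choosing the most frequent rule is an assumption, not something the description specifies.

```python
import re
from collections import Counter
from typing import List, Optional, Sequence

# Hypothetical rule configuration: rule name -> regular expression.
RULE_PATTERNS = {
    "mobile_phone": re.compile(r"1[3-9]\d{9}"),
    "id_card": re.compile(r"\d{17}[\dXx]"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def rules_matching(unit: str) -> List[str]:
    """Return the names of all configured rules whose pattern matches the whole unit."""
    text = unit.strip()
    return [name for name, pattern in RULE_PATTERNS.items()
            if pattern.fullmatch(text)]


def candidate_rule(units: Sequence[str]) -> Optional[str]:
    """Pick the rule matched by the most data units, or None if nothing matches."""
    counts = Counter(name for unit in units for name in rules_matching(unit))
    return counts.most_common(1)[0][0] if counts else None
```

A successful match here is only a preliminary judgment about the data format; the feature and strong-check layers described next refine it.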

In this way, the data format of the data units in the field to be detected is determined through matching of the sensitive classification rules, completing the first-layer identification of the sensitive data.

After S100 completes the matching of the sensitive classification rule, matching calculation is further performed on the data units through the sensitive classification rule to determine the data characteristics of the data units, thereby completing the second-layer identification of the sensitive data once the data format has been determined.

The data characteristics of a data unit comprise character strings, data conversion algorithms, sensitive word fusion, and the like. In S101, performing matching calculation on the data units according to the sensitive classification rule includes performing feature matching calculation according to the sensitive classification rule and determining the data characteristics corresponding to the data units.

In one embodiment, as shown in fig. 2, the process of performing matching calculation on the data unit according to the sensitive classification rule in step S101 includes step S201:

S201, when the sensitive classification rule has a corresponding sensitive data feature code table, performing matching calculation on the character string feature identification of the data unit according to the sensitive data feature code table.

When the sensitive data identification rule is preset, the data characteristics are configured in advance and a sensitive data feature code table is determined. The sensitive data feature code table corresponds to the sensitive classification rule. Matching calculation is performed on the character string feature identification of the data unit according to the character strings pre-stored in the sensitive data feature code table, and the data characteristics of the data unit are determined by character matching.

For example, the sensitive data feature code table includes a common name code table, a special service number table of the first three digits of the telephone number, a regional code table of the first six digits of the identity card, an address area code table, and the like, and the system performs matching verification on the data through the feature code table.

According to the determined data feature matching results, the proportion of data units in the field to be detected whose data features satisfy the matching calculation is computed, that is, the ratio of the number of matched data units to the total number of data unit samples, and the first sensitive rule identification degree is determined from this ratio. In one embodiment, the first sensitive rule identification degree is determined as the product of the ratio and a correction coefficient.
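The feature code table check and the corrected ratio can be sketched together. The prefix values in the table, the function names, and the 0.9 correction coefficient are illustrative assumptions; the description does not give concrete table contents or a coefficient value.

```python
from typing import Collection, Sequence

# Illustrative code table: first three digits of a telephone number.
PHONE_PREFIX_TABLE = frozenset({"138", "139", "186"})


def matches_code_table(unit: str, table: Collection[str], prefix_len: int) -> bool:
    """Check whether the leading characters of a data unit appear in the code table."""
    return unit.strip()[:prefix_len] in table


def corrected_recognition_degree(units: Sequence[str],
                                 table: Collection[str],
                                 prefix_len: int,
                                 correction: float = 1.0) -> float:
    """Matched-unit ratio multiplied by a correction coefficient."""
    if not units:
        return 0.0
    matched = sum(matches_code_table(u, table, prefix_len) for u in units)
    return (matched / len(units)) * correction


units = ["13812345678", "18600002222", "12000000000", "not a number"]
# Two of four units have a prefix in the table: 0.5 * 0.9
degree = corrected_recognition_degree(units, PHONE_PREFIX_TABLE, 3, correction=0.9)
```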

In one embodiment, as shown in fig. 2, the process of performing matching calculation on the data unit according to the sensitive classification rule in step S101 further includes step S202:

S202, when the sensitive classification rule requires a strong check, performing matching calculation on the data unit according to the data rule corresponding to the strong check.

When the sensitive data identification rules are preset, the data rule of each sensitive data identification rule can be determined; when the sensitive classification rule requires a strong check, matching calculation is performed on the data unit according to the data rule corresponding to the strong check.

The following explains the data rules corresponding to the strong check with specific examples:

1. The identification number: the last digit is a check code, which must be calculated according to the national standard check rule.

2. The organization code: the national organization code consists of an eight-character body code and a one-character check code (a digit or a capital Latin letter), which is computed by a check formula (the formula is not reproduced in this text).

3. The bank card number: it is verified using the Luhn algorithm.

In one embodiment, the check code of the data unit is subjected to matching calculation through the data rule to complete strong check.
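Two of the strong checks named above can be sketched with widely known algorithms. The Luhn check is the standard one. The identification number check follows the commonly published weighted mod-11 scheme for 18-character resident identity numbers, which is assumed here to be the "national standard check rule" the description refers to. The function names are hypothetical, and the organization code check is omitted because its formula is not given in this text.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    if len(number) < 2 or not number.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9          # same as summing the two digits of the product
        total += d
    return total % 10 == 0


# Assumed weights and check characters of the weighted mod-11 scheme.
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def id_number_valid(number: str) -> bool:
    """Verify the 18th character of an identity number against the first 17 digits."""
    if len(number) != 18 or not number[:17].isdigit():
        return False
    weighted = sum(int(d) * w for d, w in zip(number[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[weighted % 11] == number[17].upper()
```

A unit that passes the regular expression but fails the check digit would not be counted as matched, which is what makes this layer stricter than format matching alone.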

After the field to be detected is judged to belong to the type corresponding to the sensitive classification rule, the field to be detected is classified and graded according to the sensitive data identification rule to which the sensitive classification rule belongs.

In one embodiment, the threshold set for the sensitive rule ranges from 0 to 100%. The threshold can be set according to data quality: it can be set above 50% when the data quality is high and below 50% when the data quality is low. As a preferred embodiment, the threshold is set to 50%.

In the sensitive data multi-layer identification method described above, after the data unit sample of the field to be detected is obtained and the sensitive classification rule matched with each data unit is determined, matching calculation is performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree. When the first sensitive rule identification degree is greater than the preset sensitive threshold, the field to be detected is judged to belong to the type corresponding to the sensitive classification rule. In this way, the sensitive data type of the field to be detected is identified automatically through rule matching of the data units in the field. Meanwhile, the preset sensitive threshold improves the accuracy of taking the type corresponding to the sensitive classification rule as the identification result.

Further, the above sensitive data multi-layer identification method identifies sensitive data based on data format, data characteristics, and strong checking of data rules. However, when the data format of the field to be detected is relatively disordered and has no obvious format or characteristics, the identification accuracy of that method is low. In view of this, the embodiment of the invention also provides another sensitive data multi-layer identification method.

Fig. 3 is a flowchart of an embodiment of another sensitive data multi-layer identification method, and as shown in fig. 3, an embodiment of the other sensitive data multi-layer identification method includes steps S300 to S304:

S300, acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit sample comprises a plurality of data units;

S301, performing matching calculation on the data units according to the sensitive classification rule;

S302, determining the ratio of the number of matched data units to the total number of data unit samples according to the matching calculation, and obtaining a first sensitive rule identification degree;

S303, loading the field to be detected into the sensitive data identification model to obtain a second sensitive rule identification degree output by the sensitive data identification model;

S304, judging that the field to be detected belongs to the type corresponding to the larger of the first sensitive rule identification degree and the second sensitive rule identification degree.

The sensitive data recognition model is obtained by training in advance. Fig. 4 is a flowchart of a sensitive data recognition model training method according to an embodiment, and as shown in fig. 4, the training process of the sensitive data recognition model includes steps S400 to S402:

S400, obtaining corpus data and stop words;

S401, preprocessing the corpus data and stop words to obtain a preprocessing result;

S402, performing model training multiple times according to the preprocessing result to obtain a sensitive data recognition model.

The corpus data and the stop words are obtained from the database, and basic data are provided for model training. In one embodiment, the corpus data includes extracted table field data.

In one embodiment, stop words include special characters, other character strings, etc. that are not relevant to the corpus data.

By preprocessing the corpus data and stop words, they are converted into parametric data, for example vectorized data, suited to the sensitive data recognition model.

In one embodiment, the process of preprocessing the corpus data and stop words in step S401 to obtain a preprocessing result includes the step of: encapsulating the corpus data and stop words as parameters to obtain the parameters serving as the preprocessing result.

The corpus data and stop words are encapsulated into parameters acceptable to the sensitive data recognition model as the preprocessing result.

In one embodiment, the process of encapsulating the corpus data and stop words as parameters to obtain the parameters serving as the preprocessing result includes the steps of:

performing word segmentation processing on the corpus data to obtain a word segmentation list;

removing stop words of the word segmentation list to obtain a targeted word segmentation list;

and packaging the targeted word segmentation list into vectorized parameters as a preprocessing result.

Removing the stop words from the word segmentation list makes the subsequent model training more targeted.

The model training comprises training over multiple generations. In one embodiment, the sensitive data recognition model comprises a text classification model, such as a Doc2vec model or a word2vec model. As a preferred embodiment, the sensitive data recognition model is a Doc2vec model.

In one embodiment, 10 to 20 or more rounds of model training are carried out, generating 10 to 20 generations of the sensitive data recognition model.
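The three preprocessing steps (word segmentation, stop word removal, vectorization) can be sketched in plain Python. The regex tokenizer is only a stand-in for a real word segmentation tool, which Chinese corpora would need, and the count vector is a stand-in for whatever input format the chosen model expects; all function names and sample strings are assumptions.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Set


def segment(text: str) -> List[str]:
    """Stand-in word segmentation: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop stop words so that only informative tokens remain."""
    return [t for t in tokens if t not in stop_words]


def build_vocabulary(token_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Encode a token list as a count vector over the vocabulary."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokens).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


corpus = ["No. 12 Renmin Road, Guangzhou", "Room 3, Renmin Road"]
stop_words = {"no", "room"}
token_lists = [remove_stop_words(segment(doc), stop_words) for doc in corpus]
vocab = build_vocabulary(token_lists)
vectors = [vectorize(tokens, vocab) for tokens in token_lists]
```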

In one embodiment, the process of performing multiple model training according to the preprocessing result in step S402 includes the steps of:

and performing more than 10 times of model training according to the preprocessing result.

And performing more than 10 times of model training on the preprocessing result to obtain a 10-generation sensitive data recognition model.

And after the sensitive data identification model is determined, taking the field to be detected as sample data to be identified of the sensitive data identification model.

The sample data comprises a fixed quantity of data extracted from a given table field.

The sample data is preprocessed to obtain a sample processing result: the sample data is encapsulated into parameters acceptable to the sensitive data identification model as the sample processing result.

In one embodiment, the process of preprocessing sample data to obtain a sample processing result includes the step of:

encapsulating the sample data as parameters to obtain the parameters serving as the sample processing result.

Specifically, word segmentation processing is carried out on the sample data to generate a word segmentation list, and the word segmentation list is encapsulated into vector parameters acceptable to the sensitive data identification model.

The sample processing result is loaded into the sensitive data identification model to obtain the recognition rate output by the model as the recognition result.

In one embodiment, each sensitive rule corresponds to one sensitive data identification model. The system loads all the models, identifies the data unit sample with them, and outputs the matched model, from which the corresponding sensitive rule is obtained.
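The one-model-per-rule arrangement can be sketched as a mapping from rule name to a scoring callable. The two toy "models" below are simple heuristics invented for illustration; in the described system they would be trained text classification models such as Doc2vec, whose loading and inference calls are not shown here.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Model = Callable[[Sequence[str]], float]  # returns a recognition rate in [0, 1]


def address_model(units: Sequence[str]) -> float:
    """Toy stand-in: fraction of units mentioning a road or street."""
    if not units:
        return 0.0
    hits = sum(1 for u in units
               if any(word in u.lower() for word in ("road", "street")))
    return hits / len(units)


def name_model(units: Sequence[str]) -> float:
    """Toy stand-in: fraction of units made of exactly two alphabetic words."""
    if not units:
        return 0.0
    hits = sum(1 for u in units
               if len(u.split()) == 2 and u.replace(" ", "").isalpha())
    return hits / len(units)


def best_matching_rule(models: Dict[str, Model],
                       units: Sequence[str]) -> Tuple[Optional[str], float]:
    """Run every model on the sample and return the rule with the highest rate."""
    best_rule, best_rate = None, 0.0
    for rule, model in models.items():
        rate = model(units)
        if rate > best_rate:
            best_rule, best_rate = rule, rate
    return best_rule, best_rate


models = {"address": address_model, "name": name_model}
units = ["12 Renmin Road", "Zhongshan Street 5", "unknown"]
rule, rate = best_matching_rule(models, units)
```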

In one embodiment, the process of loading the sample processing result into the sensitive data recognition model and obtaining the recognition rate output by the sensitive data recognition model as the recognition result includes the following steps:

and loading the sample processing result into the sensitive data identification model, outputting the identification rate as an identification result when the identification rate output by the sensitive data identification model is greater than a preset sensitive threshold, and otherwise, controlling the sensitive data identification model to repeat model operation.

And the identification result is the identification degree of the second sensitive rule.

The recognition rate is the number of identified samples divided by the total number of samples, and the preset sensitivity threshold ranges from 70% to 90%. As a preferred embodiment, the preset sensitivity threshold is 80%. When the recognition rate output by the sensitive data recognition model is greater than 80%, the recognition rate is output as the recognition result; otherwise, the sensitive data recognition model is controlled to repeat the model operation.
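The threshold gate on the model output can be sketched as below. The description does not say how many times the model operation may be repeated, so the `max_runs` bound is an assumption added to keep the loop finite; `run_model` is a hypothetical callable returning the counts of identified and total samples.

```python
from typing import Callable, Optional, Tuple


def gated_recognition_rate(run_model: Callable[[], Tuple[int, int]],
                           threshold: float = 0.8,
                           max_runs: int = 3) -> Optional[float]:
    """Return the first recognition rate above the threshold, or None.

    run_model returns (identified_samples, total_samples) for one model run.
    max_runs is an assumed bound; the source only says the run is repeated.
    """
    for _ in range(max_runs):
        identified, total = run_model()
        rate = identified / total if total else 0.0
        if rate > threshold:
            return rate
    return None


# Simulated runs: the first falls short of 80%, the second exceeds it.
simulated = iter([(7, 10), (9, 10)])
second_degree = gated_recognition_rate(lambda: next(simulated))
```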

In one embodiment, the process in step S304 of judging that the field to be detected belongs to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree includes the step of:

when the first sensitive rule identification degree and the second sensitive rule identification degree are both greater than a preset sensitive threshold, performing the comparison and judgment.

The field to be detected is judged to belong to the type corresponding to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree, that is, the corresponding sensitive data identification rule, so that the data can be classified and graded.
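The comparison in step S304 can be sketched as follows. The text only covers the case where both degrees exceed the threshold, so this sketch returns None otherwise and prefers the rule-based result on a tie; both choices are assumptions, as are the names and the single shared threshold parameter.

```python
from typing import Optional, Tuple

Candidate = Tuple[str, float]  # (type name, recognition degree)


def decide_field_type(first: Candidate,
                      second: Candidate,
                      threshold: float = 0.5) -> Optional[str]:
    """Pick the type with the larger degree when both degrees exceed the threshold."""
    if first[1] > threshold and second[1] > threshold:
        # Tie-breaking in favour of the rule-based result is an assumption.
        return first[0] if first[1] >= second[1] else second[0]
    return None  # case not specified in the description


result = decide_field_type(("mobile_phone", 0.62), ("address", 0.91))
```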

In the other sensitive data multi-layer identification method described above, after the data unit sample of the field to be detected is obtained and the sensitive classification rule matched with each data unit is determined, matching calculation is performed on the data units according to the sensitive classification rule, and the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree. Further, the field to be detected is loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model, and the field to be detected is finally judged to belong to the type corresponding to the larger of the first sensitive rule identification degree and the second sensitive rule identification degree. In this way, the sensitive data identification model improves the identification accuracy for irregular fields to be detected.

Further, the embodiment of the invention also provides yet another sensitive data multi-layer identification method.

Fig. 5 is a flowchart of a sensitive data multi-layer identification method according to an embodiment. As shown in Fig. 5, the method of this embodiment includes steps S500 to S505:

S500, acquiring a data unit sample of a field to be detected, and determining a sensitive classification rule matched with each data unit; wherein the data unit sample comprises a plurality of data units;

S501, when the sensitive classification rule is matched with the data dictionary, counting the third sensitive rule identification degree through the data dictionary; the data dictionary is bound with each sensitive rule in advance;

S502, when the sensitive classification rule is not matched with the data dictionary, performing matching calculation on the data units according to the sensitive classification rule;

S503, determining the ratio of the number of matched data units to the total number of data unit samples according to the matching calculation, and obtaining the first sensitive rule identification degree;

S504, loading the field to be detected into the sensitive data identification model, and obtaining the second sensitive rule identification degree output by the sensitive data identification model;

S505, judging the type to which the field to be detected belongs according to the third sensitive rule identification degree and the greater of the first sensitive rule identification degree and the second sensitive rule identification degree.
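Steps S500 to S505 can be sketched roughly as below. This is an illustrative reading, not the patented implementation: all names are hypothetical, a regular expression stands in for the rule's matching calculation, the data dictionary is modelled as a mapping from rule type to user-defined code, the identification model is any callable returning a type and a degree, and S501 and S502 are treated as alternative branches.

```python
import re
from typing import Callable, Dict, List, Tuple

Degree = Tuple[str, float]  # (sensitive type, identification degree); hypothetical layout


def first_degree(units: List[str], pattern: str) -> float:
    """S502-S503: matched data units divided by the total number of sampled units."""
    if not units:
        return 0.0
    matched = sum(1 for unit in units if re.fullmatch(pattern, unit))
    return matched / len(units)


def identify(
    units: List[str],                                    # S500: sampled data units
    rule_type: str,                                      # S500: matched classification rule
    pattern: str,                                        # assumption: rule expressed as a regex
    dictionary: Dict[str, Callable[[List[str]], float]],
    model: Callable[[List[str]], Degree],
) -> Degree:
    candidates: List[Degree] = []
    if rule_type in dictionary:
        # S501: the dictionary's user-defined code yields the third degree.
        candidates.append((rule_type, dictionary[rule_type](units)))
    else:
        # S502-S503: rule matching yields the first degree.
        candidates.append((rule_type, first_degree(units, pattern)))
    # S504: the model yields the second degree.
    candidates.append(model(units))
    # S505: the type with the greatest degree is selected.
    return max(candidates, key=lambda candidate: candidate[1])
```

With four sampled units of which three match the pattern, the first degree is 0.75, and it is selected whenever the model's degree is lower.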

The matching calculation of the third sensitive rule identification degree is performed through a pre-established data dictionary. The data dictionary stores user-defined code, which is used to calculate the third sensitive rule identification degree. Because the data dictionary is bound with each sensitive rule in advance, various sensitive rules, including sensitive data identification rules, can be determined through the data dictionary.

When the sensitive classification rule can be matched with the data dictionary, the third sensitive rule identification degree is counted through the data dictionary. In one embodiment, whether the sensitive classification rule matches the data dictionary is determined by data format matching, that is, by comparison against the data format records of the data dictionary.
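One possible shape for such a data dictionary is sketched below. The specification does not describe the dictionary's storage layout, so everything here is an assumption: the `DictionaryEntry` type, the use of a format label as the lookup key, and the example `employee_id_share` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class DictionaryEntry:
    rule_type: str                             # sensitive rule bound in advance
    custom_code: Callable[[List[str]], float]  # user-defined code for the third degree


def third_degree(
    data_format: str,
    units: List[str],
    dictionary: Dict[str, DictionaryEntry],
) -> Optional[float]:
    """Look up the rule's data format; if recorded, run the entry's user-defined code."""
    entry = dictionary.get(data_format)
    if entry is None:
        return None  # no dictionary match: fall back to rule-based matching
    return entry.custom_code(units)


def employee_id_share(units: List[str]) -> float:
    """Example user-defined code: share of units that look like 'EMP-...' identifiers."""
    if not units:
        return 0.0
    return sum(unit.startswith("EMP-") for unit in units) / len(units)
```

Keeping the user-defined code as a callable inside each entry lets users add their own field types without retraining the identification model, which is the benefit the specification attributes to the data dictionary.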

In one embodiment, the process in step S505 of judging the type to which the field to be detected belongs according to the third sensitive rule identification degree and the greater of the first sensitive rule identification degree and the second sensitive rule identification degree includes the step of:

judging that the field to be detected belongs to the type corresponding to the greater of (a) the greater of the first and second sensitive rule identification degrees and (b) the third sensitive rule identification degree.

That is, the first sensitive rule identification degree is compared with the second sensitive rule identification degree, and the result is compared with the third sensitive rule identification degree to determine the maximum sensitive rule identification degree. The corresponding sensitive rule type is then looked up, the sensitive data identification rule is determined, and the data is classified and graded.
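The nested comparison can be written compactly. The symbols below are introduced here for illustration and are not part of the specification: d_1, d_2 and d_3 denote the first, second and third sensitive rule identification degrees, and t_1, t_2 and t_3 the sensitive types they correspond to.

```latex
d^{*} = \max\bigl(\max(d_1, d_2),\, d_3\bigr) = \max(d_1, d_2, d_3),
\qquad
t^{*} = t_{k^{*}}, \quad k^{*} = \arg\max_{k \in \{1, 2, 3\}} d_k
```

For example, with d_1 = 0.75, d_2 = 0.82 and d_3 = 0.90, the inner comparison yields 0.82, the outer comparison yields d* = 0.90, and the field is assigned the type t_3 bound to the data dictionary rule. How ties are resolved is not stated in the specification.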

In the above further sensitive data multi-layer identification method, after the data unit samples of the field to be detected are obtained and the sensitive classification rule matched with each data unit is determined, the third sensitive rule identification degree is counted through the data dictionary when the sensitive classification rule is matched with the data dictionary; the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree; and the field to be detected is loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model. The type to which the field to be detected belongs is then judged according to the third sensitive rule identification degree and the greater of the first and second sensitive rule identification degrees. In this way, the preset attributes of the data dictionary make it convenient to identify the fields to be detected of various users, overcome the AI identification limitations of the sensitive data identification model, and improve the identification range and accuracy.

The embodiment of the invention also provides a sensitive data multilayer identification device.

Fig. 6 is a block diagram of a sensitive data multi-layer identification device according to an embodiment. As shown in Fig. 6, the device includes:

the sensitive classification rule matching module 100 is configured to obtain a data unit sample of a field to be detected, and determine a sensitive classification rule matched with each data unit;

the first matching calculation module 101 is used for performing matching calculation on the data units according to the sensitive classification rule; or, when the sensitive classification rule is not matched with the data dictionary, performing matching calculation on the data unit according to the sensitive classification rule;

the first identification degree calculation module 102 is configured to determine a ratio of the number of matched data units to the total number of data unit samples according to matching calculation, and obtain a first sensitive rule identification degree;

the first judging module 103 is used for judging that the field to be detected belongs to the corresponding type of the sensitive classification rule when the identification degree of the first sensitive rule is greater than a preset sensitive threshold value;

the second recognition degree calculation module 104 is configured to load the field to be detected into the sensitive data recognition model, and obtain a second sensitive rule recognition degree output by the sensitive data recognition model;

a second determination module 105, configured to determine that the field to be detected belongs to the corresponding type of the greater one of the first sensitivity rule identification degree and the second sensitivity rule identification degree;

the third recognition degree calculating module 106 is configured to count recognition degrees of the third sensitive rule through the data dictionary when the sensitive classification rule is matched with the data dictionary; the data dictionary is bound with each sensitive rule in advance;

and a third judging module 107, configured to judge the type to which the field to be detected belongs according to the third sensitive rule identification degree and the greater of the first sensitive rule identification degree and the second sensitive rule identification degree.

In the above sensitive data multi-layer identification device, after the data unit samples of the field to be detected are obtained and the sensitive classification rule matched with each data unit is determined, the third sensitive rule identification degree is counted through the data dictionary when the sensitive classification rule is matched with the data dictionary; the ratio of the number of matched data units to the total number of data unit samples is determined from the matching calculation to obtain the first sensitive rule identification degree; and the field to be detected is loaded into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model. The type to which the field to be detected belongs is then judged according to the third sensitive rule identification degree and the greater of the first and second sensitive rule identification degrees. In this way, the preset attributes of the data dictionary make it convenient to identify the fields to be detected of various users, overcome the AI identification limitations of the sensitive data identification model, and improve the identification range and accuracy.

The embodiment of the invention also provides a computer storage medium, on which computer instructions are stored, and when the instructions are executed by a processor, the sensitive data multi-layer identification method of any one of the above embodiments is realized.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on this understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the methods of the embodiments of the present invention. The aforementioned storage medium includes a removable storage device, a RAM, a ROM, a magnetic or optical disk, or various other media that can store program code.

Corresponding to the computer storage medium, in one embodiment, a computer device is further provided, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the multi-layer identification method for sensitive data in any of the above embodiments is implemented.

The computer device may be a terminal, and its internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a sensitive data multi-layer identification method. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device can be a touch layer covering the display screen; a key, a track ball or a touch pad arranged on the housing of the computer device; or an external keyboard, touch pad or mouse.

After the data unit samples of the field to be detected are obtained and the sensitive classification rule matched with each data unit is determined, the computer device counts the third sensitive rule identification degree through the data dictionary when the sensitive classification rule is matched with the data dictionary; determines the ratio of the number of matched data units to the total number of data unit samples from the matching calculation to obtain the first sensitive rule identification degree; and loads the field to be detected into the sensitive data identification model to obtain the second sensitive rule identification degree output by the model. The type to which the field to be detected belongs is then judged according to the third sensitive rule identification degree and the greater of the first and second sensitive rule identification degrees. In this way, the preset attributes of the data dictionary make it convenient to identify the fields to be detected of various users, overcome the AI identification limitations of the sensitive data identification model, and improve the identification range and accuracy.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above examples show only some embodiments of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A multi-layer identification method for sensitive data, comprising the steps of:

when the sensitive classification rule is matched with a data dictionary, counting the recognition degree of a third sensitive rule through the data dictionary; the data dictionary is bound with each sensitive rule in advance; the data dictionary stores user-defined codes; the user-defined code is used for calculating the recognition degree of the third sensitive rule;

determining the ratio of the number of the matched data units to the total number of the data unit samples according to the matching calculation to obtain a first sensitive rule identification degree;

loading the field to be detected into a sensitive data identification model to obtain a second sensitive rule identification degree output by the sensitive data identification model;

and judging the corresponding type of the field to be detected according to the larger one of the first sensitive rule identification degree and the second sensitive rule identification degree and the third sensitive rule identification degree.

2. The sensitive data multi-layer identification method according to claim 1, wherein the process of determining the corresponding type of the field to be detected according to the greater of the first sensitive rule identification degree and the second sensitive rule identification degree and the third sensitive rule identification degree comprises the steps of:

and judging that the field to be detected belongs to the type corresponding to the greater of (a) the greater of the first and second sensitive rule identification degrees and (b) the third sensitive rule identification degree.

3. The sensitive data multi-layer identification method according to claim 1, wherein the process of determining the sensitive classification rule matched with each data unit comprises the steps of:

4. The multi-layer identification method for sensitive data according to claim 1, wherein the process of performing matching calculation on the data unit according to the sensitive classification rule comprises the following steps:

and when the sensitive classification rule has a corresponding sensitive data feature code table, performing matching calculation on the character string feature identification of the data unit according to the sensitive data feature code table.

5. The multi-layer identification method for sensitive data according to claim 1, wherein the process of performing matching calculation on the data unit according to the sensitive classification rule comprises the following steps:

6. The sensitive data multi-layer identification method according to claim 1, wherein the training process of the sensitive data identification model comprises the steps of:

obtaining corpus data and stop words;

preprocessing the corpus data and the stop words to obtain a preprocessing result;