CN111914294B

CN111914294B - Database sensitive data identification method and system

Info

Publication number: CN111914294B
Application number: CN202010762510.8A
Authority: CN
Inventors: 欧阳解文; 魏茜; 叶俊
Original assignee: CCB Finetech Co Ltd
Current assignee: CCB Finetech Co Ltd
Priority date: 2020-07-31
Filing date: 2020-07-31
Publication date: 2023-06-30
Anticipated expiration: 2040-07-31
Also published as: CN111914294A

Abstract

The invention provides a database sensitive data identification method and system. The method comprises: connecting to a database to obtain data to be identified; performing characterization processing on the data to be identified to obtain characterization data reflecting the change in the information content of the field contents in the data to be identified; and determining the sensitive data classification and desensitization range of each field content in the data to be identified according to the characterization data of each field content and a data feature identification rule, wherein the data feature identification rule comprises a correspondence between sensitive data classifications and discrimination rules containing characterization data ranges, and a desensitization range calculation rule. The method can discover sensitive data even when the overall picture of the data is not known, reduces blind spots in sensitive data discovery and the workload of manual intervention, and has good generality and application value.

Description

Database sensitive data identification method and system

Technical Field

The present disclosure relates to the field of database data analysis, and in particular, to a method and system for identifying sensitive database data.

Background

In the current data age, data has become a core asset of enterprises, and protecting, effectively managing and reasonably using data assets are all important concerns for them. Enterprises that hold large amounts of structured data, whether in many scattered, independent application databases or in data warehouses storing massive data, need a simple, practical and effective database scanning tool to determine how sensitive data is distributed and to manage it by level and category. The distinguishing feature of existing sensitive data identification schemes is that rules or models for identifying and discovering sensitive data are designed on the basis of sensitive data that is already known or data that is already well understood. Existing schemes have the following drawbacks:

1) Sensitive data can be discovered only if its presence is already known. Data owners often change as an organization changes, and a new data owner may not know that sensitive data exists, which can easily lead to data security incidents.

2) Only known sensitive data can be identified; unknown sensitive data cannot. In the prior art, discovery rules are preset for known sensitive data fields, so potential or unknown sensitive data cannot be discovered. This creates a blind spot in sensitive data discovery and a potential data security risk.

3) Different rules must be set for different sensitive data, which makes identification cumbersome, impractical and dependent on extensive manual intervention.

4) Sensitive data is not classified and graded, which makes it inconvenient to manage.

5) The desensitization strategy must be determined manually, which requires considerable manual effort.

Disclosure of Invention

This disclosure addresses the following drawbacks of the prior art: only known sensitive data can be identified, and unknown sensitive data cannot; different sensitive data requires different identification rules and the desensitization strategy must be determined manually, which makes identification cumbersome, impractical and dependent on extensive manual intervention; and sensitive data is not classified and graded, which makes it inconvenient to manage.

To solve the above technical problem, a first aspect of the present disclosure provides a database sensitive data identification method, including:

connecting a database to obtain data to be identified;

carrying out characterization processing on the data to be identified to obtain characterization data reflecting the change of the field content information quantity in the data to be identified;

determining the sensitive data classification and desensitization range of each field content in the data to be identified according to the characterization data of each field content in the data to be identified and a data feature identification rule;

wherein the data feature identification rule comprises a correspondence between sensitive data classifications and discrimination rules containing characterization data ranges, and a desensitization range calculation rule.

In a further embodiment, connecting the database to obtain the data to be identified comprises:

and connecting the database according to the database connection information and the data quantity information configured by the user to acquire the data to be identified.

In a further embodiment, the characterization data includes at least:

for the data to be identified, the empty rate nullProb of each field content, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content, the maximum information amount ratio maxEntropyProp of each field, the maximum length lmax of each field content, and the maximum reserved length keepLen of each field content.

In a further embodiment, performing a characterization process on the data to be identified to obtain characterization data reflecting a change in information quantity of field content in the data to be identified, including:

circularly processing the data to be identified according to fields to obtain the empty rate nullProb of each field content, the maximum length lmax of each field content, the original information quantity originalEntropy of each field content and the length information quantity lenEntropy of each field content;

Calculating the maximum information quantity ratio maxEntropyProp of each field according to the original information quantity originalEntropy of each field;

and carrying out cyclic processing on the data to be identified according to the intercepting length of the field content to obtain the maximum reserved length keepLen of each field content.

In a further embodiment, calculating the maximum information amount ratio maxEntropyProp of each field according to the original information amount originalEntropy of each field comprises:

selecting the largest of the original information amounts originalEntropy of all field contents as the maximum original information amount maxEntropy;

taking the ratio of the original information amount originalEntropy of each field to the maximum original information amount maxEntropy as the maximum information amount ratio maxEntropyProp of that field.

In a further embodiment, performing a cyclic processing on the data to be identified according to a length of intercepting field content to obtain a maximum reserved length keepLen of each field content, including:

for each field, intercepting data from each data item in the field content using interception lengths from 1 up to the maximum length lmax of the field content in turn, to obtain a plurality of sub-contents;

calculating the information entropy splitEntropy of each sub-content in each field one by one using an information entropy function;

calculating the information amount ratio lenEntProp of each sub-content in each field from the information entropy splitEntropy of that sub-content and the original information amount originalEntropy of the field content;

and determining the maximum reserved length keepLen of each field content according to the interception length at which the sub-content information amount ratio lenEntProp in that field is larger than a preset value.
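The interception loop above can be sketched in Python as follows. This is only one possible reading of the embodiment: it assumes that interception takes the leading characters of each data item and that keepLen is the smallest interception length whose ratio lenEntProp exceeds the preset value. The function names and the threshold value 0.9 are hypothetical and do not come from the disclosure.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    total = len(values)
    counts = Counter(values)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def keep_len(items, threshold=0.9):
    """items: non-empty list of strings from one field.
    threshold: hypothetical preset value for lenEntProp."""
    original_entropy = entropy(items)
    lmax = max(len(v) for v in items)
    if original_entropy == 0:
        # Assumption: a constant field has no information to protect.
        return lmax
    for cut in range(1, lmax + 1):
        # Slicing returns the whole item when it is shorter than the cut.
        split_entropy = entropy([v[:cut] for v in items])
        len_ent_prop = split_entropy / original_entropy
        if len_ent_prop > threshold:
            return cut
    return lmax

sample = ["1101", "1102", "1203", "1204"]
k_default = keep_len(sample)
k_loose = keep_len(sample, threshold=0.4)
```

In this sample the first two characters already separate the items into two groups, so a low threshold stops at length 2, while the default threshold is only exceeded once all four characters are included.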

In a further embodiment, cyclically processing the data to be identified by field to obtain the empty rate nullProb of each field content comprises: counting, for each field content of the data to be identified, the total number of data items and the number of empty data items;

dividing the number of empty data items in each field content by the total number of the data items to obtain the empty rate nullProb of each field content.

In a further embodiment, performing a cyclic processing on the data to be identified according to fields to obtain a maximum length lmax of contents of each field, including:

counting the character string length of each data item in each field content of the data to be identified;

the maximum string length in each field content is taken as the maximum length lmax of each field content.

In a further embodiment, cyclically processing the data to be identified by field to obtain the length information amount lenEntropy of each field content comprises:

inputting each field content of the data to be identified into an information entropy tool function, and calculating the length information amount lenEntropy of each field content.

In a further embodiment, determining the classification and desensitization range of sensitive data of each field content in the data to be identified according to the characteristic data and the data characteristic identification rule of each field content in the data to be identified includes:

determining the classification of sensitive data of each field content according to the empty rate nullProb of each field content, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content, the maximum information amount occupation ratio maxEntropyProp of each field and the corresponding relation in the data to be identified;

and determining the desensitization range of each field according to the maximum length lmax of each field content in the data to be identified, the maximum reserved length keepLen of each field content and the desensitization range calculation rule.

In a further embodiment, determining the sensitive data classification of each field content according to the empty rate nullProb of each field content in the data to be identified, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content, the maximum information amount ratio maxEntropyProp of each field and the correspondence includes:

searching the correspondence for a discrimination rule that is satisfied by the empty rate nullProb of each field content, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content and the maximum information amount ratio maxEntropyProp of each field in the data to be identified;

and taking the sensitive data classification corresponding to the searched discrimination rule as the sensitive data classification of the field content in the data to be identified.
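The rule lookup described above might be sketched as follows. The rule table, its thresholds and the level labels are purely hypothetical placeholders chosen for illustration; the disclosure does not give concrete ranges at this point, and the first-match-wins ordering is likewise an assumption.

```python
# Hypothetical correspondence: (classification, {feature: (low, high)}).
# Every number below is an illustrative placeholder, not a disclosed value.
RULES = [
    ("level 5", {"nullProb": (0.0, 0.2), "maxEntropyProp": (0.9, 1.0)}),
    ("level 3", {"maxEntropyProp": (0.5, 0.9)}),
    ("level 1", {}),  # fallback rule with no range constraints
]

def classify(features, rules=RULES):
    """Return the classification of the first rule whose ranges all hold."""
    for label, ranges in rules:
        if all(low <= features[name] <= high
               for name, (low, high) in ranges.items()):
            return label
    return None

id_like = classify({"nullProb": 0.0, "maxEntropyProp": 1.0})
sparse = classify({"nullProb": 0.5, "maxEntropyProp": 0.95})
```

Keeping the rules as data rather than code matches the idea of a single general rule set that is applied unchanged to every field.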

In a further embodiment, determining the desensitization range of each field according to the maximum length lmax of each field content in the data to be identified, the maximum reserved length keepLen of each field content and the rule for calculating the desensitization range includes:

and determining a starting desensitization position according to the rule of computing the desensitization range and the maximum reserved length keepLen of each field content, and determining an ending desensitization position according to the maximum length lmax of each field content.

In a second aspect herein, there is provided a database sensitive data identification system comprising:

the database connection module is used for connecting the database to acquire data to be identified;

the characterization processing module is used for performing characterization processing on the data to be identified to obtain characterization data reflecting the change of the field content information quantity in the data to be identified;

the sensitive data identification module is used for determining the sensitive data classification and the desensitization range of each field content in the data to be identified according to the characterization data of each field content in the data to be identified and the data feature identification rule;

In a further embodiment, the characterization data includes at least:

In a further embodiment, the characterization processing module includes:

the first characterization unit is used for carrying out cyclic processing on the data to be identified according to fields to obtain a null rate nullProb of each field content, a maximum length lmax of each field content, an original information amount originalEntropy of each field content and a length information amount lenEntropy of each field content;

a second characterization unit, configured to calculate the maximum information amount ratio maxEntropyProp of each field according to the original information amount originalEntropy of each field;

And the third characterization unit is used for carrying out cyclic processing on the data to be identified according to the intercepting length of the field content to obtain the maximum reserved length keepLen of each field content.

In a further embodiment, the second characterization unit calculates the maximum information amount ratio maxEntropyProp of each field according to the original information amount originalEntropy of each field, including:

In a further embodiment, the third characterization unit performs a cyclic processing on the data to be identified according to a field content interception length to obtain a maximum reserved length keepLen of each field content, and includes:

In a further embodiment, the sensitive data identification module comprises:

the sensitive data classification and identification unit is used for determining the sensitive data classification of each field content according to the empty rate nullProb of each field content, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content and the maximum information amount ratio maxEntropyProp of each field in the data to be identified, and the correspondence;

the desensitization range identification unit is used for determining the desensitization range of each field according to the maximum length lmax of each field content in the data to be identified, the maximum reserved length keepLen of each field content and the desensitization range calculation rule.

In a further embodiment, the sensitive data classification and identification unit is specifically configured to:

In a further embodiment, the desensitization range identification unit is specifically configured to:

according to the desensitization range calculation rule, determining a desensitization starting position according to the maximum reserved length keepLen of each field content; the end desensitization position is determined according to the maximum length lmax of the contents of each field.

A third aspect herein provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding embodiments when the computer program is executed.

A fourth aspect herein provides a computer readable storage medium storing a computer program for executing a method according to any one of the preceding embodiments when executed by a processor.

The database sensitive data identification method and the database sensitive data identification system provided by the invention can obtain the characterization data reflecting the change of the information quantity of the field content in the data to be identified by carrying out characterization processing on the data to be identified, and determine the classification and the desensitization range of the sensitive data of each field content in the data to be identified according to the characterization data and the data characteristic identification rule of each field content in the data to be identified, thereby realizing the following technical effects:

1) Sensitive data can be identified even when the data is not fully understood, and potential sensitive data can be found in massive data;

2) A general, unified data feature identification rule is adopted, which lowers the barrier to use and reduces the workload;

3) Sensitive data is finely classified, which facilitates further fine-grained management of the data;

4) A suggested desensitization range for sensitive data can be given while ensuring a high degree of desensitization security, which largely resolves the difficult trade-off between data security and data mining analysis.

The foregoing and other objects, features and advantages will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.

Drawings

In order to more clearly illustrate the embodiments herein or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments herein and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.

FIG. 1 illustrates a flow chart of a database sensitive data identification method of embodiments herein;

FIG. 2 illustrates a two-dimensional block diagram of data to be identified in accordance with an embodiment herein;

FIG. 3 illustrates a flow chart of a sensitive data classification and desensitization scope determination process of embodiments herein;

FIG. 4 is a schematic illustration showing the recognition results of the embodiments herein;

FIG. 5 illustrates a block diagram of a database sensitive data identification system of embodiments herein;

FIG. 6 illustrates a block diagram of a characterization processing module of an embodiment herein;

FIG. 7 illustrates a block diagram of a sensitive data identification module of an embodiment herein;

FIG. 8 illustrates a block diagram of a computer device of embodiments herein.

Description of the drawings:

802. a computer device;

804. a processor;

806. a memory;

808. a driving mechanism;

810. an input/output module;

812. an input device;

814. an output device;

816. a presentation device;

818. a graphical user interface;

820. a network interface;

822. a communication link;

824. a communication bus.

Detailed Description

The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the disclosure. All other embodiments, based on the embodiments herein, which a person of ordinary skill in the art would obtain without undue burden, are within the scope of protection herein.

The terms used herein are explained as follows:

Data to be identified: data in a user database, or any data stored for use by end users. It may or may not be in machine-readable form and may be unprocessed or simplified data.

Sensitive data: personal sensitive information, including name, ID card number, address, telephone number, bank account number, email address, password, medical information, educational background, etc.

Data desensitization: transforming certain sensitive information according to desensitization rules so that sensitive private data is reliably protected.

FIG. 1 shows a flowchart of a database sensitive data identification method of embodiments herein. This embodiment can discover sensitive data even when the overall picture of the data is not known, reduces blind spots in sensitive data discovery and the workload of manual intervention, and has good generality and application value. Specifically, the database sensitive data identification method comprises the following steps:

Step 100, connecting the database to obtain the data to be identified.

In implementation, the database may be connected according to a configuration file that records database connection information and related performance parameter information. The database connection information includes the driver type of the database, for example, necessary connection information such as com. With the database connection information, a specified database can be connected; supported database types include, for example, the MySQL relational database and Hadoop-based big data warehouses such as Hive. The related performance parameter information can limit the amount of data acquired from each table, which reduces the computing resources used and ensures execution efficiency.

After the database is connected, the metadata of the tables is obtained first, including all table names and the field names, field types and field comments of each table. The data content of the tables is then obtained, and the data content and the metadata of the tables together form the data to be identified. As shown in fig. 2, the data content of a table has an m×n two-dimensional structure, where m is the number of rows and n is the number of columns. In general, the columns of a table are synonymous with its fields and the two terms are used interchangeably, so the set of field contents of a table is the set of its column data, denoted {col1, col2, …, coli, …, coln}, where coli is the i-th column of data and i ∈ {1, 2, …, n}. The set corresponding to coli is denoted {d1i, d2i, …, dji, …, dmi}, where dji is the j-th data item of the column and j ∈ {1, 2, …, m}.
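A minimal sketch of step 100 follows, assuming Python's built-in sqlite3 module as a stand-in for the MySQL or Hive connections mentioned above. The function name `load_data_to_identify`, the `row_limit` parameter (playing the role of the configured data volume) and the demo table are illustrative assumptions; field comments are omitted because SQLite does not store them.

```python
import sqlite3

def load_data_to_identify(conn, row_limit):
    """Return {table: {"fields": [(name, type)], "columns": {name: [values]}}}."""
    result = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for table in tables:
        # Metadata first: PRAGMA table_info yields (cid, name, type, ...).
        meta = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        fields = [(m[1], m[2]) for m in meta]
        # Then the content, capped by the configured data volume.
        rows = conn.execute(
            f'SELECT * FROM "{table}" LIMIT ?', (row_limit,)).fetchall()
        # Transpose the m x n rows into per-field column sets; non-character
        # data is converted to strings and empty items stay None.
        columns = {
            name: [None if row[i] is None else str(row[i]) for row in rows]
            for i, (name, _type) in enumerate(fields)
        }
        result[table] = {"fields": fields, "columns": columns}
    return result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [("Ann", 31), ("Bob", None), ("Cy", 45)])
data = load_data_to_identify(conn, row_limit=2)
```

Storing the result column by column mirrors the {col1, …, coln} view of the table, which is the shape the later per-field calculations operate on.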

Step 200, performing characterization processing on the data to be identified to obtain characterization data reflecting the change in the information content of the field contents in the data to be identified.

This step is the foundation of sensitive data identification. If the data to be identified contains non-character data, that data must be converted into character data before this step is performed.

Characterization processing converts the data to be identified, through a defined calculation process, into a small amount of characterization data that reflects the change in the information content of the field contents. The characterization data is not specifically limited herein: any data that can reflect the change in the information content of the field contents in the data to be identified, such as the empty rate or the information amount (obtained by information entropy calculation), is characterization data as described herein.

Step 300, determining the classification and desensitization range of sensitive data in each field of the data to be identified according to the characterization data and the data feature identification rule of the content of each field in the data to be identified, wherein the data feature identification rule comprises the corresponding relation between the classification of the sensitive data and the discrimination rule containing the characterization data range and the desensitization range calculation rule.

In detail, the sensitive data classifications may be set in advance according to the requirements of the industry to which the data belongs, and the data feature identification rule may be set in advance according to the characterization data; neither is limited herein. Different industries and different characterization data may call for different data feature identification rules.

In some embodiments, the sensitive data classification may be represented by a sensitive data level, for example levels 1 to 5, where a higher level indicates higher sensitivity. In other embodiments, the sensitive data classification may be represented by a sensitive data level name, for example: no sensitive information, specifiable sensitive information, semi-identifying information, identifiable information, and so on. The desensitization range limits which characters are desensitized, and in implementation the field content can be desensitized according to it. For example, if the desensitization range determined for an address field is 9_16, the 10th to 16th characters need to be encrypted. In implementation, desensitization may be performed by encryption; the specific desensitization method is not limited herein.
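As an illustration of how a range such as 9_16 might be produced and applied, the sketch below builds the range string from keepLen and lmax and then replaces the affected characters. Masking with `*` is used here only as a simple stand-in for whatever desensitization method is chosen (the text leaves the method open), and both function names are hypothetical.

```python
def desensitization_range(keep_len, lmax):
    """Range string 'start_end': the first keep_len characters are kept and
    characters keep_len+1 .. lmax are desensitized."""
    return f"{keep_len}_{lmax}"

def mask(value, range_str, fill="*"):
    """Replace characters start+1 .. end (1-based) of value with fill."""
    start, end = (int(part) for part in range_str.split("_"))
    # Values shorter than the range are masked only as far as they extend.
    masked = fill * max(0, min(end, len(value)) - start)
    return value[:start] + masked + value[end:]

rng = desensitization_range(9, 16)
masked_value = mask("ABCDEFGHIJKLMNOPQR", rng)
```

With the range 9_16, the first nine characters and anything after the 16th are left untouched, which matches the "10th to 16th characters" reading of the example above.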

Taking a personal information table with more than one hundred fields as an example, the classification and desensitization range of sensitive data can be rapidly identified through steps 100-300.

The database sensitive data identification method provided by the embodiment can realize the following technical effects:

1. The scheme provides a general data feature identification rule for identifying and classifying sensitive data, so no specific identification rule needs to be preset before use. All database table data can be scanned and sensitive data classifications identified automatically simply by configuring the database connection information. Because the rule does not target a specific field or a specific range of fields, practicality and convenience are effectively improved, the manual workload and entry barrier of setting rules are reduced, and the risk of missing sensitive data that preset identification rules do not cover is avoided. The benefit is more pronounced when there are many databases and tables: a large amount of labor cost is saved, and the distribution of sensitive data can be fully understood in a short time.

2. The sensitive data classification can be refined based on the unified data characteristic recognition rule, and the sensitive data classification problem is solved.

3. A suggested desensitization range is given based on characterization data that reflects the change in the information content of the field contents. The desensitization range preserves as much of the mining value of the data content as possible while keeping the field content as secure as possible. Performing data mining and analysis on the desensitized data makes it easier to provide a data analysis application environment and reduces the cost of data usage environments such as dedicated network machines (that is, computers dedicated to specific content and to data analysis) and camera monitoring.

In an embodiment herein, to improve identification accuracy and speed, the characterization data obtained in step 200, which reflects the change in the information content of the field contents in the data to be identified, is shown in Table 1 and includes at least: the empty rate nullProb of each field content, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content, the maximum information amount ratio maxEntropyProp of each field, the maximum length lmax of each field content and the maximum reserved length keepLen of each field content. Taking an identification number field as an example, the characterization data obtained is, for example, [0,0.069890771,8.897845456,1.0,9,18].

TABLE 1

Abbreviation	Name	Use
nullProb	Empty rate of field content	Sensitive data identification
lenEntropy	Length information amount of field content	Sensitive data identification
originalEntropy	Original information amount of field content	Sensitive data identification
maxEntropyProp	Maximum information amount ratio of field	Sensitive data identification
keepLen	Maximum reserved length of field content	Start of desensitization range
lmax	Maximum length of field content	End of desensitization range

The calculation of these features is described in detail below:

1) The empty rate nullProb of each field content, the maximum length lmax of each field content, the original information amount originalEntropy of each field content, and the length information amount lenEntropy of each field content.

The data to be identified is processed cyclically by field, that is, the contents of each field are processed one by one, to obtain the empty rate nullProb of each field content, the maximum length lmax of each field content, the original information amount originalEntropy of each field content and the length information amount lenEntropy of each field content. Specifically, the calculation process includes:

(1) Empty rate nullProb of each field content

For any field content coli, count the total number of data items and the number of empty data items dji in coli, divide the number of empty data items by the total number of data items, and take the resulting proportion as the empty rate nullProb of the field content coli.

(2) Maximum length lmax of each field content

For any field content coli, the character string length of each data item dji in coli is counted, and the maximum of the resulting set of lengths {l1, l2, …, lm} is taken as the maximum length lmax of the field content.

(3) The original information amount originalEntropy of each field content

For any field content coli, the field content coli is input into an information entropy tool function H(U), and the original information amount originalEntropy of the field content is calculated.

The original information amount describes the degree of uncertainty of the field content. Information entropy is used as follows: for a data source, it measures the average uncertainty over all symbols the source may emit. For example, the simplest single-symbol source takes only two values, 0 and 1, with probabilities P and q = 1 - P. If the source symbol has n possible values U1, …, Ui, …, Un with corresponding probabilities P1, …, Pi, …, Pn, and the occurrences of the symbols are independent of each other, then the average uncertainty of the data source is the statistical average (E) of the single-symbol uncertainty -log Pi, which is called the information entropy, i.e.

$$H(U) = E[-\log_2 P_i] = -\sum_{i=1}^{n} P_i \log_2 P_i$$

where the logarithm is taken to base 2 and the unit is the bit.
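As a worked instance of this definition, for the two-symbol source mentioned above, with probabilities P and q = 1 - P, the entropy reduces to the binary entropy function:

$$H(U) = -P \log_2 P - (1 - P) \log_2 (1 - P)$$

It reaches its maximum of 1 bit at P = 0.5 and falls to 0 as P approaches 0 or 1 (taking 0 log 0 = 0), so a field whose values are almost always the same carries almost no information.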

(4) Content length information quantity lenEntropy of each field

For any field content coll, the length set { l1, l2, …, lm } of the data item character strings in the field is determined and input into the information entropy tool function H (U), and the field content length information amount lenEntropy is obtained by calculation. The length information amount describes the degree of uncertainty of the character length of the data items in the field.
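The four per-field quantities described in (1) to (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names `entropy` and `field_stats` are hypothetical, the information entropy tool function H (U) is assumed to be the Shannon entropy of the empirical value frequencies, and both None and the empty string are assumed to count as empty data items that are excluded from the length and entropy calculations.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    values = list(values)
    if not values:
        return 0.0
    total = len(values)
    # Sum of p * log2(1 / p) over the distinct values.
    return sum((c / total) * math.log2(total / c) for c in Counter(values).values())


def field_stats(column):
    """Return nullProb, lmax, originalEntropy and lenEntropy for one field content."""
    total = len(column)
    # Assumption: None and "" are the empty data items.
    non_empty = [str(v) for v in column if v is not None and str(v) != ""]
    lengths = [len(s) for s in non_empty]
    return {
        "nullProb": (total - len(non_empty)) / total if total else 0.0,
        "lmax": max(lengths, default=0),
        "originalEntropy": entropy(non_empty),
        "lenEntropy": entropy(lengths),
    }


# One empty item out of four; three distinct values with three distinct lengths.
stats = field_stats(["5", "35", "101", None])
```

Here `stats["nullProb"]` is 0.25 and `stats["lmax"]` is 3, while both entropies equal log2(3) because the three non-empty values and their three lengths are all distinct.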

2) Maximum information amount ratio maxEntropyProp of each field

The maximum information amount ratio maxEntropyProp of each field is calculated according to the original information amount originalEntropy of each field. Specifically, the calculation process includes:

The largest value among the original information amounts originalEntropy of all field contents is selected as maxEntropy. For each field, the ratio of its original information amount originalEntropy to maxEntropy is calculated, and the resulting ratio is the maximum information amount ratio maxEntropyProp of the field, whose value ranges from 0 to 1.
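The ratio calculation above can be sketched as follows. The function name `max_entropy_prop`, the field names and the entropy values are hypothetical and chosen only for illustration; returning 0 for every field when all entropies are zero is an assumption, since the source does not cover that case.

```python
def max_entropy_prop(original_entropy_by_field):
    """Map each field name to originalEntropy / maxEntropy."""
    max_entropy = max(original_entropy_by_field.values(), default=0.0)
    if max_entropy == 0:
        # Assumption: no field carries information, so every ratio is 0.
        return {name: 0.0 for name in original_entropy_by_field}
    return {name: h / max_entropy for name, h in original_entropy_by_field.items()}


# Made-up entropies (in bits) for three illustrative fields.
props = max_entropy_prop({"id_number": 9.5, "gender": 1.0, "branch": 4.75})
```

The field with the largest original information amount always receives a ratio of 1, and every other field is scaled relative to it.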

3) Maximum reservation length keepLen of each field content

The data to be identified is processed cyclically according to the interception length of the field content to obtain the maximum reserved length keepLen of each field content. In detail, the interception length is the number of characters to be extracted from each data item of the field content: if the length of a data item is smaller than the interception length, the whole data item is taken, and if the length of a data item is larger than the interception length, only the first characters up to the interception length are taken from the data item. Specifically, the calculation process of the maximum reserved length keepLen of each field content comprises the following steps:

(1) For each field, data is intercepted from each data item of the field content with the interception length running from 1 to the maximum length lmax of the field content, so as to obtain a plurality of sub-contents. For example, as shown in table 2, the maximum length lmax of the content of a certain field is 3, data is intercepted from each data item in the field with interception lengths from 1 to 3, and the obtained sub-contents are {5,3,1}, {5,35,10}, {5,35,101}.

TABLE 2

Age
5
35
101

For the content of each field, as the interception length increases, more of each data item is retained, that is, the amount of retained information increases. This amount of information can be measured by information entropy, and the way it grows with the interception length reflects the regular pattern of change in the data content.

(2) The information entropy splitEntropy of each sub-content in each field is calculated one by one using the information entropy function.

(3) The information amount ratio lenEntProp of each sub-content in each field is calculated from the information entropy splitEntropy of each sub-content in each field and the original information amount originalEntropy of each field.

In specific implementation, dividing the information entropy splitEntropy of each sub-content in a field by the original information amount originalEntropy of that field gives the information amount ratio lenEntProp of each sub-content in the field.

(4) The maximum reserved length keepLen of each field content, namely the position at which desensitization of the field content starts, is determined according to the interception length corresponding to a sub-content information amount ratio lenEntProp greater than a predetermined value.

The predetermined value may be set according to actual requirements, and in some embodiments, the predetermined value is 0.9.
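Steps (1) to (4) can be sketched as below. This is one possible reading rather than the definitive procedure: the source does not say whether keepLen is the first interception length whose ratio exceeds the predetermined value or the length just before it, and this sketch takes the first such length itself. The names `entropy` and `keep_len`, the sample values, and the choice to return 0 for a field with zero original information amount are all assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    values = list(values)
    if not values:
        return 0.0
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in Counter(values).values())


def keep_len(column, threshold=0.9):
    """Return (keepLen, list of lenEntProp for interception lengths 1..lmax)."""
    items = [str(v) for v in column if v is not None and str(v) != ""]
    lmax = max((len(s) for s in items), default=0)
    original_entropy = entropy(items)
    if original_entropy == 0:
        # Assumption: a constant or empty field has no ratio to compute.
        return 0, []
    ratios = []
    for k in range(1, lmax + 1):
        sub_content = [s[:k] for s in items]          # step (1): prefixes of length k
        split_entropy = entropy(sub_content)          # step (2): splitEntropy
        ratios.append(split_entropy / original_entropy)  # step (3): lenEntProp
    for k, ratio in enumerate(ratios, start=1):       # step (4): first ratio above threshold
        if ratio > threshold:
            return k, ratios
    return lmax, ratios


# Made-up 11-character values that share leading characters.
samples = ["13800001111", "13800002222", "13900003333", "13900004444"]
k, ratios = keep_len(samples)
```

With these made-up samples the first seven characters separate the values into at most two groups, so the ratio stays at or below 0.5, and only the eighth character distinguishes all four values, which is where the ratio first exceeds 0.9.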

Further, as shown in fig. 3, the step 300 of determining the classification and desensitization range of the sensitive data of each field content in the data to be identified according to the characteristic data and the data characteristic recognition rule of each field content in the data to be identified includes:

Step 310, determining the sensitive data classification of each field content according to the null rate nullProb of each field content, the maximum information amount ratio maxEntropyProp of each field, the length information amount lenEntropy of each field content and the original information amount originalEntropy of each field content in the data to be identified, together with the corresponding relation.

The classification of the sensitive data is represented by the sensitive data level and the sensitive data level name, the division of the sensitive data is shown in table 3, and the data characteristic recognition rule is shown in table 4.

TABLE 3

In this embodiment, the sensitive data is divided into 5 sensitivity classifications, ordered from high to low according to the sensitivity of the data. Compared with a two-way split into sensitive and insensitive, a finer classification supports more detailed grading of the data and makes it easier to select a suitable desensitization strategy for each classification. The sensitive data classification can also serve as supplementary information for a data dictionary and as a reference for selecting fields when data is used, especially when first working with a wide table that has many fields. When a data center holds a large number of data tables, the fields of the tables can be screened by sensitive data classification, so that the distribution of the core fields across the tables can be grasped quickly, the utilization of the table data can generally be improved, and the loss of value caused by data lying dormant can be avoided.

TABLE 4

The values of the characterization parameters in the discriminant rules shown in table 4 may be configured according to requirements, which are not particularly limited herein.

In detail, the specific implementation process of the step 310 includes: searching the corresponding relation for the discrimination rule that matches the null rate nullProb of each field content, the maximum information amount ratio maxEntropyProp of each field, the length information amount lenEntropy of each field content and the original information amount originalEntropy of each field content in the data to be identified; and taking the sensitive data classification corresponding to the matched discrimination rule as the sensitive data classification of the corresponding field in the data to be identified.
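The rule lookup in step 310 can be sketched as a first-match search over range rules. Because the contents of tables 3 and 4 are not reproduced here, every level number and threshold below is a hypothetical placeholder rather than a value from the patent, and the names `Rule`, `RULES` and `classify` are likewise illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One discrimination rule: a sensitivity level plus inclusive feature ranges."""
    level: int
    ranges: dict = field(default_factory=dict)  # feature name -> (low, high)

    def matches(self, features):
        return all(lo <= features[name] <= hi for name, (lo, hi) in self.ranges.items())


# Hypothetical placeholder thresholds; the real values would come from table 4.
RULES = [
    Rule(5, {"nullProb": (0.0, 0.1), "maxEntropyProp": (0.9, 1.0), "lenEntropy": (0.0, 0.5)}),
    Rule(3, {"nullProb": (0.0, 0.5), "maxEntropyProp": (0.5, 0.9)}),
    Rule(1),  # fallback rule with no ranges, so it matches any field
]


def classify(features, rules=RULES):
    """Return the level of the first rule whose ranges all contain the features."""
    for rule in rules:
        if rule.matches(features):
            return rule.level
    return None


level = classify(
    {"nullProb": 0.0, "maxEntropyProp": 1.0, "lenEntropy": 0.0, "originalEntropy": 9.5}
)
```

Keeping the thresholds in a data structure rather than in code matches the statement that the parameter values in the discrimination rules may be configured according to requirements.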

Step 320, determining the desensitization range of each field according to the maximum length lmax of each field content, the maximum reserved length keepLen of each field content and the rule for computing the desensitization range in the data to be identified.

In detail, the desensitization range calculation rule specifies the following. The starting desensitization position is determined according to the maximum reserved length keepLen of each field content; in implementation, the character immediately following the maximum reserved length keepLen can be used as the starting desensitization position. The ending desensitization position is determined according to the maximum length lmax of each field content; in implementation, the maximum length lmax of each field content can be used as the ending desensitization position.
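As a small illustration of this rule, the sketch below returns the range as 1-based inclusive character positions. The function name `desensitization_range` and the choice to return None when nothing is left to desensitize are assumptions.

```python
def desensitization_range(keep_len, lmax):
    """Return (start, end) as 1-based inclusive character positions, or None."""
    if keep_len >= lmax:
        # Assumption: every character is reserved, so there is nothing to desensitize.
        return None
    # Start at the character right after the reserved prefix, end at lmax.
    return keep_len + 1, lmax


positions = desensitization_range(9, 18)
```

For a field with keepLen 9 and lmax 18, the result is (10, 18), meaning the tenth through eighteenth characters are proposed for desensitization.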

The database sensitive data identification method provided by the invention can obtain the characterization data reflecting the change of the information quantity of the field content in the data to be identified by carrying out characterization processing on the data to be identified, and determines the classification and the desensitization range of the sensitive data of each field content in the data to be identified according to the characterization data and the data characteristic identification rule of each field content in the data to be identified, thereby realizing the following technical effects:

In an embodiment herein, to make it easy for the user to view and share the identification result, the identification result (the sensitive data classification and the desensitization range) may be output to a result file under the configured directory. After the identification result is output, it is associated with the metadata information of the object table (namely the data to be identified before identification), and the association result is written into the result file. The table structure of the result file is { database table name, field type, field comment, content empty rate, content maximum length, execution time, sensitive data level name, desensitization range }. As shown in fig. 4, the desensitization range may be expressed in the form "m_n", where "m_n" denotes a character string of length n whose subscripts start from 0 and whose characters are desensitized starting from subscript m. For example, the effect of hiding part of the characters of an 18-digit identity card number according to the proposed desensitization range is "432522199". This embodiment can guide subsequent data security applications.
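Applying a range in the "m_n" form can be sketched as follows. The function name `mask`, the use of "*" as the masking character, and the sample value (whose last nine digits are made up) are assumptions for illustration; under this reading, an 18-character value whose first nine characters are kept would correspond to the range "9_18".

```python
def mask(value, desens_range, mask_char="*"):
    """Hide the characters of `value` from subscript m up to length n for range "m_n"."""
    m_text, n_text = desens_range.split("_")
    m, n = int(m_text), int(n_text)
    hidden = value[m:n]  # characters at subscripts m .. n-1 (0-based)
    return value[:m] + mask_char * len(hidden) + value[n:]


# Made-up 18-character sample; only the first nine characters remain visible.
masked = mask("432522199012345678", "9_18")
```

Values shorter than n are handled by slicing, so only the characters that actually exist beyond subscript m are replaced.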

Based on the same inventive concept, there is also provided herein a database sensitive data identification system, as described in the following embodiments. Because the principle by which the database sensitive data identification system solves the problem is similar to that of the database sensitive data identification method, the implementation of the system can refer to the method, and repeated details are omitted. Specifically, as shown in fig. 5, the database sensitive data identification system includes:

the database connection module 510 is configured to connect to a database to obtain data to be identified.

The characterization processing module 520 is configured to perform characterization processing on the data to be identified, so as to obtain characterization data reflecting the change in the amount of field content information in the data to be identified.

The sensitive data identification module 530 is configured to determine a classification and a desensitization range of sensitive data of each field content in the data to be identified according to the characteristic data and the data characteristic identification rule of each field content in the data to be identified;

In a specific embodiment, the characterization data at least includes: the null rate nullProb of each field content, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content, the maximum information amount ratio maxEntropyProp of each field, the maximum length lmax of each field content and the maximum reserved length keepLen of each field content in the data to be identified. The null rate nullProb of each field content, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content, and the maximum information amount ratio maxEntropyProp of each field are used for identifying the sensitive data classification of the field content, and the maximum length lmax of each field content and the maximum reserved length keepLen of each field content are used for identifying the desensitization range of the field.

Correspondingly, as shown in fig. 6, the characterization processing module 520 includes:

The first characterization unit 521 is configured to process the data to be identified cyclically by field, so as to obtain the null rate nullProb of each field content, the maximum length lmax of each field content, the original information amount originalEntropy of each field content, and the length information amount lenEntropy of each field content.

In implementation, the cyclic processing of the data to be identified by field performed by the first characterization unit 521 includes processing the content coll of each field in the data to be identified one by one as follows: counting the proportion of empty data items dji in the field content coll, and taking this proportion as the null rate nullProb of the field content coll; counting the length of each data item character string in the field content coll, and recording the largest value in the resulting length set { l1, l2, …, lm } as the maximum length lmax of the field content; inputting the field content coll into the information entropy tool function H (U), and calculating the original information amount originalEntropy of the field content; and determining the length set { l1, l2, …, lm } of the data item character strings in the field, inputting the length set into the information entropy tool function H (U), and calculating the content length information amount lenEntropy of the field.

The second characterization unit 522 is configured to calculate the maximum information amount ratio maxEntropyProp of each field according to the original information amount originalEntropy of each field. In implementation, the process by which the second characterization unit 522 calculates the maximum information amount ratio maxEntropyProp of each field includes: selecting the largest value among the original information amounts originalEntropy of all field contents as maxEntropy; and taking the ratio of the original information amount originalEntropy of each field to maxEntropy as the maximum information amount ratio maxEntropyProp of that field.

The third characterization unit 523 is configured to process the data to be identified cyclically according to the interception length of the field content, so as to obtain the maximum reserved length keepLen of each field content. In implementation, the calculation process of the third characterization unit 523 includes: for each field, intercepting data from each data item in the field content with the interception length running from 1 to the maximum length lmax of the field content, to obtain a plurality of sub-contents; calculating the information entropy splitEntropy of each sub-content in each field one by one using the information entropy function; calculating the information amount ratio lenEntProp of each sub-content in each field according to the information entropy splitEntropy of each sub-content in each field and the original information amount originalEntropy of each field content; and determining the maximum reserved length keepLen of each field content according to the interception length corresponding to a sub-content information amount ratio lenEntProp larger than a preset value in each field.

As shown in fig. 7, fig. 7 illustrates a block diagram of a sensitive data identification module of embodiments herein. Specifically, the sensitive data identification module includes:

The sensitive data classification and identification unit 531 is configured to determine the sensitive data classification of each field according to the null rate nullProb of each field content, the maximum information amount ratio maxEntropyProp of each field, the length information amount lenEntropy of each field content and the original information amount originalEntropy of each field content in the data to be identified, together with the corresponding relation. Specifically, the processing procedure of the sensitive data classification and identification unit 531 includes: searching the corresponding relation for the discrimination rule that matches the null rate nullProb of each field content, the maximum information amount ratio maxEntropyProp of each field, the length information amount lenEntropy of each field content and the original information amount originalEntropy of each field content in the data to be identified; and taking the sensitive data classification corresponding to the matched discrimination rule as the sensitive data classification of the corresponding field in the data to be identified.

The desensitization range identifying unit 532 is configured to determine the desensitization range of each field according to the maximum length lmax of each field content in the data to be identified, the maximum reserved length keepLen of each field content, and the rule for computing the desensitization range. Specifically, the desensitization range identifying unit 532 is specifically configured to determine, according to the rule for computing the desensitization range, a starting desensitization position according to the maximum reservation length keepLen of the contents of each field; and determining the end desensitization position according to the maximum length lmax of the contents of each field.

The database sensitive data identification system provided herein can achieve the following technical effects:

1) Under the condition of incompletely mastering the data, the sensitive data discovery identification can be implemented, and the potential sensitive data in the massive data can be discovered.

2) The sensitive data identification rule can be not preset for specific data fields, so that the use threshold can be reduced and the workload can be reduced.

3) Sensitive data can be finely classified in a grading way, and further fine management of the data is facilitated.

4) Under the condition of ensuring the data to have stronger desensitization safety degree, the proposal desensitization range of sensitive data can be given, and the trade-off difficult problem between the data safety and the data mining analysis can be solved to a great extent.

In one embodiment herein, as shown in FIG. 8, a computer device is also provided. The computer device 802 may include one or more processors 804, such as one or more Central Processing Units (CPUs), each of which may implement one or more hardware threads. The computer device 802 may also include any memory 806 for storing any kind of information, such as code, settings, data, etc. In one embodiment, the memory 806 stores a computer program executable on the processor 804, and the processor 804, when executing the computer program, implements the database sensitive data identification method described in any of the previous embodiments. For example, and without limitation, memory 806 may include any one or more of the following combinations: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may store information using any technique. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of computer device 802. In one case, the computer device 802 may perform any of the operations of the associated instructions when the processor 804 executes the associated instructions stored in any memory or combination of memories. The computer device 802 also includes one or more drive mechanisms 808, such as a hard disk drive mechanism, an optical disk drive mechanism, and the like, for interacting with any memory.

The computer device 802 may also include an input/output module 810 (I/O) for receiving various inputs (via input device 812) and for providing various outputs (via output device 814). One particular output mechanism may include a presentation device 816 and an associated Graphical User Interface (GUI) 818. In other embodiments, the input/output module 810 (I/O), input device 812, and output device 814 may be omitted, with the computer device acting merely as one device in a network. The computer device 802 may also include one or more network interfaces 820 for exchanging data with other devices via one or more communication links 822. One or more communications buses 824 couple the above-described components together.

The communication link 822 may be implemented in any manner, such as, for example, through a local area network, a wide area network (e.g., the internet), a point-to-point connection, etc., or any combination thereof. Communication link 822 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.

In an embodiment herein, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the database sensitive data identification method according to any of the embodiments above.

Embodiments herein also provide computer readable instructions which, when executed by a processor, cause the processor to perform the steps of the database sensitive data identification method of any of the previous embodiments.

It should be understood that, in the various embodiments herein, the sequence number of each process described above does not mean the sequence of execution, and the execution sequence of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments herein.

It should also be understood that in embodiments herein, the term "and/or" is merely one relationship that describes an associated object, meaning that three relationships may exist. For example, a and/or B may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and unit may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.

In the several embodiments provided herein, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the elements may be selected according to actual needs to achieve the objectives of the embodiments herein.

In addition, each functional unit in the embodiments herein may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions herein, in essence, or the portions contributing over the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments herein. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.

Specific examples are set forth herein to illustrate the principles and embodiments herein and are merely illustrative of the methods herein and their core ideas; also, as will be apparent to those of ordinary skill in the art in light of the teachings herein, many variations are possible in the specific embodiments and in the scope of use, and nothing in this specification should be construed as a limitation on the invention.

Claims

1. A method for identifying database sensitive data, comprising:

connecting a database to obtain data to be identified;

carrying out characterization processing on the data to be identified to obtain characterization data reflecting the change of the field content information quantity in the data to be identified, wherein the characterization data at least comprises:

the null rate nullProb of each field content, the original information amount originalEntropy of each field content, the length information amount lenEntropy of each field content, the maximum information amount ratio maxEntropyProp of each field content, the maximum length lmax of each field content and the maximum reserved length keepLen of each field content in the data to be identified;

The data characteristic recognition rule comprises a corresponding relation between sensitive data classification and a discrimination rule containing a characteristic data range and a desensitization range calculation rule;

wherein, according to the characteristic data and the data characteristic recognition rule of each field content in the data to be recognized, determining the sensitive data classification and the desensitization range of each field content in the data to be recognized comprises:

the sensitive data classification corresponding to the searched discrimination rules is used as the field content sensitive data classification in the data to be identified;

2. The method of claim 1, wherein connecting the database to obtain the data to be identified comprises:

3. The method of claim 1, wherein characterizing the data to be identified to obtain characterization data reflecting changes in the amount of field content information in the data to be identified, comprises:

4. A method according to claim 3, wherein the performing cyclic processing on the data to be identified according to fields to obtain null rate nullProb of content of each field comprises: counting the total number of data items in the contents of each field of the data to be identified and the number of empty data items;

5. A method according to claim 3, wherein the step of performing cyclic processing on the data to be identified according to fields to obtain maximum lengths lmax of contents of each field comprises:

6. A method according to claim 3, wherein the step of performing cyclic processing on the data to be identified according to fields to obtain the content length information quantity lenEntropy of each field comprises the steps of:

7. A method according to claim 3, wherein calculating the maximum information amount ratio maxEntropyProp of each field based on the original information amount originalEntropy of each field comprises:

8. A method according to claim 3, wherein the step of performing a cyclic processing on the data to be identified according to the length of interception of the field content to obtain a maximum reservation length keepLen of the content of each field comprises:

9. A database sensitive data identification system, comprising:

the characterizing processing module is used for characterizing the data to be identified to obtain characterizing data reflecting the change of the field content information quantity in the data to be identified, wherein the characterizing data at least comprises:

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 8 when executing the computer program.

11. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method of any of claims 1 to 8.