CN110598442A - Sensitive data self-adaptive desensitization method and system

Info

Publication number: CN110598442A
Application number: CN201910860749.6A
Authority: CN
Inventors: 叶卫; 黄宇腾; 戚伟强; 沈志豪; 张景明; 韦金良; 董科; 季超; 牟黎; 耿继朴; 尚天婷; 陈泽堃; 伍星宇; 陈珊; 王嘉怡
Original assignee: Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Current assignee: Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2019-12-20

Abstract

The invention discloses an adaptive desensitization method and system for sensitive data, relating to the fields of computer technology and information security. The method comprises the following steps: adding a plurality of desensitization algorithms to a desensitization server, and setting a one-to-one quantitative relation between each desensitization algorithm and each of a plurality of desensitization effects; the desensitization server receives a desensitization instruction sent by a client device and reads original data from a data source server according to the instruction; the desensitization server constructs a training set of the user's desensitization effect preferences for different sensitive data types and forms a decision tree from it; the desensitization server locates the sensitive data in the original data, determines its type, selects a desensitization algorithm for it using the decision tree, and generates replacement data for the sensitive data according to the selected algorithm. The user configuration flow is simple, and desensitization strategies can be configured intelligently and automatically, so that desensitization itself is automatic.

Description

Sensitive data self-adaptive desensitization method and system

Technical Field

The invention relates to the field of computer technology and information security, in particular to a sensitive data self-adaptive desensitization method and system.

Background

With the arrival of the data age, the enormous value of data is being mined, but this also makes private information and key sensitive data harder to protect. How to share data efficiently while keeping sensitive information from leaking has become a key issue in the intelligent development of data security.

Data desensitization changes the values of data while preserving their original characteristics, thereby protecting sensitive data from unauthorized access while still allowing the related data processing. It maintains data security, keeps the data meaningful and valid, and helps comply with data privacy regulations. With data desensitization, information can still be used and associated with the business without violating the relevant regulations, and the risk of data leakage is avoided.

Before setting desensitization rules for sensitive data fields, a user often has to learn the data desensitization strategies preset by the system, and may even have to customize and modify them, which greatly increases the user's operation and maintenance cost. Moreover, because a desensitization strategy is generally implemented by a system-specified algorithm, the user cannot anticipate what the sensitive data will look like after desensitization. In deployed desensitization systems, users are often dissatisfied with the desensitization effect on sensitive data and repeatedly modify the strategy.

Disclosure of Invention

The invention aims to remedy the defects of the prior art by providing an adaptive desensitization method and system for sensitive data that is guided by the desired desensitization effect and simplifies the user configuration flow. For dynamic desensitization scenarios, it also facilitates learning a model of the user's usage, so that desensitization strategies are configured intelligently and automatically and desensitization itself becomes automatic.

According to one aspect of the invention, an adaptive desensitization method for sensitive data comprises the following steps:

adding a plurality of desensitization algorithms to a desensitization server, and setting a one-to-one quantitative relation between each desensitization algorithm and each of a plurality of desensitization effects;

the desensitization server receives a desensitization instruction sent by a client device and reads original data from a data source server according to the instruction, wherein the desensitization instruction comprises a sensitive data type, at least one desensitization effect, and a priority ordering of the at least one desensitization effect;

the desensitization server constructs a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the received desensitization instruction, and forms a decision tree from it;

the desensitization server locates the sensitive data in the original data and determines its type, selects a desensitization algorithm for the sensitive data using the decision tree, generates replacement data for the sensitive data according to the selected algorithm, and replaces the sensitive data in the original data with the corresponding replacement data to produce desensitized data;

and the desensitization server sends the desensitized data to the client device.

Further, the sensitive data types include name, address, mailbox, telephone, certificate, account number, zip code, and date.

Further, the desensitization algorithms include several of the following: replacement, invalidation, shuffling (out-of-order), averaging, anti-association, offset, symmetric encryption, and dynamic environment control algorithms.

Further, the at least one desensitization effect includes at least one of validity, relevance, reversibility, repeatability, timeliness, and security.

According to another aspect of the invention, an adaptive desensitization system for sensitive data comprises: a client device for sending desensitization instructions and receiving desensitized data; a data source server for storing original data; and a desensitization server. The desensitization server is configured to add a plurality of desensitization algorithms and set a one-to-one quantitative relation between each desensitization algorithm and each of a plurality of desensitization effects; to receive a desensitization instruction sent by the client device and read original data from the data source server according to the instruction, the instruction comprising a sensitive data type, at least one desensitization effect, and a priority ordering of the at least one desensitization effect; to construct a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the received instruction, and form a decision tree from it; to locate the sensitive data in the original data, determine its type, select a desensitization algorithm for it using the decision tree, generate replacement data according to the selected algorithm, and replace the sensitive data in the original data with the corresponding replacement data to produce desensitized data; and to send the desensitized data to the client device.

Further, the desensitization server comprises a setting module, a decision tree forming module, and a desensitization processing module. The setting module is used for adding a plurality of desensitization algorithms and setting a one-to-one quantitative relation between each desensitization algorithm and each of a plurality of desensitization effects. The decision tree forming module is used for constructing a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the received desensitization instruction, and forming a decision tree from it. The desensitization processing module is used for locating the sensitive data in the original data and determining its type, selecting a desensitization algorithm for the sensitive data using the decision tree, generating replacement data according to the selected algorithm, and replacing the sensitive data in the original data with the corresponding replacement data to produce desensitized data.

The beneficial effects of the invention are as follows: the user is freed from heavy rule configuration work and only needs to consider the result characteristics required of the whole desensitization task and rank those requirements by priority; the system's algorithm then yields a recommended algorithm configuration for all fields, so the user configuration flow is simple. For dynamic desensitization scenarios, learning a model of the user's usage is facilitated, and desensitization strategies are configured intelligently and automatically, so that desensitization itself is automatic.

Drawings

In order to illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a diagram of an exemplary architecture of a sensitive data adaptive desensitization system according to an embodiment of the present invention.

Fig. 2 is a block diagram of a desensitization server according to an embodiment of the present invention.

Fig. 3 is a flowchart illustrating a sensitive data adaptive desensitization method according to an embodiment of the present invention.

Fig. 4 is a schematic block diagram of an apparatus for use in embodiments of the present invention.

Detailed Description

The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are some, not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.

The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

Fig. 1 shows an exemplary architecture of an adaptive desensitization system for sensitive data according to an embodiment of the present invention. As shown in Fig. 1, the desensitization system 100 includes a client device 110, a desensitization server 120, and a data source server 130. The client device 110 is connected to the desensitization server 120 through a network 140, in a wired or wireless manner, and the desensitization server 120 is connected to the data source server 130 through a network 150, in a wired or wireless manner.

The client device 110, the desensitization server 120, and the data source server 130 are physically separate from one another. Although only these three are shown in Fig. 1, the desensitization system 100 may include one or more other devices not shown, such as network elements like routers and switches.

The client device 110 may be a PC, a mobile terminal, or the like. It sends desensitization instructions, receives desensitized data, and may store and view that data. The data source server 130 stores the original data.

The desensitization server 120 may add a plurality of desensitization algorithms through an input unit, and set a one-to-one quantitative relationship between each desensitization algorithm and each desensitization effect of the plurality of desensitization effects.

The desensitization algorithms include several of the following: replacement, invalidation, shuffling (out-of-order), averaging, anti-association, offset, symmetric encryption, and dynamic environment control algorithms. The replacement algorithm replaces real data with fictitious data, for example by building a large dictionary table, generating a random factor for each real record, and substituting dictionary entries for the original content; data produced this way closely resembles real data. The invalidation algorithm replaces a real value, or part of it, with special symbols, for example masking digits 6 to 14 of an identity card number. The shuffling algorithm randomly redistributes the values of a sensitive data column, obscuring the relation between the original values and the other fields. The averaging algorithm applies to numeric data: it first computes the mean and then distributes the desensitized values randomly around the mean so that the sum of the data stays unchanged; it is typically used for cost tables, payroll tables, and the like. The anti-association algorithm looks for mappings by which a sensitive field can be inferred from other fields and desensitizes those fields as well, for example where date of birth, gender, and region can be inferred from an identity card number. The offset algorithm changes numeric data by a random shift. The symmetric encryption algorithm is a special, reversible desensitization method: the original data is encrypted with an encryption key and algorithm, the ciphertext keeps the same format as the original data under the logic rules, and the original data can be recovered with the decryption key. The dynamic environment control algorithm changes only part of the response data according to predefined rules: when business data is accessed outside the specified conditions, the data content is controlled and the content of specific fields is masked. For example, important customer information is hidden from DBA (database administrator) accounts and shown only to key users of the business module.
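Four of the algorithms described above (invalidation, shuffling, averaging, and offset) are simple enough to sketch directly. The following is a minimal illustration, not the implementation of the invention: the function names, parameters, and the sample identity card number are hypothetical, and a seeded `random.Random` is passed in only so that runs can be reproduced.

```python
import random


def invalidate(value: str, start: int, end: int, mask: str = "*") -> str:
    """Invalidation: replace the characters at positions [start, end) with a mask symbol."""
    end = min(end, len(value))
    return value[:start] + mask * max(end - start, 0) + value[end:]


def shuffle_column(values, rng: random.Random):
    """Shuffling: randomly redistribute the values of one sensitive column."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def average_preserving(values, spread: float, rng: random.Random):
    """Averaging: scatter values around the mean while keeping the column sum unchanged."""
    mean = sum(values) / len(values)
    noise = [rng.uniform(-spread, spread) for _ in values]
    shift = sum(noise) / len(noise)  # re-centre the noise so that it sums to zero
    return [mean + n - shift for n in noise]


def offset(value: float, max_shift: float, rng: random.Random) -> float:
    """Offset: move a numeric value by a random amount within +/- max_shift."""
    return value + rng.uniform(-max_shift, max_shift)


# Made-up 18-character number, masked at 0-based positions 6..13.
print(invalidate("330102199001011234", 6, 14))  # 330102********1234
```

Re-centring the noise is one simple way to keep the sum fixed; the source states only that the sum stays unchanged, not how this is achieved.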

The at least one desensitization effect includes at least one of validity, relevance, reversibility, repeatability, timeliness, and security.

Validity: depending on the specific requirements of the desensitization task, the desensitized result often has to remain valid for the business. It is not enough to generate random, meaningless text or numbers, or to simply delete, truncate, or mask values. The business attributes of the original data must be preserved to some extent for the scenario in which the data is used, and in some scenarios even the sampling characteristics of the data must remain unchanged. For example, a desensitized identity card number should still have the form of an identity card number and pass its validity rules, and the information it encodes should be a meaningful area code and a meaningful birthday rather than random digits.

Relevance: for structured data, especially data sets with very complex relationships between data elements, a field often corresponds to another field in the same table. This correspondence should generally not be destroyed by desensitization; otherwise the field loses its value. The relevance requirement is typically high when the data serves as a reference for statistics.

Reversibility: in typical desensitization scenarios, desensitized data can never be restored to the original business data, and most desensitization products on the market are designed this way. However, as big data analysis becomes more popular and third-party business intelligence, precision marketing, and similar services gain acceptance, business departments often need to restore desensitized data to the original business data for follow-up work. For example, a telecommunications company may desensitize its business data and send it to a third party for user behavior analysis; once the results are complete, the company restores the desensitized data so that it can build accurate user profiles and run precision marketing campaigns. In such cases the user requires desensitization to be reversible.

Repeatability: in some business scenarios desensitization must be a repeatable process. Whether the same data is desensitized several times or on different test systems, each run must produce the same result, so that desensitized data can still be correlated in special situations such as incremental loads. In other scenarios, for confidentiality, the result for the same data field (such as an identity card number or a credit card number) need not be the same each time, which keeps an attacker who has collected a large amount of desensitized data from reverse-engineering the original business data. A desensitization product should therefore let the user choose, when configuring the policy, whether results are repeatable for a given data type and scenario.

Timeliness: in some service requirements and application scenarios, especially dynamic desensitization, users place high demands on how quickly desensitization completes, and desensitized data may no longer be worth analyzing or mining after a certain time. Desensitization in such scenarios should avoid time-consuming algorithms, such as encryption algorithms, as far as possible.

Security: for some highly sensitive data, the security requirement is often high, and the user's other requirements are subordinate to it. In that case an irreversible algorithm, or another algorithm that guarantees no information is leaked, should be selected.
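The difference between repeatable and non-repeatable desensitization described above can be illustrated with a short sketch. It assumes a keyed hash (HMAC-SHA-256) for the repeatable case and a random token for the non-repeatable case; neither technique is named in the source, and the function names and the token length are hypothetical.

```python
import hashlib
import hmac
import secrets


def repeatable_token(value: str, key: bytes, length: int = 12) -> str:
    """Same value and same key always give the same token (repeatable desensitization)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def non_repeatable_token(length: int = 12) -> str:
    """A fresh random token on every call (non-repeatable desensitization)."""
    return secrets.token_hex(length // 2)


key = b"per-project secret"  # hypothetical key; in practice it would be stored securely
first = repeatable_token("13812345678", key)
second = repeatable_token("13812345678", key)
print(first == second)  # True: incremental loads can still be joined on the token
```

With the keyed variant, two test systems sharing the key produce matching tokens for the same input, while the random variant gives an attacker no stable mapping to collect.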

Through research, using mathematical methods, physical methods, and the like, the quantitative relation between each desensitization algorithm and each desensitization effect can be determined. The relation may grade each algorithm as high, medium, or low for each effect, or it may assign a specific numerical value to each algorithm for each effect.
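One way to hold such a quantitative relation is a table keyed by algorithm and then by effect. The sketch below assumes the high/medium/low form and converts it to numbers; every rating shown is an illustrative placeholder rather than a value taken from the source, and the names `RATINGS`, `LEVELS`, and `to_numeric` are hypothetical.

```python
EFFECTS = ("validity", "relevance", "reversibility",
           "repeatability", "timeliness", "security")
LEVELS = {"low": 1, "medium": 2, "high": 3}

# Placeholder ratings for three of the algorithms; real values would come from study.
RATINGS = {
    "replacement": {
        "validity": "high", "relevance": "low", "reversibility": "low",
        "repeatability": "medium", "timeliness": "medium", "security": "high",
    },
    "invalidation": {
        "validity": "low", "relevance": "low", "reversibility": "low",
        "repeatability": "high", "timeliness": "high", "security": "high",
    },
    "symmetric_encryption": {
        "validity": "medium", "relevance": "medium", "reversibility": "high",
        "repeatability": "high", "timeliness": "low", "security": "medium",
    },
}


def to_numeric(ratings):
    """Check that every algorithm is rated for every effect, then map levels to numbers."""
    table = {}
    for algorithm, by_effect in ratings.items():
        missing = [effect for effect in EFFECTS if effect not in by_effect]
        if missing:
            raise ValueError(f"{algorithm} has no rating for: {missing}")
        table[algorithm] = {effect: LEVELS[by_effect[effect]] for effect in EFFECTS}
    return table


numeric = to_numeric(RATINGS)
print(numeric["symmetric_encryption"]["reversibility"])  # 3
```

The completeness check reflects the requirement that each algorithm has a value for each effect.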

The desensitization server 120 is further configured to receive a desensitization instruction sent by the client device 110, read original data from the data source server 130 through the communication unit according to the instruction, and store the original data in the storage unit. The desensitization instruction includes a sensitive data type, at least one desensitization effect, and a priority ordering of the at least one desensitization effect. Sensitive data typically comprises customers' personal private data and some key sensitive business data, for example: name: customer name, etc.; address: home address, company address, etc.; mailbox: corporate mailbox, ordinary mailbox, etc.; telephone: mobile phone, landline, etc.; certificate: identity card, passport, military officer card, etc.; account number: bank card, customer number, tax registration number, organization code, business license number, etc.; zip code: company zip code, home zip code, etc.; date: birthday, etc. The sensitive data types may therefore be name, address, mailbox, telephone, certificate, account number, zip code, date, and so on. The client device 110 may send the type of sensitive data to be desensitized together with at least one of the effects validity, relevance, reversibility, repeatability, timeliness, and security, and rank the effects it sends by priority.
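The content of a desensitization instruction, as described above, can be modeled as a small validated record. This is a sketch under assumptions: the class name, field names, and the choice to reject duplicate effects are hypothetical, and the source does not specify a wire format.

```python
from dataclasses import dataclass
from typing import Tuple

SENSITIVE_TYPES = {"name", "address", "mailbox", "telephone",
                   "certificate", "account number", "zip code", "date"}
EFFECTS = {"validity", "relevance", "reversibility",
           "repeatability", "timeliness", "security"}


@dataclass(frozen=True)
class DesensitizationInstruction:
    data_type: str
    effects_by_priority: Tuple[str, ...]  # highest priority first

    def __post_init__(self):
        if self.data_type not in SENSITIVE_TYPES:
            raise ValueError(f"unknown sensitive data type: {self.data_type}")
        if not self.effects_by_priority:
            raise ValueError("at least one desensitization effect is required")
        unknown = set(self.effects_by_priority) - EFFECTS
        if unknown:
            raise ValueError(f"unknown effects: {sorted(unknown)}")
        if len(set(self.effects_by_priority)) != len(self.effects_by_priority):
            raise ValueError("each effect may appear only once in the ordering")


instruction = DesensitizationInstruction("certificate", ("security", "validity"))
print(instruction.effects_by_priority[0])  # security
```

Using an ordered tuple lets the position of each effect carry its priority, so no separate rank field is needed.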

The desensitization server 120 is further configured to construct a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the desensitization instruction received from the client device 110, and to form a decision tree from it.
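The source does not say how the training set is labeled or how the tree is induced, so the following is only one possible reading. It assumes that each instruction is labeled with the algorithm that scores highest under rank weights (the first effect weighs most) and that the tree is grown ID3-style on two categorical features. The scores, the sample instructions, and all names are hypothetical.

```python
import math
from collections import Counter

# Placeholder numeric relation (algorithm -> effect -> score), for illustration only.
SCORES = {
    "invalidation": {"validity": 1, "reversibility": 1, "security": 3},
    "symmetric_encryption": {"validity": 2, "reversibility": 3, "security": 1},
    "replacement": {"validity": 3, "reversibility": 1, "security": 2},
}


def best_algorithm(priorities):
    """Weight effects by rank (n for the first, 1 for the last) and pick the top scorer."""
    n = len(priorities)
    def score(algorithm):
        return sum((n - i) * SCORES[algorithm][e] for i, e in enumerate(priorities))
    return max(SCORES, key=score)


def make_training_set(instructions):
    """Turn (data_type, prioritized effects) pairs into (features, label) samples."""
    return [({"data_type": t, "first_effect": p[0]}, best_algorithm(p))
            for t, p in instructions]


def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def build_tree(samples, features):
    """ID3-style induction: split on the feature with the lowest weighted entropy."""
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    def split_entropy(feature):
        groups = {}
        for row, label in samples:
            groups.setdefault(row[feature], []).append(label)
        return sum(len(g) / len(samples) * entropy(g) for g in groups.values())
    best = min(features, key=split_entropy)
    rest = [f for f in features if f != best]
    subsets = {}
    for row, label in samples:
        subsets.setdefault(row[best], []).append((row, label))
    return {"feature": best, "default": majority,
            "branches": {v: build_tree(s, rest) for v, s in subsets.items()}}


def predict(tree, row):
    """Walk the tree; fall back to the node's majority label for unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree


INSTRUCTIONS = [
    ("certificate", ["security", "reversibility"]),
    ("account number", ["reversibility", "security"]),
    ("name", ["validity", "security"]),
    ("telephone", ["validity"]),
]
tree = build_tree(make_training_set(INSTRUCTIONS), ["first_effect", "data_type"])
print(predict(tree, {"data_type": "address", "first_effect": "reversibility"}))
```

Once trained on a user's past instructions, such a tree can suggest an algorithm for a data type the user has not configured yet, which is one way to read the "automatic configuration" described in the text.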

The desensitization server 120 may further locate the sensitive data in the original data using a sensitive data identification method (for example manual configuration matching, regular expression matching, or other intelligent identification algorithms), determine the type of the sensitive data, select a desensitization algorithm for it using the decision tree, generate replacement data according to the selected algorithm, replace the sensitive data in the original data with the corresponding replacement data to produce desensitized data, and send the desensitized data to the client device 110.
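The locate, classify, and replace sequence above can be sketched with regular expression matching, one of the identification methods the text mentions. The patterns, the two masking functions, and the `CHOSEN` table (which stands in for the decision tree's selection) are all assumptions made for illustration.

```python
import re

# Simplified patterns, for illustration only: an 11-digit mobile number starting
# with 1, and a basic e-mail address.
PATTERNS = {
    "telephone": re.compile(r"\b1\d{10}\b"),
    "mailbox": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def mask_middle(value: str) -> str:
    """Keep the first 3 and last 4 characters and mask the rest."""
    if len(value) <= 7:
        return "*" * len(value)
    return value[:3] + "*" * (len(value) - 7) + value[-4:]


def mask_local_part(value: str) -> str:
    """Keep the first character of the mailbox name and the full domain."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


# Stand-in for the algorithm the decision tree would select for each data type.
CHOSEN = {"telephone": mask_middle, "mailbox": mask_local_part}


def desensitize(text: str):
    """Return the desensitized text and the (type, original value) pairs that were found."""
    found = []
    for data_type, pattern in PATTERNS.items():
        def replace(match, data_type=data_type):
            found.append((data_type, match.group(0)))
            return CHOSEN[data_type](match.group(0))
        text = pattern.sub(replace, text)
    return text, found


result, hits = desensitize("Call 13812345678 or mail zhang.san@example.com.")
print(result)  # Call 138****5678 or mail z***@example.com.
```

Returning the located values alongside the result keeps the identification step auditable; a real deployment would not log the original values in clear text.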

Fig. 2 is a schematic block diagram of a desensitization server according to an embodiment of the present invention. As shown in Fig. 2, the desensitization server 120 includes a setting module 160, a decision tree forming module 170, and a desensitization processing module 180. The setting module 160 is configured to add a plurality of desensitization algorithms and set a one-to-one quantitative relation between each desensitization algorithm and each of a plurality of desensitization effects. The decision tree forming module 170 is configured to construct a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the received desensitization instruction, and to form a decision tree from it. The desensitization processing module 180 is configured to locate the sensitive data in the original data and determine its type, select a desensitization algorithm for the sensitive data using the decision tree, generate replacement data according to the selected algorithm, and replace the sensitive data in the original data with the corresponding replacement data to produce desensitized data.

Fig. 3 is a schematic flow chart of an adaptive desensitization method 200 for sensitive data according to an embodiment of the present invention. As shown in Fig. 3, in step 201, a plurality of desensitization algorithms are added to the desensitization server 120, and a one-to-one quantitative relation between each desensitization algorithm and each of a plurality of desensitization effects is set. In step 202, the desensitization server 120 receives a desensitization instruction sent by the client device 110 and reads original data from the data source server 130 according to the instruction, where the instruction includes a sensitive data type, at least one desensitization effect, and a priority ordering of the at least one desensitization effect. In step 203, the desensitization server 120 constructs a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the received instruction, and forms a decision tree from it. In step 204, the desensitization server 120 locates the sensitive data in the original data and determines its type, selects a desensitization algorithm for the sensitive data using the decision tree, generates replacement data according to the selected algorithm, and replaces the sensitive data in the original data with the corresponding replacement data to produce desensitized data. In step 205, the desensitization server 120 sends the desensitized data to the client device 110.

To implement their functions, the client device 110, the desensitization server 120, and the data source server 130 may each be implemented using the device 300. Fig. 4 is a schematic block diagram of a device 300 for implementing an embodiment of the present invention. As shown in Fig. 4, the device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 302 or loaded from a storage unit 308 into a random access memory (RAM) 303. The RAM 303 can also store various programs and data required for the operation of the device 300. The CPU 301, the ROM 302, and the RAM 303 are connected to one another via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.

Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

In the method 200 above, in step 201, a plurality of desensitization algorithms may be added through the input unit 306 of the desensitization server 120, and the one-to-one quantitative relation between each desensitization algorithm and each of the plurality of desensitization effects is set and stored in the read-only memory (ROM) 302 or the random access memory (RAM) 303. In step 202, the desensitization server 120 connects through its communication unit 309 to the communication unit 309 of the client device 110 and receives the desensitization instruction sent by the client device 110; it likewise connects through its communication unit 309 to the communication unit 309 of the data source server 130 to obtain the original data from the data source server 130. In step 203, the CPU 301 of the desensitization server 120 constructs a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the received desensitization instruction, and forms a decision tree from it. In step 204, the CPU 301 of the desensitization server 120 locates the sensitive data in the original data and determines its type, selects a desensitization algorithm for the sensitive data using the decision tree, generates replacement data according to the selected algorithm, and replaces the sensitive data in the original data with the corresponding replacement data to produce desensitized data. In step 205, the communication unit 309 of the desensitization server 120 connects to the communication unit 309 of the client device 110 and sends the desensitized data to the client device 110, where it may be stored in the ROM 302 or the RAM 303 of the client device 110 and viewed.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method of adaptive desensitization of sensitive data, comprising the steps of:

2. The desensitization method according to claim 1, wherein the sensitive data types comprise name, address, mailbox, telephone, certificate, account number, zip code, and date.

3. The desensitization method according to claim 2, wherein the plurality of desensitization algorithms comprises several of the following: replacement, invalidation, shuffling (out-of-order), averaging, anti-association, offset, symmetric encryption, and dynamic environment control algorithms.

4. The desensitization method according to claim 3, wherein the at least one desensitization effect includes at least one of validity, relevance, reversibility, repeatability, timeliness, and security.

5. An adaptive desensitization system for sensitive data, comprising: a client device for sending desensitization instructions and receiving desensitized data; a data source server for storing original data; and a desensitization server, configured to add a plurality of desensitization algorithms and set a one-to-one quantitative relation between each desensitization algorithm and each of a plurality of desensitization effects; further configured to receive a desensitization instruction sent by the client device and read original data from the data source server according to the instruction, where the instruction includes a sensitive data type, at least one desensitization effect, and a priority ordering of the at least one desensitization effect; further configured to construct a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the received instruction, and form a decision tree from it; and further configured to locate the sensitive data in the original data, determine its type, select a desensitization algorithm for the sensitive data using the decision tree, generate replacement data according to the selected algorithm, replace the sensitive data in the original data with the corresponding replacement data to produce desensitized data, and send the desensitized data to the client device.

6. The desensitization system according to claim 5, wherein the desensitization server comprises a setting module, a decision tree forming module, and a desensitization processing module; the setting module is configured to add a plurality of desensitization algorithms and set a one-to-one quantitative relation between each desensitization algorithm and each of the plurality of desensitization effects; the decision tree forming module is configured to construct a training set of the user's desensitization effect preferences for different sensitive data types, according to the set quantitative relation and the priority ordering contained in the received desensitization instruction, and form a decision tree from it; the desensitization processing module is configured to locate the sensitive data in the original data and determine its type, select a desensitization algorithm for the sensitive data using the decision tree, generate replacement data according to the selected algorithm, and replace the sensitive data in the original data with the corresponding replacement data to produce desensitized data.