CN114003937A

CN114003937A - Data desensitization method based on characteristic rule desensitization segment

Info

Publication number: CN114003937A
Application number: CN202111315711.4A
Authority: CN
Inventors: 陈长辉; 黄有福; 林勤; 钟煜明
Original assignee: Guangzhou Panyu Polytechnic
Current assignee: Guangzhou Panyu Polytechnic
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-02-01

Abstract

The invention discloses a data desensitization method based on characteristic-rule desensitization segments. The method comprises selecting target data containing sensitive data according to a requesting party, obtaining the characteristic rule of the sensitive data, dividing the sensitive positions based on the characteristic rule to obtain segments to be desensitized, and desensitizing those segments. The data in each desensitization segment is transformed according to the desensitization rule rather than hidden or masked into valueless data, so that the desensitized data retains its data characteristics and application value, leakage of sensitive data caused by desensitization failure is prevented, and information security is improved.

Description

Data desensitization method based on characteristic rule desensitization segment

Technical Field

The invention relates to the technical field of information security, and in particular to a method for desensitizing sensitive data based on characteristic rules so that the desensitized data retains its application value.

Background

With the rapid development of internet technology, large amounts of information and data are stored in government and enterprise databases. Once information such as identity numbers, mobile phone numbers, bank card numbers, or client financial data is leaked, it can cause a severe crisis of trust and economic loss for the data custodians. Desensitizing sensitive data is an effective means of preventing such leakage.

Data desensitization refers to transforming certain sensitive information through desensitization rules so as to reliably protect sensitive private data. That is, where client security data or commercially sensitive data is involved, the real data is transformed, without violating system rules, before being provided for testing, so that the desensitized real data set can be used safely in development, testing, other non-production environments, and outsourcing environments.

Depending on the application scenario, data desensitization is divided into static data desensitization and dynamic data desensitization. Static data desensitization is suitable for scenarios in which data is extracted from the production environment, desensitized, and then distributed for testing, development, training, data analysis, and the like. Dynamic data desensitization is suitable for desensitizing the results of queries and calls on sensitive data in real time without leaving the production environment.

In the past, when data volumes were small and application scenarios were simpler, most enterprises wrote their own desensitization scripts to mask sensitive data. As the number of applications and the volume of data have grown, a number of static data desensitization techniques and methods have emerged.

However, the prior art does not identify private data types that are missed through omission, negligence, or carelessness, so sensitive data is leaked. In addition, existing desensitization algorithms are simple and cannot guarantee the application value of the desensitized data. Desensitization is not merely data transformation; it must also preserve the value of the transformed data for testing, analysis, and similar uses.

For example, in a conventional data desensitization mode, the server desensitizes all data, and when other business systems access the sensitive data, they read the desensitized target data from local storage through an interface. As a result, the desensitized data stays on the server for a long time and occupies a large amount of server storage.

As another example, conventional desensitization replaces the sensitive field with a special symbol. For instance, the identification number 111222197701013334 becomes 111222***********4 after desensitization. This hides the sensitive information but loses the characteristics of the data itself, namely that the desensitized portion consists of digits that carry specific meaning.

For example, an identification number consists of a region address code (6 digits), a birth date (8 digits), a sequence code (3 digits), and a check code (1 digit). In the desensitized data, however, the 8-digit birth date and the 3-digit sequence code are replaced by the same symbol, and the distinct characteristics of the birth date and the sequence code are ignored, which reduces the application value of the desensitized data.

In conclusion, if the desensitized data completely loses its application value, the desensitization technique is meaningless.

Disclosure of Invention

To solve the problems in the prior art, the invention aims to provide a data desensitization method based on characteristic-rule desensitization segments which ensures that the desensitized data still has its data characteristics, and therefore its application value, and which avoids leakage of sensitive data due to desensitization failure.

To overcome the defects of the prior art, the invention adopts the following technical scheme: a data desensitization method based on characteristic-rule desensitization segments, comprising: S01, selecting target data containing sensitive data according to the requesting party; S02, acquiring the characteristic rule of the sensitive data; S03, dividing the sensitive positions based on the characteristic rule to obtain segments to be desensitized; and S04, desensitizing the desensitization segments.

Further, the step of desensitizing the desensitization segments is: sequentially selecting desensitization segments; determining the valid value range and the potential rule of each digit in the desensitization segment according to the characteristic rule and the position to be desensitized; generating desensitization segment vectors; randomly selecting a prime number P; and, for each digit x of the original data, calculating a desensitization value Y_i according to the desensitization formula:

$$Y_i = Y_{i-1} + \sum_j (x - t_j - P) \bmod N$$

In the formula, i is the number of desensitization rounds, j indexes the data in the valid range corresponding to the position of x, t_j is the j-th data item in that valid range, mod is the remainder function, and N is the total number of digits to be desensitized in the desensitization segment.

Further, the desensitized values are screened against the valid value range using a similarity calculation method to obtain processed desensitized data, which is compared with the original data. If the two are the same, the desensitization calculation and screening procedure is executed again; if they differ, desensitization is complete.

Further, after desensitization, the feedback party obtains the display value of the sensitive data and compares it with the desensitized data. If the two are the same, confirmation information is sent to the requesting party and the desensitized data is then displayed; if they differ, the desensitization procedure is executed again.

Further, the target data includes an identification number, a unified social credit code, a phone number, an address, a license plate number, a bank account number, or a social account number.

By implementing the technical scheme of the invention, the segments to be desensitized are obtained by dividing the sensitive positions of the sensitive data based on the characteristic rule, and the data in each desensitization segment is only transformed according to the desensitization rule rather than hidden or masked into valueless data, so the desensitized data retains its data characteristics and application value. In addition, after the desensitized data is generated, a desensitization confirmation step is carried out before the data is displayed, which prevents sensitive data from being leaked due to desensitization failure and improves information security.

Detailed Description

The invention relates to a data desensitization method based on characteristic-rule desensitization segments, which mainly comprises selecting target data containing sensitive data according to a requesting party, obtaining the characteristic rule of the sensitive data, dividing the sensitive positions based on the characteristic rule to obtain segments to be desensitized, and desensitizing the desensitization segments. The desensitization steps are described below using an identity card number as the example.

Step 1, acquiring a data access request.

The data access request in this step is used to indicate that the requester requests to access the target data in the data server. Therefore, the data access request can be initiated by the data access terminal to the server, or initiated by a specific person (such as an operation and maintenance person), or initiated by a specific system (such as a business system).

For example, an application running on the terminal needs to access target data in the server, and the application sends a data access request to the server through a communication module of the terminal. The application needs to present the target data, but the target data contains sensitive data that must be hidden. The server therefore cannot show the application the full plaintext of the target data; instead, it desensitizes the target data and feeds the desensitized data back to the application on the terminal.

As another example, operation and maintenance personnel need to maintain table structures, perform system tuning, and the like. They can send the server, through a terminal, a request for target data containing a table. The table contains sensitive data that needs to be hidden, and what the operation and maintenance personnel care about is the table structure rather than its contents. They should therefore be prevented from retrieving or exporting the real data. Accordingly, after desensitizing the target data, the server feeds the desensitized data back to the application on the terminal.

As a further example, when other business systems exchange data with the business system, they can send it a data access request. When the target data they access contains private data, the exchanged data needs to be desensitized, and the desensitized data is fed back to the other business systems.

Step 2, selecting target data according to the data access request, and judging whether the target data includes sensitive data.

Take as an example a data access request in the form of an SQL statement, such as SELECT Name, IDCard FROM Persons. By parsing the SQL statement, the target data can be obtained, namely the Name column and the IDCard column in the table named Persons. Parsing SQL statements is not the focus of the present disclosure; existing SQL parsing methods may be used, and the details are omitted here.
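As an illustration of how the target data might be read out of such a statement, the following minimal Python sketch handles only the plain `SELECT <columns> FROM <table>` form with a regular expression. The function name `parse_select` is hypothetical and not part of the disclosure, and a real system would rely on an existing SQL parser, as noted above.

```python
import re

# Matches only the simplest form: SELECT col1, col2 FROM table
_SELECT_RE = re.compile(
    r"^\s*SELECT\s+(.+?)\s+FROM\s+(\w+)\s*;?\s*$", re.IGNORECASE
)


def parse_select(sql: str) -> tuple[str, list[str]]:
    """Return (table, columns) for a plain single-table SELECT statement."""
    match = _SELECT_RE.match(sql)
    if match is None:
        raise ValueError(f"unsupported statement: {sql!r}")
    columns = [col.strip() for col in match.group(1).split(",")]
    return match.group(2), columns


table, columns = parse_select("SELECT Name, IDCard FROM Persons")
print(table, columns)
```

Joins, aliases, subqueries, and quoted identifiers are deliberately out of scope for this sketch.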

After the database is created, sensitive data is labeled to indicate which columns of data in the table are sensitive and require desensitization. After the target data is obtained, the label is used to judge whether the target data is sensitive data.

For example, a user creates a table as follows:

CREATE TABLE Persons
(
    Id_P int,
    LastName varchar(255),
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255),
    IDCard nvarchar(20)
)

After creation, the user sets the data recorded in the Address column and the IDCard column of the Persons table as sensitive data. Then, when the target data is the IDCard column of the Persons table, it can be determined to be sensitive data.

In addition, a sensitivity level can be set when labeling sensitive data. Sensitive data differs in how sensitive it is: for example, absolutely private business secrets versus data whose leakage would have little impact. The sensitivity level describes this degree. For example, the sensitivity of the data recorded in the IDCard column is 3 and that of the data recorded in the Address column is 2; the larger the value, the more sensitive the data. If the user has defined a sensitivity, it is also recorded in this step.
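The labeling described above can be pictured as a lookup from column to sensitivity level. The Python sketch below is one possible representation, not the disclosed implementation: the names `SENSITIVITY_LABELS`, `sensitivity_of`, and `is_sensitive` are hypothetical, and treating unlabeled columns as level 0 is an assumption made here for illustration.

```python
# Hypothetical label store: (table, column) -> sensitivity level.
# The two entries mirror the example in the text (IDCard = 3, Address = 2).
SENSITIVITY_LABELS = {
    ("Persons", "IDCard"): 3,
    ("Persons", "Address"): 2,
}


def sensitivity_of(table: str, column: str) -> int:
    """Return the labeled sensitivity; 0 (an assumption) means unlabeled."""
    return SENSITIVITY_LABELS.get((table, column), 0)


def is_sensitive(table: str, column: str) -> bool:
    """A column counts as sensitive if it carries any label."""
    return sensitivity_of(table, column) > 0


print(is_sensitive("Persons", "IDCard"), sensitivity_of("Persons", "Address"))
```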

Step 3, if the target data does not include sensitive data, feeding the target data back to the requester; if the target data includes sensitive data, performing characteristic desensitization processing on the sensitive data.

This proposal provides a characteristic desensitization scheme in which the desensitized data retains the characteristics of the data and therefore its application value.

1. Obtain the characteristic rule of the sensitive data. For example, the characteristic rule of an identity card number is: region address code (digits 1 to 6), birth date (digits 7 to 14), sequence code (digits 15 to 17), and check code (digit 18). As another example, the characteristic rule of a mobile phone number is: network identification number (digits 1 to 3), area code (digits 4 to 7), and subscriber number (digits 8 to 11).
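A characteristic rule of this kind can be written down as a list of named position ranges. The Python sketch below is an illustrative encoding only: the field names and the helper `split_by_rule` are hypothetical, and positions are 1-based and inclusive to match the text.

```python
# Characteristic rules as (field name, start, end), 1-based inclusive positions.
ID_CARD_RULE = [
    ("region_address_code", 1, 6),
    ("birth_date", 7, 14),
    ("sequence_code", 15, 17),
    ("check_code", 18, 18),
]
MOBILE_RULE = [
    ("network_identification_number", 1, 3),
    ("area_code", 4, 7),
    ("subscriber_number", 8, 11),
]


def split_by_rule(value: str, rule) -> dict[str, str]:
    """Split a value into named fields according to a characteristic rule."""
    expected_length = rule[-1][2]
    if len(value) != expected_length:
        raise ValueError(f"expected {expected_length} characters, got {len(value)}")
    return {name: value[start - 1:end] for name, start, end in rule}


# The identification number used as the running example in this document.
fields = split_by_rule("111222197701013334", ID_CARD_RULE)
print(fields)
```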

2. Acquire the sensitive positions of the sensitive data. Not all of the sensitive data needs to be desensitized. For example, only digits 7 to 17 of an identification number need to be desensitized, so digits 7 to 17 are the sensitive positions.

3. Divide the sensitive positions based on the characteristic rule to obtain desensitization segments.

Even within digits 7 to 17, the desensitization scheme may differ, because digits 7 to 14 have a different meaning from digits 15 to 17, so the two groups have different valid value ranges.

For example, digits 7 to 14 represent the date of birth, and digit 7 can only be 1 or 2. If it is a special symbol (e.g., *) or one of 0, 3, 4, 5, 6, 7, 8, 9, the number is clearly false and has lost its property of being the first digit of a year.

For a year such as 1977, ensuring only that the desensitized value of each of the digits 1, 9, 7, 7 differs from the original does not ensure that the four digits taken together represent a valid year. For example, if 1 becomes 2, 9 becomes 0, and 7 becomes 6, the desensitized year is 2066, which is clearly not a valid year, so the desensitized data has still lost its value.

In this proposal, the sensitive data is divided into desensitization segments according to the characteristic rule. For example, in an identity card number, digits 7 to 14 form one segment and digits 15 to 17 form another. Alternatively, digits 7 to 10 form one segment, digits 11 to 12 another, digits 13 to 14 another, and digits 15 to 17 another.
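The division into segments amounts to cutting the sensitive range at the field boundaries given by the characteristic rule. A minimal Python sketch follows, assuming 1-based inclusive positions; the function `divide_into_segments` and the two boundary lists are hypothetical names chosen to reproduce the two divisions mentioned above.

```python
def divide_into_segments(sensitive_start, sensitive_end, boundaries):
    """Cut the sensitive range [sensitive_start, sensitive_end] at field boundaries.

    `boundaries` lists the (start, end) of each field in the characteristic rule.
    Fields outside the sensitive range contribute no segment.
    """
    segments = []
    for start, end in boundaries:
        low, high = max(start, sensitive_start), min(end, sensitive_end)
        if low <= high:
            segments.append((low, high))
    return segments


# Coarse rule: birth date as one field. Fine rule: year, month, day separately.
COARSE = [(1, 6), (7, 14), (15, 17), (18, 18)]
FINE = [(1, 6), (7, 10), (11, 12), (13, 14), (15, 17), (18, 18)]

print(divide_into_segments(7, 17, COARSE))
print(divide_into_segments(7, 17, FINE))
```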

4. Desensitize the desensitization segments.

(1) Sequentially select a desensitization segment.

(2) Determine the valid value range and the potential rule of each digit in the desensitization segment according to the characteristic rule and the sensitive positions of the data to be desensitized.

For example, if the desensitization segment is the year, the valid range of the first digit is 1 and 2, that of the second digit is 0 and 9, and that of the third and fourth digits is 0 to 9. The potential rule is that when the first digit is 1, the second digit is 9, and when the first digit is 2, the second digit is 0.
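The valid ranges and the potential rule for the year segment can be expressed as plain data plus a small check. The Python sketch below is illustrative only; `YEAR_VALID_RANGES`, `YEAR_POTENTIAL_RULE`, and `is_valid_year_segment` are hypothetical names, and the check encodes exactly the ranges and the rule stated above, nothing more.

```python
# Valid range of each digit in the year segment (index 0 is the first digit).
YEAR_VALID_RANGES = [{1, 2}, {0, 9}, set(range(10)), set(range(10))]

# Potential rule: the first digit constrains the second digit.
YEAR_POTENTIAL_RULE = {1: {9}, 2: {0}}


def is_valid_year_segment(segment: str) -> bool:
    """Check a 4-digit year against the valid ranges and the potential rule."""
    if len(segment) != 4 or not (segment.isascii() and segment.isdigit()):
        return False
    digits = [int(c) for c in segment]
    for digit, allowed in zip(digits, YEAR_VALID_RANGES):
        if digit not in allowed:
            return False
    return digits[1] in YEAR_POTENTIAL_RULE[digits[0]]


print(is_valid_year_segment("1977"), is_valid_year_segment("1077"))
```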

(3) Generate the desensitization segment vector (original data, set of valid ranges, potential rules).

If the sender is a person, the sender's identifier may be the person's password; if the sender is a device, it may be the device's MAC address, IP address, or the like, that is, any identifier capable of uniquely identifying the device.

(4) Randomly select a prime number P. For each digit of the original data (denoted x), the desensitized value is Y_i, and Y_i has N digits.

$$Y_i = Y_{i-1} + \sum_j (x - t_j - P) \bmod N$$

In the desensitization formula, i is the number of desensitization rounds, a preset value such as 2 or 3; j indexes the data in the valid range corresponding to the position of x; t_j is the j-th data item in that valid range; mod is the remainder function; and N is the total number of digits to be desensitized in the desensitization segment.

If Y_i has fewer than N digits, it is padded with zeros in the high-order positions.

Again taking the year in the identification number as an example, applying the desensitization formula to the year value "1977" yields a 4-digit desensitization value, denoted ABCD.
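To make the recurrence concrete, the Python sketch below evaluates it for a single digit x under several assumptions that the text does not state: the initial value Y_0 is taken to be 0, the modulus is applied to the whole sum over j, and the result is zero-padded to N digits as described above. How the per-digit values are combined into the 4-digit value ABCD is not specified in the text and is not attempted here. The function name `desensitize_digit` is hypothetical.

```python
def desensitize_digit(x: int, valid_range: list[int], p: int, n: int,
                      rounds: int = 2) -> str:
    """Evaluate Y_i = Y_{i-1} + (sum_j (x - t_j - P)) mod N for one digit x.

    Assumptions (not stated in the source): Y_0 = 0, and the modulus applies
    to the whole sum. Python's % returns a non-negative result for n > 0.
    """
    y = 0  # assumed initial value Y_0
    for _ in range(rounds):
        y += sum(x - t - p for t in valid_range) % n
    return str(y).zfill(n)  # pad with leading zeros to N digits


# First digit of the year "1977": x = 1, valid range {1, 2}, P = 7, N = 4.
print(desensitize_digit(1, [1, 2], p=7, n=4, rounds=2))
```

With these inputs the sum is (1 - 1 - 7) + (1 - 2 - 7) = -15, and -15 mod 4 = 1, so two rounds give 2, padded to "0002".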

(5) Starting from the most significant digit, check each digit in turn to determine whether it falls within its valid range. If it does, check the next most significant digit. If it does not (for example, the digit A), calculate the similarity between A and each value in the valid range, and select the value with the highest similarity as the value of that digit. The similarity is calculated using cosine similarity, which is an existing method and is not described here.

For digits other than the most significant one, the potential rule is taken into account when determining whether the digit falls within its valid range. That is, the valid range is first filtered according to the potential rule, and the digit is then checked against the filtered valid range.

For example, if the most significant digit after desensitization is 2, the valid range of the second digit would be 0 and 9; but under the potential rule, when the first digit is 2 the second digit must be 0, so the valid range of the second digit is filtered down to 0. It is then sufficient to check whether the second digit after desensitization is 0.
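The screening in sub-step (5) can be sketched for the year segment as follows. This Python example is a simplification under stated assumptions: it substitutes smallest absolute difference (ties broken toward the smaller value) for the cosine similarity named in the text, since the text does not say how single digits are turned into vectors. The names `closest` and `screen_year` are hypothetical.

```python
FIRST_DIGIT_RANGE = [1, 2]
SECOND_DIGIT_RANGE = [0, 9]
POTENTIAL_RULE = {1: {9}, 2: {0}}  # first digit -> allowed second digits


def closest(value: int, allowed: list[int]) -> int:
    """Keep the value if allowed; otherwise pick the nearest allowed value.

    Stand-in for the similarity step: absolute difference, not cosine similarity.
    """
    if value in allowed:
        return value
    return min(allowed, key=lambda t: (abs(t - value), t))


def screen_year(candidate: str) -> str:
    """Screen a 4-digit candidate from the most significant digit downward."""
    digits = [int(c) for c in candidate]
    first = closest(digits[0], FIRST_DIGIT_RANGE)
    # Filter the second digit's valid range by the potential rule first.
    second_allowed = [d for d in SECOND_DIGIT_RANGE if d in POTENTIAL_RULE[first]]
    second = closest(digits[1], second_allowed)
    # Third and fourth digits accept 0-9, so any digit passes unchanged.
    return f"{first}{second}{digits[2]}{digits[3]}"


print(screen_year("0912"))
```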

(6) After the processed desensitized data is obtained, compare it with the original data. If they are the same, execute sub-steps (4) and (5) again until the processed desensitized data differs from the original data. If they differ, take the processed desensitized data as the final desensitized data.
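The retry logic of sub-step (6) is a loop that draws a new prime and recomputes until the result differs from the original. In the Python sketch below, `toy_transform` is a hypothetical stand-in for sub-steps (4) and (5), not the disclosed formula; it was chosen because some primes (for example 11) map a segment back onto itself, which makes the retry visible. The list `PRIMES` and the `max_attempts` cap are likewise assumptions.

```python
import random

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # hypothetical candidate pool


def toy_transform(segment: str, p: int) -> str:
    """Stand-in for sub-steps (4) and (5): multiply each digit by p modulo 10."""
    return "".join(str((int(c) * p) % 10) for c in segment)


def desensitize_until_different(original: str, rng: random.Random,
                                max_attempts: int = 100) -> str:
    """Repeat with a fresh random prime until the result differs from the original."""
    for _ in range(max_attempts):
        p = rng.choice(PRIMES)
        candidate = toy_transform(original, p)
        if candidate != original:
            return candidate
    raise RuntimeError("no differing desensitized value found")


result = desensitize_until_different("1977", random.Random(0))
print(result)
```

The attempt cap guards against inputs for which the stand-in transform can never produce a different value; the source text does not describe such a limit.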

This proposal uses numbers as the example. Other, non-numeric forms such as characters can be processed by treating their binary values as numbers, which is not described further here.

Step 4, feeding back the final desensitized data to the sender of the data access request.

Step 5, the feedback party acquires the desensitized data. The desensitized data is not displayed directly on the feedback party's display screen; instead, the display signal is intercepted when the graphics card outputs the desensitized data, and the display information is fed back, for example as a picture.

Step 6, after the display information is obtained, the display value of the sensitive data in the display information is parsed and compared with the final desensitized data. If they are the same, confirmation information is sent to the sender of the data access request. If they differ, desensitization has failed; termination information is sent to the sender of the data access request, and steps (2) to (6) are re-executed until the two are the same.

Step 7, after the sender of the data access request receives the confirmation information, the display signal of the graphics card is connected to the display screen so that the desensitized data is displayed. After the sender receives the termination information, output of the graphics card's display signal is stopped, and the sender waits for new desensitized data to display.
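The confirmation logic of Steps 6 and 7 reduces to comparing the parsed display value with the final desensitized data and choosing which message to send. The Python sketch below covers only that comparison; it assumes the display value has already been parsed from the intercepted display information (for example from a picture), which the text does not detail. The names `Verdict` and `verify_display` are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    CONFIRM = "confirm"      # display may be connected to the screen
    TERMINATE = "terminate"  # desensitization failed; redo and wait


def verify_display(display_value: str, final_desensitized: str) -> Verdict:
    """Compare what would be shown with what desensitization produced."""
    if display_value == final_desensitized:
        return Verdict.CONFIRM
    return Verdict.TERMINATE


# Matching values are confirmed; a leaked original value triggers termination.
print(verify_display("111222191201013334", "111222191201013334"))
print(verify_display("111222197701013334", "111222191201013334"))
```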

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A data desensitization method based on a characteristic rule desensitization segment is characterized in that:

S01, selecting target data containing sensitive data according to the request party;

S02, acquiring characteristic rules containing sensitive data;

S03, dividing the sensitive positions based on the characteristic rule to obtain sections to be desensitized;

and S04, desensitizing the desensitization section.

2. A method of data desensitization based on property rule desensitization segments according to claim 1, wherein said step of desensitizing the desensitization segments is:

sequentially selecting desensitization segments; determining the valid value range and the potential rule of each digit in the desensitization segment according to the characteristic rule and the position to be desensitized; generating desensitization segment vectors; randomly selecting a prime number P; and, for each digit x of the original data, calculating a desensitization value Y_i according to the desensitization formula:

$$Y_i = Y_{i-1} + \sum_j (x - t_j - P) \bmod N$$

In the formula, i is the number of desensitization rounds, j indexes the data in the valid range corresponding to the position of x, t_j is the j-th data item in that valid range, mod is the remainder function, and N is the total number of digits to be desensitized in the desensitization segment.

3. The data desensitization method based on the characteristic rule desensitization segment according to claim 2, characterized in that a desensitization data value range is screened by adopting a similarity calculation method to obtain processed desensitization data, the processed desensitization data is compared with the original data, if the desensitization data is the same as the original data, the desensitization calculation and screening procedures are re-executed, and if the desensitization data is different from the original data, desensitization is completed.

4. The data desensitization method according to any one of claims 1 to 3, wherein after desensitization, the feedback party obtains the display value of the sensitive data and compares the display value with the desensitization data, if the display value is the same as the desensitization data, the feedback party sends a confirmation message to the requesting party and then converts the desensitization data into the display value, and if the display value is not the same as the desensitization data, the desensitization procedure is executed again.

5. The method of data desensitization based on characteristic rule desensitization segments according to claim 1, wherein the target data includes identification numbers, unified social credit codes, phone numbers, addresses, license plate numbers, bank account numbers, or social account numbers.