CN117272996A

CN117272996A - Data desensitization system

Info

Publication number: CN117272996A
Application number: CN202311569787.9A
Authority: CN
Inventors: 卢国栋; 李静; 宋丙华; 罗倩倩; 王峰
Original assignee: Shandong Wangan Security Technology Co ltd
Current assignee: Shandong Wangan Security Technology Co ltd
Priority date: 2023-11-23
Filing date: 2023-11-23
Publication date: 2023-12-22
Anticipated expiration: 2043-11-23
Also published as: CN117272996B

Abstract

The invention belongs to the field of data processing and discloses a data desensitization system comprising an acquisition module and a word segmentation module. The acquisition module is used for acquiring a text to be desensitized, wherein the text to be desensitized comprises numerical text and non-numerical text. The word segmentation module comprises a dictionary word segmentation unit and a statistical word segmentation unit. The dictionary word segmentation unit is used for segmenting the non-numerical text with an improved dictionary-based word segmentation algorithm, saving the obtained words to a first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set. The statistical word segmentation unit is used for segmenting the text with incomplete word segmentation with a statistics-based word segmentation algorithm and saving the obtained words to a second word segmentation set. The invention maintains word segmentation efficiency, and words that do not exist in the dictionary are segmented by the statistics-based algorithm, which has higher time complexity but does not rely on a dictionary, thereby ensuring the success rate of word segmentation.

Description

Data desensitization system

Technical Field

The invention relates to the field of data processing, in particular to a data desensitizing system.

Background

Data desensitization refers to transforming sensitive information according to desensitization rules so that sensitive and private data are reliably protected. This allows desensitized real data sets to be used securely in development, testing, and other non-production or outsourced environments. Where customer security data or certain business-sensitive data are involved, the real data should be transformed before being provided for test use, without violating system rules.

Data desensitization generally includes two steps: identifying sensitive data, and desensitizing the identified sensitive data. To identify sensitive data in a text, the prior art generally uses keyword identification and regular expression identification, and generally segments the text with a single word segmentation algorithm. If a dictionary-based word segmentation algorithm is used directly, the text is likely to contain words that are not in the dictionary, so a large amount of text remains unsegmented after word segmentation; that is, the success rate of word segmentation is low. The success rate here is the ratio of the number of words that have been segmented to the total number of words in the entire text subjected to sensitive data recognition. If a statistics-based word segmentation algorithm is used directly, word segmentation takes too long, because the time complexity of the statistics-based algorithm is far higher than that of the dictionary-based algorithm.

Disclosure of Invention

The invention aims to disclose a data desensitization system that balances word segmentation efficiency and word segmentation success rate when text is segmented during data desensitization, reducing the time required for word segmentation while ensuring the success rate.

In order to achieve the above purpose, the present invention provides the following technical solutions:

a data desensitization system comprises an acquisition module and a word segmentation module;

the acquisition module is used for acquiring a text to be subjected to desensitization, wherein the text to be subjected to desensitization comprises a numerical text and a non-numerical text;

the word segmentation module comprises a dictionary word segmentation unit and a statistical word segmentation unit;

the dictionary word segmentation unit is used for segmenting the non-numerical text with an improved dictionary-based word segmentation algorithm, saving the obtained words to a first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set;

the statistical word segmentation unit is used for segmenting the text with incomplete word segmentation with a statistics-based word segmentation algorithm, and saving the obtained words to a second word segmentation set.

Preferably, the system further comprises a desensitization rule preservation module, which is used for preserving the preset desensitization processing rules for words of various types.

Preferably, the types of words include address class words, account class words, contact class words, and name class words.

Preferably, the desensitization processing rule of the words of the address class or the words of the name class is as follows:

replacing the words of the address class or the words of the name class by adopting randomly generated Chinese characters;

the desensitization processing rule for account-class words or contact-class words is as follows:

replacing the number strings corresponding to the account-class words or contact-class words with random numbers.
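The two rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the type labels, and the choice of the Unicode range U+4E00 to U+9FA5 as the pool of Chinese characters are all assumptions made for the example.

```python
import random

# Range of common CJK Unified Ideographs, used here as the character pool (assumption).
CJK_START, CJK_END = 0x4E00, 0x9FA5

def random_chinese(length, rng):
    """Return `length` randomly generated Chinese characters."""
    return "".join(chr(rng.randint(CJK_START, CJK_END)) for _ in range(length))

def desensitize(value, word_type, rng=None):
    """Apply the rule for `word_type` to `value` (a word or its number string)."""
    rng = rng or random.Random()
    if word_type in ("address", "name"):
        # Address-class and name-class words: replace with random Chinese characters.
        return random_chinese(len(value), rng)
    if word_type in ("account", "contact"):
        # Account-class and contact-class words: replace each digit with a random
        # digit; separators such as '-' keep their position (a design choice here).
        return "".join(rng.choice("0123456789") if ch.isdigit() else ch
                       for ch in value)
    raise ValueError(f"no desensitization rule for type {word_type!r}")

print(desensitize("13800138000", "contact"))
print(desensitize("张三", "name"))
```

Keeping the length and the separator positions unchanged makes the desensitized value look like the original format, which is convenient for test data; the source does not state whether this is required.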

Preferably, the number strings are obtained in the following manner:

taking the numerical text located directly above, directly below, to the left of, or to the right of the account-class word as the number string.

Preferably, the system further comprises a desensitization module, which is used for desensitizing the text to be desensitized based on the desensitization rules, the first word segmentation set, and the second word segmentation set, so as to obtain the desensitized text.

Preferably, segmenting the non-numerical text with the improved dictionary-based word segmentation algorithm, saving the obtained words to a first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set comprises the following steps:

S1, acquiring a dictionary for word matching;

S2, calculating the adaptive sentence length of the non-numerical text;

S3, segmenting the non-numerical text based on the adaptive sentence length, saving the obtained words to the first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set.

Preferably, calculating the adaptive sentence length for the non-numeric text includes:

S21, obtaining the number of lines of the non-numerical text;

S22, calculating the maximum value of the random numbers, the calculation taking as input the number of random numbers that need to be generated;

S23, generating the required number of random numbers within the value range bounded by the maximum value, and saving the obtained random numbers to a set of random numbers;

S24, based on the random numbers in the set of random numbers, selecting lines of text from the non-numerical text and saving them to a set of sampled lines;

S25, using a word segmentation algorithm based on a hidden Markov model to segment each line of text in the set of sampled lines, and saving the obtained words to a set of sampled words;

S26, counting the words of each length in the set of sampled words, and taking the length with the largest count as the adaptive sentence length of the non-numerical text.
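The sampling and counting idea in S21 to S26 can be sketched as below. This is a hedged illustration only: `random.Random.sample` stands in for the source's own random-number formulas, which are not reproduced here, and the `segment` parameter is a hypothetical callable standing in for the hidden-Markov-model segmenter.

```python
import random
from collections import Counter

def adaptive_sentence_length(lines, segment, sample_size, rng=None):
    """Estimate the most common word length from a random sample of lines.

    `segment` is any callable mapping a line to a list of words; the source
    uses a hidden-Markov-model segmenter, which is not implemented here.
    """
    rng = rng or random.Random()
    # Randomly pick lines (stand-in for steps S22 to S24).
    sample = rng.sample(lines, min(sample_size, len(lines)))
    # Count the words of each length in the sample (steps S25 and S26).
    counts = Counter(len(word) for line in sample for word in segment(line))
    if not counts:
        raise ValueError("the sample produced no words")
    # Length with the largest count; ties go to the shorter length (an assumption).
    return min(counts, key=lambda length: (-counts[length], length))

# Usage with whitespace splitting as a toy segmenter.
lines = ["ab cd ef", "gh ijk lm", "no pq"]
print(adaptive_sentence_length(lines, str.split, sample_size=3))  # 2
```

In the toy input, seven of the eight words have length 2, so the function returns 2 regardless of the sampling order.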

Preferably, word segmentation processing is performed on the non-numeric text based on the adaptive sentence length, the obtained words are saved to a first word segmentation set, and text with incomplete word segmentation is obtained based on the first word segmentation set, including:

S31, taking the leading characters of the non-numerical text to form the sentence to be segmented, the number of characters taken being calculated from the maximum length of the words in the dictionary and the adaptive sentence length using a remainder (mod) operation;

S32, segmenting the sentence to be segmented with a forward maximum matching algorithm or a reverse maximum matching algorithm, and saving the words that exist in the sentence to be segmented and belong to the dictionary to the first word segmentation set;

S33, denoting the words in the sentence to be segmented that do not belong to the dictionary as unmatched words;

S34, taking the characters before and after each unmatched word as characters of the text with incomplete word segmentation, the number of characters taken being determined by the number of characters contained in the unmatched word;

S35, deleting the words in the first word segmentation set and the characters of the text with incomplete word segmentation from the non-numerical text;

S36, judging whether characters still remain in the non-numerical text; if so, returning to S31; if not, outputting the first word segmentation set.
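The core of steps S31 to S36 is maximum matching over sentences of a chosen length. The sketch below shows forward maximum matching applied window by window. It is a simplification under stated assumptions: the sentence length is passed in as a parameter rather than computed by the source's formula, the windows are fixed rather than re-cut after each deletion, and the context characters of S34 are not collected. The function names are hypothetical.

```python
def forward_max_match(sentence, dictionary, max_word_len):
    """Greedy forward maximum matching over one sentence.

    Returns (matched, unmatched): the dictionary words found, and the runs of
    characters that no dictionary word covers.
    """
    matched, unmatched, pending, i = [], [], "", 0
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_word_len, len(sentence) - i), 0, -1):
            if sentence[i:i + size] in dictionary:
                if pending:
                    unmatched.append(pending)
                    pending = ""
                matched.append(sentence[i:i + size])
                i += size
                break
        else:
            pending += sentence[i]  # no dictionary word starts at position i
            i += 1
    if pending:
        unmatched.append(pending)
    return matched, unmatched

def segment_by_windows(text, dictionary, sentence_len):
    """Cut `text` into windows of `sentence_len` characters and match each one."""
    max_word_len = max(map(len, dictionary))
    first_set, incomplete = [], []
    for start in range(0, len(text), sentence_len):
        matched, unmatched = forward_max_match(
            text[start:start + sentence_len], dictionary, max_word_len)
        first_set.extend(matched)
        incomplete.extend(unmatched)
    return first_set, incomplete

dictionary = {"数据", "脱敏", "系统"}
print(segment_by_windows("数据脱敏系统山东", dictionary, sentence_len=4))
# (['数据', '脱敏', '系统'], ['山东'])
```

The unmatched run "山东" is what would be handed to the statistics-based stage.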

Preferably, the system further comprises a database module for storing the text to be desensitized.

Compared with the prior art, which uses a single word segmentation algorithm, the invention combines two different word segmentation algorithms to segment the text to be desensitized. It first segments the text with an improved dictionary-based word segmentation algorithm to obtain a first word segmentation set and the text with incomplete word segmentation, and then segments the text with incomplete word segmentation with a statistics-based word segmentation algorithm to obtain a second word segmentation set. Because the dictionary-based algorithm has lower time complexity, using it first yields most of the segmentation results quickly and maintains word segmentation efficiency. Words that do not exist in the dictionary are then segmented by the statistics-based algorithm, which has higher time complexity but does not rely on a dictionary, so the success rate of word segmentation is ensured.
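The two-stage arrangement described above can be sketched as a small orchestration function. This is only an outline under assumptions: both stages are passed in as hypothetical callables, and the toy stages in the usage example are placeholders, not the dictionary-based or statistics-based algorithms of the invention.

```python
def two_stage_segment(text, dictionary_stage, statistical_stage):
    """Run the cheap dictionary stage first, then the statistical stage on the rest.

    dictionary_stage(text)  -> (words, leftover_fragments)
    statistical_stage(frag) -> words
    Both callables are placeholders for the algorithms the source describes.
    """
    first_set, leftovers = dictionary_stage(text)
    second_set = []
    for fragment in leftovers:
        # Only the fragments the dictionary could not handle reach the slow stage.
        second_set.extend(statistical_stage(fragment))
    return first_set, second_set

def toy_dictionary_stage(text, dictionary=frozenset({"data", "system"})):
    """Toy stage: whitespace tokens found in the dictionary count as segmented."""
    tokens = text.split()
    return ([t for t in tokens if t in dictionary],
            [t for t in tokens if t not in dictionary])

first, second = two_stage_segment(
    "data masking system", toy_dictionary_stage, lambda frag: [frag])
print(first, second)  # ['data', 'system'] ['masking']
```

The point of the structure is that the expensive stage only ever sees the leftover fragments, so its cost scales with the amount of out-of-dictionary text rather than with the whole input.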

Drawings

The present disclosure will become more fully understood from the detailed description given herein below and the accompanying drawings, which are given by way of illustration only, and thus are not limiting of the present disclosure, and wherein:

FIG. 1 is a schematic diagram of a data desensitization system of the present invention;

FIG. 2 is a schematic diagram of a non-numeric text word segmentation process using the improved dictionary-based word segmentation algorithm of the present invention.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

In one embodiment as shown in fig. 1, the invention provides a data desensitization system comprising an acquisition module and a word segmentation module.

Preferably, the desensitization rule preservation module includes a setting unit and a storing unit;

the setting unit is used by staff who have passed face recognition to set the desensitization processing rules for each type of word;

the storing unit is used for storing the desensitization processing rules set in the setting unit by the staff who have passed face recognition.

Preferably, during face recognition the setting unit segments the face image of the staff member in the following manner:

acquiring the set of edge pixels in the face image;

randomly selecting a pixel from the set of edge pixels;

acquiring the set of pixels in a window of a given size centered on the selected pixel;

acquiring, from the pixels in the window, the set of pixels that satisfy the elliptical skin color model and are adjacent to pixels that do not satisfy the elliptical skin color model;

judging whether the pixels in that set all belong to the set of edge pixels, and if not, adding the pixels that do not belong to the set of edge pixels to the set of edge pixels;

forming a plurality of connected domains from the pixels in the set of edge pixels;

taking the connected domain with the largest area as the image of the face area.
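The final step, choosing the connected domain with the largest area, can be illustrated with a plain breadth-first search over a binary mask. This sketch covers only that last step; the edge detection and the elliptical skin color model are not implemented, and the use of 4-connectivity and a 0/1 mask as input are assumptions.

```python
from collections import deque

def largest_connected_region(mask):
    """Return the pixel coordinates of the largest 4-connected region of 1s.

    `mask` is a list of rows of 0/1 values; 4-connectivity is an assumption,
    since the source does not state which connectivity it uses.
    """
    rows, cols = len(mask), len(mask[0]) if mask else 0
    seen, best = set(), []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first search collects one connected region.
            region, queue = [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best

mask = [[1, 1, 0],
        [0, 0, 0],
        [0, 1, 1],
        [0, 1, 1]]
print(sorted(largest_connected_region(mask)))
# [(2, 1), (2, 2), (3, 1), (3, 2)]
```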

In the prior art, when a face image is segmented by edge recognition, the connected domains are generally identified directly after the edge pixels are obtained. However, in regions where the gray value distribution is complex, for example where the gray value variance is large, edge recognition cannot obtain accurate image edges. The invention therefore further segments the pixels of such regions with an elliptical skin color model, which improves the accuracy of the obtained image edges. This in turn improves the accuracy of the subsequently acquired image features and the success rate of face recognition.

Preferably, the setting unit is further configured to acquire image features of an image of the face area, and perform face recognition on the worker based on the acquired image features.

Specifically, the face recognition is performed on the staff based on the obtained image features, including:

matching the image features of the image of the face area against the pre-stored facial features of the personnel who are permitted to set the desensitization processing rules, and if the matching is successful, determining that the staff member has passed face recognition.

Specifically, the types of the words can also include words of the certificate class, such as keywords of certificates of driving license, property license and the like.

Specifically, a placeholder symbol may also be used to replace the words of the address class or the words of the name class. For example, when the address-class word is "district d, city b, province a", the desensitized data keeps the words "province", "city", and "district" and replaces each place name with the placeholder symbol. As another example, when the name-class word is "b city gas limited company", the desensitized data replaces the place name with the placeholder symbol and keeps "gas limited company". The industry information of the company can also be removed as required.

Preferably, the number strings are obtained in the following manner:

For account-class words such as "identity card number", the number string is likely to appear after the word and may also appear before it. In addition, if the text to be desensitized is table-like text, the number string may appear directly below or directly above the account-class word.

For contact-class words such as "mailbox" and "telephone", the processing rule is the same as for account-class words.
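The search for the number string around an account-class or contact-class word can be sketched with regular expressions. This is a simplified illustration: the function name is hypothetical, the search order (right, left, below, above) is an assumption, and "directly above or below" is approximated by the first run of digits on the adjacent row rather than by column alignment.

```python
import re

DIGITS = re.compile(r"\d+")

def find_number_string(lines, row, keyword):
    """Look for a digit string right of, left of, below, then above `keyword`.

    `lines` is the text split into rows; `row` is the row containing `keyword`.
    The search order is an assumption made for this sketch.
    """
    line = lines[row]
    pos = line.find(keyword)
    if pos < 0:
        return None
    # Right side: first run of digits after the keyword.
    after = DIGITS.search(line, pos + len(keyword))
    if after:
        return after.group()
    # Left side: the run of digits closest to the keyword.
    before = DIGITS.findall(line[:pos])
    if before:
        return before[-1]
    # Table-like text: check the row below, then the row above.
    for other in (row + 1, row - 1):
        if 0 <= other < len(lines):
            found = DIGITS.search(lines[other])
            if found:
                return found.group()
    return None

print(find_number_string(["身份证号: 370102199001011234"], 0, "身份证号"))
# 370102199001011234
```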

Specifically, the desensitization process is to randomly select a word from the first word segmentation set or the second word segmentation set, obtain a corresponding desensitization rule according to the type of the word, and perform the desensitization process on the word or the number string corresponding to the word based on the desensitization rule.

Preferably, as shown in fig. 2, segmenting the non-numerical text with the improved dictionary-based word segmentation algorithm, saving the obtained words to a first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set comprises:

S1, acquiring a dictionary for word matching;

S2, calculating the adaptive sentence length of the non-numerical text;

Specifically, words for word segmentation are stored in the dictionary, and word segmentation processing of the non-numeric text can be realized by matching words of various lengths in the non-numeric text with words in the dictionary.

S21, obtaining the number of lines of the non-numerical text;

S22, calculating the maximum value of the random numbers, the calculation taking as input the number of random numbers that need to be generated;

S24, based on the random numbers in the set of random numbers, selecting lines of text from the non-numerical text and saving them to a set of sampled lines;

In existing dictionary-based word segmentation algorithms, the maximum length of the words in the dictionary is generally taken as the length of the sentence to be segmented. However, the lengths of the most frequent words differ from text to text. If the maximum word length in the dictionary is used as the sentence length while most words in the text are much shorter than this maximum, a complete word is easily split across two sentences when the sentences are extracted. Such a word cannot be segmented correctly and must be handled later by the statistics-based word segmentation algorithm, which then has too much text to process, reducing the efficiency of word segmentation. The invention solves this problem by calculating an adaptive sentence length. The adaptive sentence length reflects the distribution of word lengths in the sampled lines of text. The hidden Markov model is not affected by the dictionary during word segmentation and can therefore be used to segment the sampled lines of text. Through the adaptive sentence length, the result of the statistics-based word segmentation algorithm influences the calculation of the dictionary-based word segmentation algorithm, which reduces the probability that a complete word is split across two sentences when sentences are extracted for dictionary-based word segmentation.

In addition, because the texts to be desensitized are of many types, it is difficult to record every word that may appear in the dictionary. Sampling and analyzing the non-numerical text reveals the word length that occurs most often, so that the length of the sentence to be segmented can be adjusted appropriately in the subsequent process, improving word segmentation efficiency.

Generating random numbers avoids the influence of periodically repeating non-numerical text on the calculation of the adaptive sentence length, and thereby improves the validity of the adaptive sentence length. When the non-numerical text repeats periodically and samples are taken at a fixed interval, the sampling interval may coincide with the repetition period of the text. The samples then contain only one type of line, so the adaptive sentence length cannot accurately represent the lengths of the words in the non-numerical text.
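The sampling argument can be checked with a toy experiment. The sketch below is only an illustration: the synthetic text, the period of 3, and the use of `random.Random.sample` in place of the source's random-number procedure are all assumptions.

```python
import random

def fixed_interval_sample(lines, interval):
    """Take every `interval`-th line, starting from the first."""
    return lines[::interval]

def random_sample(lines, count, rng):
    """Take `count` distinct lines at random positions."""
    return rng.sample(lines, count)

# A synthetic text whose line types repeat with period 3.
lines = [f"type-{i % 3}" for i in range(300)]

fixed = fixed_interval_sample(lines, 3)
randomized = random_sample(lines, 30, random.Random(42))

print(len(set(fixed)))       # 1: the interval matches the period, so only one type is seen
print(len(set(randomized)))  # random positions are expected to hit several types
```

With the interval equal to the period, the fixed-interval sample sees a single line type no matter how many lines it takes, which is exactly the failure the passage describes.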

In the present invention, the number of sampled lines is much smaller than the actual number of lines of the non-numerical text.

Specifically, selecting lines of text from the non-numerical text based on the random numbers in the set of random numbers and saving them to the set of sampled lines comprises:

obtaining a random number s from the set of random numbers;

calculating, from s, the line number of the line to be saved to the set of sampled lines, where the calculation refers to the line numbers, within the non-numerical text, of the k-th and (k-1)-th lines selected from the non-numerical text;

saving the line of the non-numerical text with the calculated line number to the set of sampled lines.

When calculating the length of the sentence to be segmented, the invention performs a remainder calculation against the adaptive sentence length. If the length used for the sentence is exactly an integer multiple of the adaptive sentence length, the probability that a complete word is split across two sentences is significantly lower than when it is not an integer multiple. This is because, when the sentence length is exactly an integer multiple of the adaptive sentence length, the probability that every word in the sentence to be segmented can be segmented correctly by the dictionary-based word segmentation algorithm is significantly higher.
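The integer-multiple argument can be illustrated by counting how many words straddle a sentence boundary for different sentence lengths. This is a toy model under assumptions: every word has the same length, and the text is cut into fixed windows, which is simpler than the iterative procedure of S31 to S36. The function name is hypothetical.

```python
def count_split_words(words, sentence_len):
    """Count words that straddle a boundary when text is cut every `sentence_len` chars."""
    split, offset = 0, 0
    for word in words:
        start, end = offset, offset + len(word) - 1
        # A word is split if its first and last characters fall in different windows.
        if start // sentence_len != end // sentence_len:
            split += 1
        offset += len(word)
    return split

words = ["ab"] * 10  # every word has length 2, so the most common length is 2
print(count_split_words(words, 4))  # 0: 4 is an integer multiple of 2
print(count_split_words(words, 5))  # 2: 5 is not, so some words cross a boundary
```

Real text has mixed word lengths, so an integer multiple of the most common length lowers the number of split words rather than eliminating them.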

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims

1. The data desensitization system is characterized by comprising an acquisition module and a word segmentation module;

2. The data desensitization system according to claim 1, further comprising a desensitization rule preservation module for preserving preset desensitization processing rules for words of multiple types.

3. A data desensitization system according to claim 2, wherein the types of words include address-like words, account-like words, contact-like words, and name-like words.

4. A data desensitisation system according to claim 3, wherein the rules for desensitising the words of the address class or the words of the name class are:

5. The data desensitization system according to claim 4, wherein the number strings are obtained by:

6. The data desensitization system according to claim 5, further comprising a desensitization module for desensitizing text to be desensitized based on the rule of desensitization, the first set of words, the second set of words, and obtaining desensitized text.

7. The data desensitization system according to claim 1, wherein segmenting the non-numerical text with the improved dictionary-based word segmentation algorithm, saving the obtained words to a first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set comprises:

S1, acquiring a dictionary for word matching;

S2, calculating the adaptive sentence length of the non-numerical text;

8. A data desensitization system according to claim 7, wherein calculating adaptive sentence lengths for non-numeric text comprises:

S21, obtaining the number of lines of the non-numerical text;

S22, calculating the maximum value of the random numbers, the calculation taking as input the number of random numbers that need to be generated;

S23, generating the required number of random numbers within the value range bounded by the maximum value, and saving the obtained random numbers to a set of random numbers;

S24, based on the random numbers in the set of random numbers, selecting lines of text from the non-numerical text and saving them to a set of sampled lines;

S25, using a word segmentation algorithm based on a hidden Markov model to segment each line of text in the set of sampled lines, and saving the obtained words to a set of sampled words;

9. The data desensitization system according to claim 8, wherein word segmentation processing is performed on the non-numeric text based on the adaptive sentence length, the obtained words are saved to a first word segmentation set, and text with incomplete word segmentation is obtained based on the first word segmentation set, comprising:

S34, taking the characters before and after each unmatched word as characters of the text with incomplete word segmentation, the number of characters taken being determined by the number of characters contained in the unmatched word;

10. A data desensitizing system according to claim 1, further comprising a database module for storing text to be desensitized.