CN108776762B

CN108776762B - Data desensitization processing method and device

Info

Publication number: CN108776762B
Application number: CN201810586230.9A
Authority: CN
Inventors: 林鸿; 欧阳红; 袁葆; 江再玉; 赵加奎; 熊根鑫; 王宇坤; 于喻; 宋振世; 王奕; 郑倩
Original assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Beijing China Power Information Technology Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Beijing China Power Information Technology Co Ltd
Priority date: 2018-06-08
Filing date: 2018-06-08
Publication date: 2022-01-28
Anticipated expiration: 2038-06-08
Also published as: CN108776762A

Abstract

The application provides a data desensitization processing method and device, which are used for determining the type of target data; calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data; and determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization treatment on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data. The target data is segmented to obtain data with a certain structure, desensitization treatment is carried out on the part with main sensitive information, and mask treatment is carried out on all or most of the sensitive information, so that the effectiveness of data desensitization is improved, the safety of data assets is guaranteed, the safety of client information is protected to the maximum extent, and client information leakage caused by abnormal inquiry, export and other modes is avoided.

Description

Data desensitization processing method and device

Technical Field

The invention relates to the technical field of data processing, in particular to a data desensitization processing method and device.

Background

To meet the requirements of the national Cybersecurity Law on protecting sensitive client information, and to safeguard the data assets and the legal rights and interests of power marketing clients, the sensitive information of power marketing clients is desensitized. The aim is to protect power client information to the greatest extent while still meeting normal business needs, and to prevent leakage of that information through abnormal queries, exports and similar channels.

At present, the main desensitization rules for power marketing data rely chiefly on masking: part of the information is retained and the length of the information is kept unchanged. The main rules are as follows:

(1) Contact address

Format: not fixed; a character string of variable length.

Desensitization rules: characters are retained in tiers according to length: for a length of 5 characters or fewer, the 1st character and the last 2 characters are retained; for a length of 6 to 9 characters, the last 5 characters are retained; for a length of 10 characters or more, the 4 characters preceding the last 5 characters are hidden; hidden characters are replaced with a mask character.

(2) Enterprise-type username

Format: the enterprise-type username is the company name as shown on the business license and consists of several Chinese characters.

Desensitization rules: characters are retained in tiers according to length: for a length of 4 characters or fewer, 1 character is retained at each of the head and the tail; for a length of 5 to 6 characters, 2 characters are retained at each of the head and the tail; for an odd length of 7 characters or more, the middle 3 characters are hidden; for an even length of 8 characters or more, the middle 4 characters are hidden; hidden characters are replaced with a mask character.
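The two existing tiered rules can be sketched as follows. This is only an illustration of the rules as worded above, not the rule engine actually deployed; the function names and the mask character `*` are assumptions, since the mask character is not specified in this text.

```python
MASK = "*"  # assumed mask character; the text does not specify it


def _mask_except(text, keep):
    """Replace every character whose index is not in `keep` with MASK."""
    return "".join(ch if i in keep else MASK for i, ch in enumerate(text))


def mask_contact_address(text):
    """Existing tiered rule for contact addresses (length is preserved)."""
    n = len(text)
    if n <= 5:
        keep = {0, n - 2, n - 1}              # 1st and last 2 characters
    elif n <= 9:
        keep = set(range(n - 5, n))           # last 5 characters
    else:
        hidden = set(range(n - 9, n - 5))     # the 4 characters before the last 5
        keep = set(range(n)) - hidden
    return _mask_except(text, keep)


def mask_enterprise_name(text):
    """Existing tiered rule for enterprise-type usernames."""
    n = len(text)
    if n <= 4:
        keep = {0, n - 1}                     # 1 character at head and tail
    elif n <= 6:
        keep = {0, 1, n - 2, n - 1}           # 2 characters at head and tail
    else:
        width = 3 if n % 2 else 4             # odd length: hide 3, even: hide 4
        start = (n - width) // 2
        keep = set(range(n)) - set(range(start, start + width))
    return _mask_except(text, keep)


print(mask_contact_address("ABCDEFGHIJKL"))   # a 12-character placeholder address
print(mask_enterprise_name("ABCDEFGH"))       # an 8-character placeholder name
```

Because both rules work purely on character positions, a long string keeps everything except a fixed 3 or 4 characters, regardless of which characters carry the sensitive keywords.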

The main disadvantages of the existing power marketing data desensitization rules are:

After electricity utilization address data and enterprise-type username data are desensitized under the current rules, keywords are not masked and remain visible. For example, under the desensitization rule for enterprise-type usernames, the desensitized username may still contain sensitive information because some keywords are retained, so the desensitization effect is weak. For example: Qingdao Huifeng Motor manufacturing company -> Qingdao Huifeng x company Limited; islands two zero two business services, Inc. -> Islands two X A Business, Inc.

Similar problems exist with the desensitization rule for contact addresses, as follows: 2-1-101 of the north Jue of the Shanchuan avenue in the middle of Jinan City in Shandong province, and 1-101 of the north Jue of the Shanchuan ave in the middle of Jinan City in Shandong province.

Disclosure of Invention

In view of the above, the invention discloses a processing method and a device for data desensitization, which are used for performing word segmentation on target data by calling a word segmentation reference word bank before data desensitization so as to realize more effective data desensitization.

In order to achieve the above purpose, the invention provides the following specific technical scheme:

a method of data desensitization processing, comprising:

determining the type of the target data;

calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;

and determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization treatment on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data.

Optionally, the method further includes:

and constructing a word segmentation reference word library, wherein the word segmentation reference word library comprises a plurality of sub-word libraries, and each sub-word library comprises one type of sensitive word.

Optionally, when the type of the target data is an electricity utilization address, the calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data, includes:

calling a general address sub-word library, a place name sub-word library, a cell name sub-word library and an administrative division set sub-word library, and performing word segmentation on the target data by adopting a maximum forward matching Chinese word segmentation method.

Optionally, when the type of the target data is an enterprise-type username, the calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data, includes:

calling a region set sub-word library, an industry set sub-word library and a company organization set sub-word library, and performing word segmentation by adopting a bidirectional maximum matching Chinese word segmentation method.

Optionally, before the determining a desensitization method of the target data according to the type of the target data and the length of the target data, the method further includes:

calculating the accuracy of the word segmentation result of the target data;

judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value or not;

if yes, executing the step of determining a desensitization method of the target data according to the type of the target data and the length of the target data;

and if not, performing word segmentation on the target data based on a hidden Markov model, and then executing the step of determining a desensitization method of the target data according to the type of the target data and the length of the target data.
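The accuracy check and fallback described above can be sketched as a small control-flow wrapper. This text does not state how the accuracy is computed or how the hidden Markov model is trained, so `dictionary_segment`, `hmm_segment` and `accuracy` are hypothetical callables supplied by the caller, and `threshold` stands for the first preset value.

```python
def segment_with_fallback(text, dictionary_segment, hmm_segment, accuracy, threshold):
    """Use the dictionary-based result if it is accurate enough, else fall back to the HMM.

    All three callables are placeholders (assumptions), not APIs defined by this document.
    """
    tokens = dictionary_segment(text)
    if accuracy(tokens) > threshold:
        return tokens              # result is good enough: proceed to desensitization
    return hmm_segment(text)       # otherwise re-segment with the HMM-based segmenter


# Stand-in callables, only to show the control flow.
def demo_accuracy(tokens):
    # Hypothetical score: share of tokens longer than one character.
    return sum(len(t) > 1 for t in tokens) / len(tokens) if tokens else 0.0


result = segment_with_fallback(
    "abcd",
    dictionary_segment=lambda t: list(t),   # pretends nothing matched the dictionary
    hmm_segment=lambda t: [t[:2], t[2:]],   # pretends the HMM found two words
    accuracy=demo_accuracy,
    threshold=0.5,
)
print(result)
```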

Optionally, when the type of the target data is an electricity utilization address, determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by using the desensitization method of the target data, includes:

judging whether the length of the target data is larger than a second preset value or not;

when the length of the target data is larger than the second preset value, determining that the desensitization method of the target data is a first electricity utilization address data desensitization method;

extracting, by adopting the first electricity utilization address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, the rest being the remaining data;

reserving the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain data after the target data is desensitized;

when the length of the target data is not greater than the second preset value, determining that the desensitization method of the target data is a second electricity utilization address data desensitization method;

and extracting the reserved part of the target data according to the length of the target data and a first stepped reservation rule by adopting the second electricity utilization address data desensitization method, and masking the remaining part of the target data to obtain the data after the target data is desensitized.

Optionally, when the type of the target data is an enterprise-type username, determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by using the desensitization method of the target data, including:

judging whether the length of the target data is larger than a third preset value or not;

when the length of the target data is larger than the third preset value, determining that the desensitization method of the target data is a first enterprise username data desensitization method;

extracting the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data by adopting the first enterprise username data desensitization method, to obtain the remaining data of the trade name data and the remaining data of the industry data;

masking the remaining data of the trade name data and the remaining data of the industry data, and reserving the other data of the target data, to obtain the data after the target data is desensitized;

when the length of the target data is not larger than the third preset value, determining that the desensitization method of the target data is a second enterprise username data desensitization method;

and extracting the reserved part of the target data according to the length of the target data and a second stepped reservation rule by adopting the second enterprise username data desensitization method, and masking the remaining part of the target data to obtain the data after the target data is desensitized.

A processing apparatus for data desensitization, comprising:

a type determination unit for determining a type of the target data;

the first word segmentation processing unit is used for calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;

and the desensitization processing unit is used for determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data.

Optionally, the apparatus further comprises:

the word bank building unit is used for building a participle reference word bank, the participle reference word bank comprises a plurality of sub-word banks, and each sub-word bank comprises one type of sensitive word.

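Optionally, when the type of the target data is an electricity utilization address, the first word segmentation processing unit is specifically configured to:

call a general address sub-word library, a place name sub-word library, a cell name sub-word library and an administrative division set sub-word library, and perform word segmentation on the target data by adopting a maximum forward matching Chinese word segmentation method.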

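Optionally, when the type of the target data is an enterprise-type username, the first word segmentation processing unit is specifically configured to:

call a region set sub-word library, an industry set sub-word library and a company organization set sub-word library, and perform word segmentation by adopting a bidirectional maximum matching Chinese word segmentation method.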

Optionally, the apparatus further comprises:

the calculation unit is used for calculating the accuracy of the word segmentation result of the target data;

the judging unit is used for judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value or not;

if yes, triggering the desensitization processing unit;

and if not, triggering a second word segmentation processing unit, wherein the second word segmentation processing unit is used for segmenting the target data based on the hidden Markov model and triggering the desensitization processing unit.

Optionally, when the type of the target data is an electrical address, the desensitization processing unit includes:

the first judgment subunit is used for judging whether the length of the target data is greater than a second preset value or not;

a first determining subunit, configured to determine that the desensitization method of the target data is a first electricity utilization address data desensitization method when the length of the target data is greater than the second preset value;

the first extraction subunit is used for extracting, by adopting the first electricity utilization address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, the rest being the remaining data;

the first desensitization processing subunit is used for reserving the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain the data after the target data is desensitized;

a second determining subunit, configured to determine that the desensitization method of the target data is a second electricity utilization address data desensitization method when the length of the target data is not greater than the second preset value;

and the second desensitization processing subunit is used for extracting the reserved part of the target data according to the length of the target data and a first stepped reservation rule by adopting the second electricity utilization address data desensitization method, and masking the remaining part of the target data to obtain data after the target data is desensitized.

Optionally, when the type of the target data is an enterprise-type username, the desensitization processing unit includes:

the second judgment subunit is used for judging whether the length of the target data is greater than a third preset value or not;

the third determining subunit is configured to determine that the desensitization method of the target data is a first enterprise-class username data desensitization method when the length of the target data is greater than the third preset value;

a second extraction subunit, configured to extract, by using the first enterprise-type username data desensitization method, the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, and obtain the remaining data of the trade name data and the remaining data of the industry data;

the third desensitization processing subunit is used for masking the remaining data of the trade name data and the remaining data of the industry data, reserving the other data of the target data, and obtaining data after the target data is desensitized;

a fourth determining subunit, configured to determine that the desensitization method of the target data is a second enterprise-class username data desensitization method when the length of the target data is not greater than the third preset value;

and the fourth desensitization processing subunit is configured to extract, by using the second enterprise-class username data desensitization method, the reserved portion of the target data according to the length of the target data and according to a second hierarchical reservation rule, and mask the remaining portion of the target data to obtain data after the target data is desensitized.

Compared with the prior art, the invention has the following beneficial effects:

according to the data desensitization processing method and device, before data desensitization, word segmentation is carried out on target data by calling the word segmentation reference word bank to obtain data with a certain structure, desensitization processing is carried out on parts with main sensitive information, all or most of the sensitive information is masked, and effectiveness of data desensitization is improved. The corresponding sub-word banks in the word segmentation reference word bank are called according to the type of the target data, word segmentation is carried out by adopting a word segmentation method corresponding to the type of the target data, the word segmentation accuracy is improved, a desensitization method of the target data is determined according to the type and the length of the target data, differential desensitization of different types of data with different lengths is realized, and the effectiveness of data desensitization is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a data desensitization processing method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a general address sub-word library disclosed in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a place name sub-word library disclosed in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a cell name sub-word library disclosed in an embodiment of the present invention;

FIG. 5 is a schematic diagram of an administrative division set sub-word library disclosed in an embodiment of the present invention;

FIG. 6 is a schematic diagram of a region set sub-word library disclosed in an embodiment of the present invention;

FIG. 7 is a schematic diagram of an industry set sub-word library disclosed in an embodiment of the present invention;

FIG. 8 is a schematic diagram of a company organization set sub-word library disclosed in an embodiment of the present invention;

FIG. 9 is a diagram illustrating a maximum forward matching Chinese word segmentation method according to an embodiment of the present invention;

FIG. 10 is a flowchart of a method for desensitizing processing of electrical address data according to an embodiment of the present invention;

FIG. 11 is a flowchart of a method for desensitizing enterprise-type username data according to an embodiment of the present invention;

FIG. 12 is a flowchart of another data desensitization processing method disclosed in an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of a processing apparatus for data desensitization according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, the present embodiment discloses a data desensitization processing method, which specifically includes the following steps:

S101: determining the type of the target data;

The target data is data which needs desensitization processing, and the types of the target data can include telephone data, address data, user name data, bank account data and the like.

S102: calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;

Word segmentation is the division of a sequence of Chinese characters into individual words, that is, the process of recombining a continuous character sequence into a word sequence according to a given specification.

In order to more accurately perform word segmentation on the target data, the corresponding sub-word libraries in the word segmentation reference word library are called according to the type of the target data to perform word segmentation on the target data.

It should be noted that the processing method for data desensitization further includes:

and constructing a word segmentation reference word bank.

The word segmentation reference word bank comprises a plurality of sub-word banks, and each sub-word bank comprises a type of sensitive word.

FIG. 2 to FIG. 8 respectively show the general address sub-word library, the place name sub-word library, the cell name sub-word library, the administrative division set sub-word library, the region set sub-word library, the industry set sub-word library and the company organization set sub-word library in the word segmentation reference word library.

Both the sub-word libraries that are called and the word segmentation method that is used depend on the type of the target data. For example, when the type of the target data is an electricity utilization address, the general address sub-word library, the place name sub-word library, the cell name sub-word library and the administrative division set sub-word library are called, and the target data is segmented by the maximum forward matching Chinese word segmentation method. When the type of the target data is an enterprise-type username, the region set sub-word library, the industry set sub-word library and the company organization set sub-word library are called, and word segmentation is performed by the bidirectional maximum matching Chinese word segmentation method.
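One way to picture the type-dependent selection is a lookup table keyed by data type. The sketch below is illustrative only: the key names, the in-memory dictionary layout and the sample entries are assumptions, and the real sub-word libraries are those shown in FIG. 2 to FIG. 8.

```python
# Which sub-word libraries and which segmentation method each data type uses.
SEGMENTATION_PROFILES = {
    "electricity_address": {
        "sub_libraries": ["general_address", "place_name", "cell_name",
                          "administrative_division"],
        "method": "forward_maximum_matching",
    },
    "enterprise_username": {
        "sub_libraries": ["region", "industry", "company_organization"],
        "method": "bidirectional_maximum_matching",
    },
}


def load_vocabulary(data_type, word_library):
    """Merge the sub-word libraries configured for `data_type` into one vocabulary."""
    profile = SEGMENTATION_PROFILES[data_type]
    vocab = set()
    for name in profile["sub_libraries"]:
        vocab |= set(word_library.get(name, ()))
    return vocab, profile["method"]


# Hypothetical sample contents of a word segmentation reference word library.
word_library = {
    "administrative_division": ["长沙市", "开福区"],
    "place_name": ["洪山街道"],
    "company_organization": ["有限公司"],
}
vocab, method = load_vocabulary("electricity_address", word_library)
print(sorted(vocab), method)
```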

As shown in FIG. 9, the maximum forward matching Chinese word segmentation algorithm is used to segment electricity utilization address data. The algorithm is as follows:

Starting from the left, successive characters of the target data are matched against the word list, and a word is cut out when a match is found. There is one complication: to obtain the maximum match, the text cannot be split at the first match. Suppose the text to be segmented is:

content[] = {"洪", "山", "街", "道", "双", "河", "社", "区", ……}

Word list: dit[] = {"长沙市" (Changsha City), "开福区" (Kaifu District), "洪山街道" (Hongshan Street), ……}

(1) Starting from content[1], when content[2] is scanned, "洪山" (Hongshan) is found to be a word in the word list dit[]. It cannot be split off yet, because it is not yet known whether the following characters form a longer word (maximum match);

(2) continuing to content[3], "洪山街" is found not to be a word in dit[]. It still cannot be concluded that the previously found "洪山" is the longest word, because "洪山街" is a prefix of dit[2], "洪山街道";

(3) content[4] is scanned, and "洪山街道" is found to be a word in dit[]. Scanning continues;

(4) when content[5] is scanned, "洪山街道双" is found to be neither a word in the word list nor a prefix of any word. The longest word found so far, "洪山街道", can therefore be cut out.
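The scan in steps (1) to (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name is invented, the word list is held in a Python set, and the sample vocabulary (including "洪山" and "双河社区") is hypothetical.

```python
def forward_max_match(text, vocab):
    """Split `text` left to right, always cutting the longest vocabulary word."""
    # Every prefix of every word: tells the scan whether a longer match is still possible.
    prefixes = {word[:i] for word in vocab for i in range(1, len(word) + 1)}
    tokens = []
    start = 0
    while start < len(text):
        longest = start + 1  # fall back to a single character if nothing matches
        end = start + 1
        # Keep extending while the candidate is still a word or a prefix of a word.
        while end <= len(text) and text[start:end] in prefixes:
            if text[start:end] in vocab:
                longest = end  # remember the longest complete word seen so far
            end += 1
        tokens.append(text[start:longest])
        start = longest
    return tokens


vocab = {"长沙市", "开福区", "洪山", "洪山街道", "双河社区"}
print(forward_max_match("长沙市开福区洪山街道双河社区", vocab))
# → ['长沙市', '开福区', '洪山街道', '双河社区']
```

The inner loop stops exactly where step (4) does: at the first candidate that is neither a word nor a prefix of a word.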

It follows that a maximum match is confirmed only when the string obtained by the next scan is neither a word in the word list nor a prefix of such a word. The remaining text is segmented by repeating the maximum forward matching algorithm. For example, the final word segmentation result of the address of "the current generation ten-thousand-country city three-period 10-two-unit 1706 of the great mountain street, the double river community, the Fuyuan West road 199 and the like in the Kangfu area of Changsha city" is as follows:

The bidirectional maximum matching Chinese word segmentation method is used to segment enterprise-type username data. The method first performs maximum forward matching and maximum reverse matching segmentation separately, then compares the two results and applies different strategies depending on how they differ, selecting one result for output on principles such as: the more large-granularity words the better, and the fewer non-dictionary words and single-character words the better.

The maximum forward matching Chinese word segmentation algorithm has been described in detail above. The maximum reverse matching Chinese word segmentation algorithm is similar; the difference is the scanning direction: substrings are taken from right to left for matching. The algorithm flow can be described as:

(1) inputting the preprocessed sentence content to be segmented, and initializing the index Index to the length of content;

(2) obtaining the word length of each sub-dictionary in the dictionary database;

(3) obtaining the length of the text still to be segmented and comparing it with the word length of the longest sub-dictionary; if the word length of the longest sub-dictionary exceeds the length of the remaining text, taking the length of the remaining text as the maximum length; otherwise, taking the word length of the longest sub-dictionary as the maximum length;

(4) using binary search to find the sub-dictionary whose word length equals the current maximum matching length; if it is found, going to (5); if not, reducing the maximum length by one and repeating (4);

(5) obtaining the candidate string SubStr and looking it up in the dictionary; if it is found, adding it to List; if it is not found, judging whether the length of SubStr is greater than 1; if so, deleting the last character of SubStr and repeating (5); otherwise, setting a segmentation mark and going to (6);

(6) judging whether Index is greater than 1; if so, going to (3); otherwise, saving List and exiting.
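A simplified sketch of right-to-left maximum matching is given below. It is not a line-by-line rendering of steps (1) to (6): the length-indexed sub-dictionaries and the binary search are replaced by a single Python set (an assumption made for brevity), and the candidate is shortened from its left end, which is the usual formulation when scanning from the right. The function name and the sample vocabulary are hypothetical.

```python
def backward_max_match(text, vocab):
    """Split `text` right to left, always cutting the longest vocabulary word."""
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    end = len(text)                      # plays the role of Index in the steps above
    while end > 0:
        size = min(max_len, end)         # never look further left than the text allows
        # Shrink the candidate until it is a known word or a single character.
        while size > 1 and text[end - size:end] not in vocab:
            size -= 1
        tokens.append(text[end - size:end])
        end -= size
    tokens.reverse()                     # words were collected from right to left
    return tokens


vocab = {"研究", "研究生", "生命", "命", "的", "起源"}
print(backward_max_match("研究生命的起源", vocab))
# → ['研究', '生命', '的', '起源']
```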

The bidirectional maximum matching algorithm combines the forward and reverse matching algorithms. The character string to be segmented is first segmented by the maximum forward matching algorithm and by the maximum reverse matching algorithm, and the two results are compared. When the results in the two directions are identical, either one is returned; when they differ, the result with the smaller number of words is returned; when the numbers of words are equal, the reverse result is returned. The bidirectional maximum matching Chinese word segmentation algorithm comprises the following steps:

(1) inputting the sentence content to be segmented;

(2) preprocessing content, then performing word segmentation with the maximum forward matching algorithm and the maximum reverse matching algorithm respectively, and comparing the word segmentation results; if they are completely the same, going to (3); if they differ, going to (4);

(3) selecting either word segmentation result, outputting it, and ending the algorithm;

(4) comparing whether the numbers of words are the same; if so, selecting the reverse word segmentation result, outputting it, and ending the algorithm; otherwise, selecting the word segmentation result with the smaller number of words for output, and ending the algorithm.
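Steps (1) to (4) can be sketched as below. The helper `max_match` is a compact greedy matcher written for this example (an assumption, using a plain set as the dictionary and omitting preprocessing); the selection logic in `bidirectional_max_match` follows the comparison rules stated above. The sample vocabulary is hypothetical.

```python
def max_match(text, vocab, reverse=False):
    """Greedy maximum matching; scans right to left when `reverse` is True."""
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    if reverse:
        end = len(text)
        while end > 0:
            size = min(max_len, end)
            while size > 1 and text[end - size:end] not in vocab:
                size -= 1
            tokens.insert(0, text[end - size:end])
            end -= size
    else:
        start = 0
        while start < len(text):
            size = min(max_len, len(text) - start)
            while size > 1 and text[start:start + size] not in vocab:
                size -= 1
            tokens.append(text[start:start + size])
            start += size
    return tokens


def bidirectional_max_match(text, vocab):
    forward = max_match(text, vocab)
    backward = max_match(text, vocab, reverse=True)
    if forward == backward:              # step (3): identical, either result will do
        return forward
    if len(forward) == len(backward):    # step (4): same word count, prefer reverse
        return backward
    # step (4): otherwise prefer the result with fewer words
    return forward if len(forward) < len(backward) else backward


vocab = {"研究", "研究生", "生命", "命", "的", "起源"}
print(max_match("研究生命的起源", vocab))                # forward:  ['研究生', '命', '的', '起源']
print(bidirectional_max_match("研究生命的起源", vocab))  # selected: ['研究', '生命', '的', '起源']
```

In this example both directions yield four words, so the reverse result is selected.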

S103: and determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization treatment on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data.

Referring to FIG. 10, when the type of the target data is an electricity utilization address, S103 is executed as follows:

S201: judging whether the length of the target data is larger than a second preset value or not; if yes, executing S202; if not, executing S203;

S202: determining that the desensitization method of the target data is a first electricity utilization address data desensitization method;

S204: extracting, by adopting the first electricity utilization address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, the rest being the remaining data;

S205: reserving the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain data after the target data is desensitized;

S203: determining that the desensitization method of the target data is a second electricity utilization address data desensitization method;

S206: extracting the reserved part of the target data according to the length of the target data and a first stepped reservation rule by adopting the second electricity utilization address data desensitization method, and masking the remaining part of the target data to obtain the data after the target data is desensitized.

For example, electricity utilization address data with a length of fewer than 10 characters is desensitized by the second electricity utilization address data desensitization method, with tiered retention by length: for a length of 5 characters or fewer, the 1st character and the last 2 characters are retained; for a length of 6 to 9 characters, the last 5 characters are retained.

Electricity utilization address data with a length of 10 characters or more is desensitized by the first electricity utilization address data desensitization method. An electricity utilization address generally consists of province, city, county, street/village committee/village, road, cell and house number. The last 5 characters of the house number are retained, the province, city and county are retained, and all other parts are replaced with X. For example:

central district, Jinan City, Shandong Province, Shanchuan avenue, North Jujue, three-way Qilu Ankang district 2-1-101 -> central district, Jinan City, Shandong Province, (street, road and residential community characters each replaced by X) 1-101.
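The two address rules described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the address has already been segmented into a region part (province, city, district/county), a middle part and a house number, and the function names and the default threshold of 9 characters (standing in for the second preset value) are hypothetical.

```python
MASK = "X"  # mask character used in the description


def mask_short_address(addr: str) -> str:
    """Stepped retention for short addresses (second method)."""
    n = len(addr)
    if n <= 5:
        if n <= 3:
            return addr  # assumption: nothing is left to mask
        # keep the 1st character and the last 2 characters
        return addr[0] + MASK * (n - 3) + addr[-2:]
    # 6 to 9 characters: keep only the last 5 characters
    return MASK * (n - 5) + addr[-5:]


def mask_long_address(region: str, middle: str, house_number: str) -> str:
    """First method: keep the region and the last 5 house-number characters."""
    kept_tail = house_number[-5:]
    hidden = len(middle) + len(house_number) - len(kept_tail)
    return region + MASK * hidden + kept_tail


def desensitize_address(region: str, middle: str, house_number: str,
                        second_preset: int = 9) -> str:
    """Choose the method by total length (threshold value is assumed)."""
    addr = region + middle + house_number
    if len(addr) > second_preset:
        return mask_long_address(region, middle, house_number)
    return mask_short_address(addr)
```

With placeholder segments, `desensitize_address("RegionA", "Road9", "2-1-101")` keeps `RegionA` and `1-101` and replaces the seven characters in between with `X`.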

Referring to fig. 11, when the type of the target data is an enterprise-type username, S103 is executed as follows:

S301: judging whether the length of the target data is greater than a third preset value; if yes, executing S302; if no, executing S303;

S302: determining that the desensitization method of the target data is a first enterprise username data desensitization method;

S304: using the first enterprise username data desensitization method, extracting the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, leaving the remaining data of the trade name data and the remaining data of the industry data;

S305: masking the remaining data of the trade name data and the remaining data of the industry data, and retaining the other data of the target data, to obtain the desensitized target data;

S303: determining that the desensitization method of the target data is a second enterprise username data desensitization method;

S306: using the second enterprise username data desensitization method, extracting the retained part of the target data according to the length of the target data and a second stepped retention rule, and masking the rest of the target data to obtain the desensitized target data.

For example, enterprise-type username data of 6 characters or fewer is desensitized by the second enterprise username data desensitization method, with stepped retention by length: for a length of 4 characters or fewer, 1 character is retained at the head and 1 at the tail; for a length of 5 to 6 characters, 2 characters are retained at the head and 2 at the tail.

Enterprise-type username data of more than 6 characters is desensitized by the first enterprise username data desensitization method. An enterprise-type username generally comprises four parts: region, trade name, industry and organizational form. The region and organizational form parts at the head and tail are retained, and the trade name and industry parts are masked. In the trade name part, the first character is retained and all other characters are replaced by X; in the industry part, the last character is retained and all other characters are replaced by X. For example:

Qingdao Huifeng Motor Manufacturing Company Limited -> Qingdao Hui xi company Limited;

Islands Two Zero Two Business Services, Inc. -> Islands two X A Business, Inc.
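The enterprise-type username rules above can be sketched in the same way. The sketch assumes the name has already been segmented into region, trade name, industry and organizational form; the function names and the placeholder strings in the usage note are hypothetical and are not taken from the original examples.

```python
MASK = "X"  # mask character, assumed to match the address examples


def mask_short_name(name: str) -> str:
    """Stepped retention for short enterprise-type usernames."""
    n = len(name)
    keep = 1 if n <= 4 else 2  # 4 or fewer: 1 per end; 5 to 6: 2 per end
    if n <= 2 * keep:
        return name  # assumption: nothing is left to mask
    return name[:keep] + MASK * (n - 2 * keep) + name[-keep:]


def mask_long_name(region: str, trade_name: str,
                   industry: str, org_form: str) -> str:
    """Keep region and organizational form; mask trade name and industry."""
    # trade name: keep the first character only
    masked_trade = trade_name[:1] + MASK * (len(trade_name) - 1)
    # industry: keep the last character only
    masked_industry = MASK * (len(industry) - 1) + industry[-1:]
    return region + masked_trade + masked_industry + org_form
```

For the placeholder segments `("QD", "HF", "AUTO", "LTD")`, `mask_long_name` keeps `QD` and `LTD` unchanged, reduces the trade name to `HX` and the industry to `XXXO`.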

In the data desensitization processing method disclosed in this embodiment, before desensitization the target data is segmented by calling the word segmentation reference word library, yielding data with a definite structure; the parts carrying the main sensitive information are desensitized and all or most of the sensitive information is masked, which improves the effectiveness of data desensitization. The corresponding sub-word library in the word segmentation reference word library is called according to the type of the target data, and word segmentation is performed with the method corresponding to that type, which improves segmentation accuracy. The desensitization method is determined by the type and length of the target data, so that data of different types and lengths is desensitized differently, further improving the effectiveness of data desensitization.

Referring to fig. 12, the present embodiment discloses another data desensitization processing method, which specifically includes the following steps:

S401: determining the type of the target data;

S402: calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;

S403: calculating the accuracy of the word segmentation result of the target data;

S404: judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value; if yes, executing S405; if no, executing S406;

S405: determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data;

S406: segmenting the target data based on a hidden Markov model, and then executing S405.
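The control flow of S402 to S406 can be sketched as below. This is only an illustration of the fallback logic: the two segmenters and the accuracy function are passed in as hypothetical callables, because this passage does not specify how the accuracy of a word segmentation result is computed.

```python
from typing import Callable, List

Segmenter = Callable[[str], List[str]]


def segment_with_fallback(
    text: str,
    fast_segmenter: Segmenter,
    hmm_segmenter: Segmenter,
    accuracy_of: Callable[[str, List[str]], float],
    first_preset: float,
) -> List[str]:
    """Return the segmentation that is passed on to desensitization (S405)."""
    words = fast_segmenter(text)                 # S402: dictionary-based pass
    if accuracy_of(text, words) > first_preset:  # S403 and S404
        return words
    return hmm_segmenter(text)                   # S406: HMM-based fallback
```

Passing the segmenters as parameters keeps the sketch independent of any particular word library or model.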

Chinese word segmentation is performed on the two types of data, enterprise-type usernames and power consumption addresses, using a hidden Markov model (HMM). When the training corpus is large enough and covers enough domains, the HMM algorithm can achieve high segmentation accuracy. This word segmentation algorithm models Chinese based on manually labeled parts of speech and statistical features, that is, the model parameters are estimated (trained) from observed data (the labeled corpus). In the word segmentation stage, the model is used to calculate the probability of each candidate segmentation, and the segmentation with the highest probability is taken as the final result. The HMM is a commonly used sequence labeling model; it handles ambiguity and unknown words well and performs better than string-matching-based segmentation.

The hidden Markov model is a doubly stochastic process: the specific state sequence is not known, only the state transition probabilities are, that is, the state transition process of the model is unobservable (hidden), while the random process of the observable events is a random function of the hidden state transition process.

The composition of the HMM includes:

the number of states in the model, $N$;

the number of distinct symbols that may be output from each state, $M$;

the state transition probability matrix $A = \{a_{ij}\}$, where $a_{ij}$ is the probability of a transition from state $C_i$ to state $C_j$;

the probability distribution matrix of observing a specific symbol $O_k$ from state $C_j$, $B = \{b_j(k)\}$, where the probability of observing a symbol is also called the symbol emission probability;

the probability distribution of the initial state, $\pi = \{\pi_i\}$.

In general, an HMM is denoted as a five-tuple $\mu = (C, O, A, B, \pi)$, where $C$ is the set of states, $O$ is the set of output symbols, and $\pi$, $A$ and $B$ are the initial state probability distribution, the state transition probabilities and the symbol emission probabilities, respectively.

Chinese word segmentation uses a corpus to train the HMM. Under the classical character-tagging model, the set of four label classes is $C = \{B, E, M, S\}$, with the following meanings:

B: beginning of a word

E: end of a word

M: middle of a word

S: single-character word

After the four types of labels are marked, an HMM can be established by a statistical method, in which the label of each character is influenced only by the label of the previous character. This yields the state transition matrix A and the symbol emission probability matrix B of the HMM, where:

In the formulas, $C = \{B, E, M, S\}$, $O$ = {character set}, and Count denotes frequency. When calculating $B_{ij}$, because the data is sparse, many characters do not appear in the training set, which produces entries of probability 0 in B. To fix this problem, add-one data smoothing is adopted, that is:
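The estimation formulas referred to here are not reproduced in this text. As a hedged sketch only, the usual count-based (maximum-likelihood) estimates that match the description would read as follows, writing $b_j(k)$ for the entry the text calls $B_{ij}$ and taking the add-one smoothing in its common Laplace form; the exact expressions of the original formulas are an assumption.

```latex
% count-based estimates of transition and emission probabilities
a_{ij} = \frac{\mathrm{Count}(C_i, C_j)}{\mathrm{Count}(C_i)}, \qquad
b_j(k) = \frac{\mathrm{Count}(O_k, C_j)}{\mathrm{Count}(C_j)}

% add-one (Laplace) smoothing of the emission probability
b_j(k) = \frac{\mathrm{Count}(O_k, C_j) + 1}{\mathrm{Count}(C_j) + |O|}
```

Here $\mathrm{Count}(C_i, C_j)$ is the number of times label $C_j$ immediately follows label $C_i$, $\mathrm{Count}(O_k, C_j)$ is the number of times character $O_k$ carries label $C_j$, and $|O|$ is the size of the character set.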

The initial vector is set to $\pi = \{0.5, 0.0, 0.0, 0.5\}$ over the states (B, E, M, S), since M and E cannot occur at the beginning of a sentence. At this point the HMM is fully built. Based on this HMM, the hidden state sequence over {B, E, M, S} for a given observation sequence is obtained using the Viterbi algorithm.
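The count-based training described above can be sketched as follows. This is an illustrative sketch, not the original implementation: the corpus format (a list of sentences, each a list of words), the function names, and the exact form of the add-one smoothing (adding the vocabulary size to the denominator) are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

STATES = "BEMS"


def tag_words(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (character, label) pairs."""
    pairs = []
    for w in words:
        if not w:
            continue
        if len(w) == 1:
            pairs.append((w, "S"))
        else:
            pairs.append((w[0], "B"))
            pairs.extend((c, "M") for c in w[1:-1])
            pairs.append((w[-1], "E"))
    return pairs


def train_hmm(corpus: List[List[str]]):
    """Estimate (pi, A, emission function) from a segmented corpus."""
    trans, emit = Counter(), Counter()
    state_count, from_count = Counter(), Counter()
    vocab = set()
    for words in corpus:
        pairs = tag_words(words)
        for i, (ch, tag) in enumerate(pairs):
            emit[(tag, ch)] += 1
            state_count[tag] += 1
            vocab.add(ch)
            if i + 1 < len(pairs):
                trans[(tag, pairs[i + 1][1])] += 1
                from_count[tag] += 1
    # transition probabilities: relative frequency of label bigrams
    A: Dict[str, Dict[str, float]] = {
        s: {t: (trans[(s, t)] / from_count[s] if from_count[s] else 0.0)
            for t in STATES}
        for s in STATES
    }
    size = len(vocab)

    def emission(state: str, ch: str) -> float:
        # add-one smoothing so unseen characters never get probability 0
        return (emit[(state, ch)] + 1) / (state_count[state] + size)

    # initial vector from the description: only B and S can start a sentence
    pi = {"B": 0.5, "E": 0.0, "M": 0.0, "S": 0.5}
    return pi, A, emission
```

Returning the emission probability as a function rather than a table lets unseen characters be smoothed at lookup time.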

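The Viterbi search algorithm, in its standard formulation, is:

1. Initialization: $\delta_1(i) = \pi_i b_i(O_1)$, $1 \le i \le N$.

Path variable with the highest probability: $\psi_1(i) = 0$.

2. Recursive calculation: $\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(O_t)$, $2 \le t \le T$, $1 \le j \le N$.

3. Memorizing the backtracking path: $\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]$.

4. Termination: $\hat{q}_T = \arg\max_{1 \le i \le N} \delta_T(i)$, $\hat{P} = \max_{1 \le i \le N} \delta_T(i)$.

Obtaining the path (state sequence) by backtracking: $\hat{q}_t = \psi_{t+1}(\hat{q}_{t+1})$, $t = T-1, T-2, \ldots, 1$.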
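A minimal sketch of the Viterbi search follows, assuming the model is supplied as an initial distribution `pi`, a nested transition dictionary `A` and an emission function `emit(state, symbol)`; these names and data structures are illustrative assumptions. Plain probabilities are used for readability, whereas a practical implementation would normally work with log-probabilities to avoid underflow on long inputs.

```python
from typing import Callable, Dict, List, Sequence


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    pi: Dict[str, float],
    A: Dict[str, Dict[str, float]],
    emit: Callable[[str, str], float],
) -> List[str]:
    """Return the most probable hidden state sequence for obs."""
    if not obs:
        return []
    # initialization: best score of each state for the first symbol
    delta = [{s: pi[s] * emit(s, obs[0]) for s in states}]
    psi = [{s: None for s in states}]
    # recursion: extend the best path into every state j at time t
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in states:
            best_i = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i  # remember the backtracking pointer
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * emit(j, obs[t])
    # termination: best final state
    path = [max(states, key=lambda s: delta[-1][s])]
    # backtracking: follow the stored pointers from the end to the start
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path
```

The two nested loops over states inside the loop over time steps make the quadratic dependence on the number of states visible.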

The time complexity of the Viterbi algorithm is $O(N^2 T)$. For example, the output state sequence for the address "1706, two-unit, 10-span, current generation ten-thousand city three-phase, Fuyuan Weilu 199, two rivers community, Hongshan street, open areas of Changsha city" is as follows:

“BMEBMEBMMEBMMEBMMEBMMEBMMMMMEBMEBMEBMME”

According to the state sequence, Chinese word segmentation can be carried out as follows:

The final Chinese word segmentation results are as follows:
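How a B/M/E/S state sequence maps to word boundaries can be sketched as follows: a word ends at every E or S label. The function name is hypothetical, and the usage note works with a placeholder string rather than the original address characters, which are not reproduced here.

```python
from typing import List


def split_by_tags(text: str, tags: str) -> List[str]:
    """Cut text into words using one B/M/E/S label per character."""
    if len(text) != len(tags):
        raise ValueError("text and tags must have the same length")
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):  # E closes a multi-character word, S a single one
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        # assumption: an unfinished trailing word is kept as-is
        words.append(text[start:])
    return words
```

Applied to any 39-character string with the state sequence given above, this yields ten segments of lengths 3, 3, 4, 4, 4, 4, 7, 3, 3 and 4.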

The data desensitization processing method disclosed in this embodiment first segments the target data using the forward maximum matching method or the bidirectional maximum matching Chinese word segmentation method, both of low algorithmic complexity, which ensures the processing speed of word segmentation. When the accuracy of the word segmentation result is lower than the threshold, the hidden Markov model, which has higher algorithmic complexity but higher segmentation accuracy, is used to segment the target data, which ensures the accuracy of the word segmentation result.
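As an illustration of the low-complexity first pass, forward maximum matching can be sketched as below. The lexicon is a plain set of words standing in for a sub-word library, and falling back to single characters when nothing matches is an assumption of this sketch.

```python
from typing import List, Optional, Set


def forward_max_match(text: str, lexicon: Set[str],
                      max_len: Optional[int] = None) -> List[str]:
    """Greedy left-to-right segmentation, taking the longest lexicon match."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first, then shrink it
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)  # a single character is the fallback
                i += size
                break
    return words
```

Bidirectional maximum matching would run the same procedure from the right as well and choose between the two results by a rule such as fewer words; that selection rule is not specified in this passage.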

Referring to fig. 13, the present embodiment correspondingly discloses a processing apparatus for data desensitization, which includes:

a type determining unit 501 for determining the type of the target data;

a first word segmentation processing unit 502, configured to call a corresponding sub-word bank in a word segmentation reference word bank according to the type of the target data, and perform word segmentation by using a word segmentation method corresponding to the type of the target data;

a desensitization processing unit 503, configured to determine a desensitization method for the target data according to the type of the target data and the length of the target data, and perform desensitization processing on sensitive data obtained after word segmentation of the target data by using the desensitization method for the target data.

Optionally, the apparatus further comprises:

Optionally, when the type of the target data is a power consumption address, the first word segmentation processing unit 502 is specifically configured to:

Optionally, when the type of the target data is an enterprise-type username, the first word segmentation processing unit 502 is specifically configured to:

Optionally, the apparatus further comprises:

if yes, triggering the desensitization processing unit;

Optionally, when the type of the target data is a power consumption address, the desensitization processing unit 503 includes:

Optionally, when the type of the target data is an enterprise-type username, the desensitization processing unit 503 includes:

According to the data desensitization processing device disclosed by the embodiment, before data desensitization, word segmentation is performed on target data by calling the word segmentation reference word bank to obtain data with a certain structure, desensitization processing is performed on parts with main sensitive information, all or most of the sensitive information is masked, and the effectiveness of data desensitization is improved. The corresponding sub-word banks in the word segmentation reference word bank are called according to the type of the target data, word segmentation is carried out by adopting a word segmentation method corresponding to the type of the target data, the word segmentation accuracy is improved, a desensitization method of the target data is determined according to the type and the length of the target data, differential desensitization of different types of data with different lengths is realized, and the effectiveness of data desensitization is improved.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of data desensitization processing, comprising:

determining the type of the target data;

determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization treatment on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data;

when the type of the target data is a power consumption address, the determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data, comprises: judging whether the length of the target data is greater than a second preset value; when the length of the target data is greater than the second preset value, determining that the desensitization method of the target data is a first power consumption address data desensitization method; extracting, by adopting the first power consumption address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data to obtain the remaining data; retaining the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain the desensitized target data; when the length of the target data is not greater than the second preset value, determining that the desensitization method of the target data is a second power consumption address data desensitization method; and extracting, by adopting the second power consumption address data desensitization method, a retained part of the target data according to the length of the target data and a first stepped retention rule, and masking the rest of the target data to obtain the desensitized target data.

2. The method of claim 1, further comprising:

3. The method according to claim 1, wherein when the type of the target data is a power consumption address, the calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by using a word segmentation method corresponding to the type of the target data comprises:

4. The method according to claim 1, wherein when the type of the target data is an enterprise-type username, the calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by using a word segmentation method corresponding to the type of the target data comprises:

5. The method of claim 1, wherein prior to the determining a desensitization method of the target data based on the type of the target data and the length of the target data, the method further comprises:

calculating the accuracy of the word segmentation result of the target data;

6. The method according to claim 1, wherein when the type of the target data is an enterprise-class username, determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by using the desensitization method of the target data comprises:

7. A processing apparatus for desensitizing data, comprising:

a type determination unit for determining a type of the target data;

the desensitization processing unit is used for determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data;

wherein, when the type of the target data is a power consumption address, the desensitization processing unit includes: a first judgment subunit, configured to judge whether the length of the target data is greater than a second preset value; a first determining subunit, configured to determine that the desensitization method of the target data is a first power consumption address data desensitization method when the length of the target data is greater than the second preset value; a first extraction subunit, configured to extract, by adopting the first power consumption address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data to obtain the remaining data; a first desensitization processing subunit, configured to retain the last 5 characters of the house number data and the province, city, district and county data, and to mask the remaining data of the target data to obtain the desensitized target data; a second determining subunit, configured to determine that the desensitization method of the target data is a second power consumption address data desensitization method when the length of the target data is not greater than the second preset value; and a second desensitization processing subunit, configured to extract, by adopting the second power consumption address data desensitization method, the retained part of the target data according to the length of the target data and a first stepped retention rule, and to mask the rest of the target data to obtain the desensitized target data.

8. The apparatus of claim 7, further comprising:

9. The apparatus according to claim 7, wherein when the type of the target data is a power consumption address, the first word segmentation processing unit is specifically configured to:

10. The apparatus according to claim 7, wherein when the type of the target data is an enterprise-type username, the first word segmentation processing unit is specifically configured to:

11. The apparatus of claim 7, further comprising:

if yes, triggering the desensitization processing unit;

12. The apparatus according to claim 7, wherein when the type of the target data is an enterprise-class username, the desensitization processing unit comprises: