CN108776762A

CN108776762A - Processing method and device for data desensitization

Info

Publication number: CN108776762A
Application number: CN201810586230.9A
Authority: CN
Inventors: 林鸿; 欧阳红; 袁葆; 江再玉; 赵加奎; 熊根鑫; 王宇坤; 于喻; 宋振世; 王奕; 郑倩
Original assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Beijing China Power Information Technology Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Beijing China Power Information Technology Co Ltd
Priority date: 2018-06-08
Filing date: 2018-06-08
Publication date: 2018-11-09
Anticipated expiration: 2038-06-08
Also published as: CN108776762B

Abstract

This application provides a processing method and device for data desensitization. The type of the target data is determined; the corresponding sub-dictionary in a word-segmentation benchmark dictionary is called according to the type of the target data, and the target data is segmented using the word segmentation method corresponding to that type; the desensitization method for the target data is determined according to the type and the length of the target data, and the sensitive data obtained after segmentation is desensitized using that method. Segmenting the target data yields data with a definite structure, so that the parts carrying sensitive information can be desensitized and all or most of the sensitive information is masked. This improves the effectiveness of data desensitization, safeguards data assets, protects customer information to the greatest extent, and avoids leakage of customer information through improper querying, export and similar means.

Description

Processing method and device for data desensitization

Technical field

The present invention relates to the technical field of data processing, and in particular to a processing method and device for data desensitization.

Background

To implement the requirements of the national Cybersecurity Law on protecting sensitive customer information, to safeguard the data assets of power marketing customers and to protect their legitimate rights and interests, the sensitive information of power marketing customers needs to be desensitized. The aim is to protect electricity customer information to the greatest extent while still meeting normal business needs, and to avoid leakage of electricity customer information through improper querying, export and similar means.

At present, power marketing data desensitization mainly uses the mask desensitization method, which retains part of the information and keeps the length of the information unchanged. The main rules are as follows:

(1) Contact address

Format: the format is not fixed; the value is a character string of arbitrary length.

Desensitization rule: characters are retained in steps by length. For a length of 5 characters or fewer, the 1st character and the last 2 characters are retained; for a length of 6 to 9 characters, the last 5 characters are retained; for a length of 10 characters or more, the 4 characters before the last 5 characters are hidden. Hidden characters are replaced with *.

(2) Enterprise-type account name

Format: the enterprise-type account name is consistent with the business license; it is the company name and consists of several Chinese characters.

Desensitization rule: characters are retained in steps by length. For a length of 4 characters or fewer, 1 character is retained at each end; for a length of 5 to 6 characters, 2 characters are retained at each end; for an odd length of 7 characters or more, the middle 3 characters are hidden; for an even length of 8 characters or more, the middle 4 characters are hidden. Hidden characters are replaced with *.
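The two stepped rules above can be sketched in Python as follows. This is a minimal illustration and not code from the patent: the function names and the handling of strings that are too short to mask are assumptions, and each Python character stands for one character of the original Chinese text.

```python
MASK = "*"


def mask_address_legacy(addr: str) -> str:
    """Stepped retention for contact addresses (existing rule 1)."""
    n = len(addr)
    if n <= 5:
        # Keep the 1st character and the last 2; hide whatever lies between.
        keep_tail = min(2, max(n - 1, 0))
        return addr[:1] + MASK * (n - 1 - keep_tail) + addr[n - keep_tail:]
    if n <= 9:
        # Keep only the last 5 characters.
        return MASK * (n - 5) + addr[-5:]
    # 10 or more: hide the 4 characters just before the last 5.
    return addr[:-9] + MASK * 4 + addr[-5:]


def mask_enterprise_legacy(name: str) -> str:
    """Stepped retention for enterprise-type account names (existing rule 2)."""
    n = len(name)
    if n >= 7:
        # Odd length hides the middle 3 characters, even length the middle 4.
        hidden = 3 if n % 2 else 4
        start = (n - hidden) // 2
        return name[:start] + MASK * hidden + name[start + hidden:]
    keep = 1 if n <= 4 else 2  # characters kept at each end
    if n <= 2 * keep:
        return name  # assumption: nothing is left to hide
    return name[:keep] + MASK * (n - 2 * keep) + name[n - keep:]
```

Because both rules look only at character positions, they cannot tell a key word from a non-key word, which is the weakness discussed next.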

The major defect of the existing power marketing data desensitization rules is as follows:

After electricity-consumption addresses and enterprise-type account names are desensitized according to the current rules, non-key words are masked while key words remain visible. For example, under the desensitization rule for enterprise-type account names, the desensitized account name may still contain sensitive information, part of the key words is retained, and the desensitization effect is not obvious, as shown below: Qingdao Hui Feng Motor Manufacturing Co., Ltd. -> Qingdao Hui Feng **** Co., Ltd.; Qingdao 2020 Commerce Services Co., Ltd. -> Qingdao Two ****** business Co., Ltd.

The desensitization rule for contact addresses has a similar problem, as shown below: Shandong Province, Jinan City, Shizhong District, Mountains-and-Rivers Street, Overline Bridge North Neighbourhood Committee, Latitude Three Road, Shandong Ankang Garden Community 2-1-101 -> Shandong Province, Jinan City, Shizhong District, Mountains-and-Rivers Street, Overline Bridge North Neighbourhood Committee, Latitude Three Road, Shandong Ankang Garden ****1-101.

Summary of the invention

In view of this, the invention discloses a processing method and device for data desensitization. Before desensitization, the target data is segmented by calling a word-segmentation benchmark dictionary, which makes the desensitization more effective.

To achieve the above object, the specific technical solution provided by the present invention is as follows:

A processing method for data desensitization, including:

determining the type of target data;

calling the corresponding sub-dictionary in a word-segmentation benchmark dictionary according to the type of the target data, and performing word segmentation using the segmentation method corresponding to the type of the target data;

determining the desensitization method for the target data according to the type of the target data and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained after the target data is segmented.

Optionally, the method further includes:

building the word-segmentation benchmark dictionary, which includes multiple sub-dictionaries, each sub-dictionary containing one type of sensitive word.

Optionally, when the type of the target data is electricity-consumption address, calling the corresponding sub-dictionary in the word-segmentation benchmark dictionary according to the type of the target data and performing word segmentation using the segmentation method corresponding to that type includes:

calling the general-address sub-dictionary, the place-name sub-dictionary, the community-name sub-dictionary and the administrative-division-set sub-dictionary, and segmenting the target data using the forward maximum matching Chinese word segmentation method.

Optionally, when the type of the target data is enterprise-type account name, calling the corresponding sub-dictionary in the word-segmentation benchmark dictionary according to the type of the target data and performing word segmentation using the segmentation method corresponding to that type includes:

calling the region-set sub-dictionary, the industry-set sub-dictionary and the company-organization-set sub-dictionary, and segmenting the target data using the bidirectional maximum matching Chinese word segmentation method.

Optionally, before the desensitization method for the target data is determined according to the type of the target data and the length of the target data, the method further includes:

calculating the accuracy of the segmentation result of the target data;

judging whether the accuracy of the segmentation result of the target data is greater than a first preset value;

if so, performing the step of determining the desensitization method for the target data according to the type and the length of the target data;

if not, segmenting the target data based on a hidden Markov model, and then performing the step of determining the desensitization method for the target data according to the type and the length of the target data.
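The accuracy check and fallback described above can be sketched as follows. The text does not say how accuracy is computed, so the `dictionary_coverage` score (share of characters that fall inside dictionary words), the default threshold of 0.8 and all function names are assumptions made for illustration; the hidden Markov model segmenter is passed in as a callable rather than implemented.

```python
from typing import Callable, List, Set

Segmenter = Callable[[str], List[str]]


def dictionary_coverage(tokens: List[str], vocabulary: Set[str]) -> float:
    """Hypothetical accuracy score: fraction of characters covered by dictionary words."""
    total = sum(len(t) for t in tokens)
    if total == 0:
        return 0.0
    return sum(len(t) for t in tokens if t in vocabulary) / total


def segment_with_fallback(
    text: str,
    dictionary_segmenter: Segmenter,
    hmm_segmenter: Segmenter,
    vocabulary: Set[str],
    first_preset: float = 0.8,  # assumed value of the "first preset value"
) -> List[str]:
    """Keep the dictionary-based result if it scores above the threshold, else re-segment."""
    tokens = dictionary_segmenter(text)
    if dictionary_coverage(tokens, vocabulary) > first_preset:
        return tokens
    return hmm_segmenter(text)
```

Either branch ends with a token list, so the later step that picks a desensitization method does not need to know which segmenter produced it.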

Optionally, when the type of the target data is electricity-consumption address, determining the desensitization method for the target data according to the type and the length of the target data, and using that method to desensitize the sensitive data obtained after the target data is segmented, includes:

judging whether the length of the target data is greater than a second preset value;

when the length of the target data is greater than the second preset value, determining that the desensitization method for the target data is a first electricity-consumption address data desensitization method;

using the first electricity-consumption address data desensitization method, extracting from the segmentation result of the target data the last 5 characters of the door number data and the province, city and district/county data, and obtaining the remaining data;

retaining the last 5 characters of the door number and the province, city and district/county data, and masking the remaining data of the target data, to obtain the desensitized target data;

when the length of the target data is not greater than the second preset value, determining that the desensitization method for the target data is a second electricity-consumption address data desensitization method;

using the second electricity-consumption address data desensitization method, extracting the retained part of the target data by a first stepped retention rule according to the length of the target data, and masking the remaining part of the target data, to obtain the desensitized target data.

Optionally, when the type of the target data is enterprise-type account name, determining the desensitization method for the target data according to the type and the length of the target data, and using that method to desensitize the sensitive data obtained after the target data is segmented, includes:

judging whether the length of the target data is greater than a third preset value;

when the length of the target data is greater than the third preset value, determining that the desensitization method for the target data is a first enterprise-type account name data desensitization method;

using the first enterprise-type account name data desensitization method, extracting from the segmentation result of the target data the first character of the trade name data and the last character of the industry data, and obtaining the remaining data of the trade name data and the remaining data of the industry data;

masking the remaining data of the trade name data and the remaining data of the industry data, and retaining the other data of the target data, to obtain the desensitized target data;

when the length of the target data is not greater than the third preset value, determining that the desensitization method for the target data is a second enterprise-type account name data desensitization method;

using the second enterprise-type account name data desensitization method, extracting the retained part of the target data by a second stepped retention rule according to the length of the target data, and masking the remaining part of the target data, to obtain the desensitized target data.

A processing device for data desensitization, including:

a type determining unit, for determining the type of target data;

a first word segmentation processing unit, for calling the corresponding sub-dictionary in a word-segmentation benchmark dictionary according to the type of the target data, and performing word segmentation using the segmentation method corresponding to the type of the target data;

a desensitization processing unit, for determining the desensitization method for the target data according to the type of the target data and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained after the target data is segmented.

Optionally, the device further includes:

a dictionary construction unit, for building the word-segmentation benchmark dictionary, which includes multiple sub-dictionaries, each sub-dictionary containing one type of sensitive word.

Optionally, when the type of the target data is electricity-consumption address, the first word segmentation processing unit is specifically used for: calling the general-address sub-dictionary, the place-name sub-dictionary, the community-name sub-dictionary and the administrative-division-set sub-dictionary, and segmenting the target data using the forward maximum matching Chinese word segmentation method.

Optionally, when the type of the target data is enterprise-type account name, the first word segmentation processing unit is specifically used for: calling the region-set sub-dictionary, the industry-set sub-dictionary and the company-organization-set sub-dictionary, and segmenting the target data using the bidirectional maximum matching Chinese word segmentation method.

Optionally, the device further includes:

a computing unit, for calculating the accuracy of the segmentation result of the target data;

a judging unit, for judging whether the accuracy of the segmentation result of the target data is greater than a first preset value;

if so, triggering the desensitization processing unit;

if not, triggering a second word segmentation processing unit, which is used for segmenting the target data based on a hidden Markov model and then triggering the desensitization processing unit.

Optionally, when the type of the target data is electricity-consumption address, the desensitization processing unit includes:

a first judging subunit, for judging whether the length of the target data is greater than a second preset value;

a first determining subunit, for determining, when the length of the target data is greater than the second preset value, that the desensitization method for the target data is a first electricity-consumption address data desensitization method;

a first extracting subunit, for using the first electricity-consumption address data desensitization method to extract from the segmentation result of the target data the last 5 characters of the door number data and the province, city and district/county data, and to obtain the remaining data;

a first desensitization processing subunit, for retaining the last 5 characters of the door number and the province, city and district/county data, and masking the remaining data of the target data, to obtain the desensitized target data;

a second determining subunit, for determining, when the length of the target data is not greater than the second preset value, that the desensitization method for the target data is a second electricity-consumption address data desensitization method;

a second desensitization processing subunit, for using the second electricity-consumption address data desensitization method to extract the retained part of the target data by a first stepped retention rule according to the length of the target data, and to mask the remaining part of the target data, to obtain the desensitized target data.

Optionally, when the type of the target data is enterprise-type account name, the desensitization processing unit includes:

a second judging subunit, for judging whether the length of the target data is greater than a third preset value;

a third determining subunit, for determining, when the length of the target data is greater than the third preset value, that the desensitization method for the target data is a first enterprise-type account name data desensitization method;

a second extracting subunit, for using the first enterprise-type account name data desensitization method to extract from the segmentation result of the target data the first character of the trade name data and the last character of the industry data, and to obtain the remaining data of the trade name data and the remaining data of the industry data;

a third desensitization processing subunit, for masking the remaining data of the trade name data and the remaining data of the industry data, and retaining the other data of the target data, to obtain the desensitized target data;

a fourth determining subunit, for determining, when the length of the target data is not greater than the third preset value, that the desensitization method for the target data is a second enterprise-type account name data desensitization method;

a fourth desensitization processing subunit, for using the second enterprise-type account name data desensitization method to extract the retained part of the target data by a second stepped retention rule according to the length of the target data, and to mask the remaining part of the target data, to obtain the desensitized target data.

Compared with the prior art, the beneficial effects of the present invention are as follows:

In the processing method and device for data desensitization provided by the invention, the target data is segmented before desensitization by calling a word-segmentation benchmark dictionary, which yields data with a definite structure. The parts carrying sensitive information are then desensitized, so that all or most of the sensitive information is masked and the effectiveness of data desensitization is improved. The corresponding sub-dictionary in the benchmark dictionary is called according to the type of the target data, and segmentation uses the method corresponding to that type, which improves segmentation accuracy. The desensitization method is determined according to the type and length of the target data, so that data of different types and lengths are desensitized differently, which further improves the effectiveness of data desensitization.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.

Fig. 1 is a flowchart of a processing method for data desensitization disclosed by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the general-address sub-dictionary disclosed by an embodiment of the present invention;

Fig. 3 is a schematic diagram of the place-name sub-dictionary disclosed by an embodiment of the present invention;

Fig. 4 is a schematic diagram of the community-name sub-dictionary disclosed by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the administrative-division-set sub-dictionary disclosed by an embodiment of the present invention;

Fig. 6 is a schematic diagram of the region-set sub-dictionary disclosed by an embodiment of the present invention;

Fig. 7 is a schematic diagram of the industry-set sub-dictionary disclosed by an embodiment of the present invention;

Fig. 8 is a schematic diagram of the company-organization-set sub-dictionary disclosed by an embodiment of the present invention;

Fig. 9 is a schematic diagram of the forward maximum matching Chinese word segmentation method disclosed by an embodiment of the present invention;

Fig. 10 is a flowchart of the electricity-consumption address data desensitization method disclosed by an embodiment of the present invention;

Fig. 11 is a flowchart of the enterprise-type account name data desensitization method disclosed by an embodiment of the present invention;

Fig. 12 is a flowchart of another processing method for data desensitization disclosed by an embodiment of the present invention;

Fig. 13 is a structural diagram of a processing device for data desensitization disclosed by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, this embodiment discloses a processing method for data desensitization, which includes the following steps:

S101: Determine the type of target data;

Target data is data that needs to be desensitized. The type of target data may include telephone data, address data, user-name data, bank-account data and so on.

S102: Call the corresponding sub-dictionary in the word-segmentation benchmark dictionary according to the type of the target data, and perform word segmentation using the segmentation method corresponding to the type of the target data;

Word segmentation cuts a sequence of Chinese characters into individual words. It is the process of recombining a continuous character sequence into a word sequence according to a given specification.

To segment the target data more accurately, the corresponding sub-dictionary in the word-segmentation benchmark dictionary is called according to the type of the target data to segment the target data.

It should be noted that the processing method for data desensitization further includes:

building the word-segmentation benchmark dictionary.

The word-segmentation benchmark dictionary includes multiple sub-dictionaries, each containing one type of sensitive word.

Figs. 2 to 8 show, respectively, the general-address sub-dictionary, the place-name sub-dictionary, the community-name sub-dictionary, the administrative-division-set sub-dictionary, the region-set sub-dictionary, the industry-set sub-dictionary and the company-organization-set sub-dictionary of the word-segmentation benchmark dictionary.

To segment the target data more accurately, the corresponding sub-dictionaries in the benchmark dictionary are called according to the type of the target data, and segmentation uses the method corresponding to that type. For example, when the type of the target data is electricity-consumption address, the general-address, place-name, community-name and administrative-division-set sub-dictionaries are called, and the target data is segmented using the forward maximum matching Chinese word segmentation method. When the type of the target data is enterprise-type account name, the region-set, industry-set and company-organization-set sub-dictionaries are called, and the target data is segmented using the bidirectional maximum matching Chinese word segmentation method.
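One way to picture the benchmark dictionary and the type-dependent selection of sub-dictionaries is the sketch below. The seven sub-dictionary names follow Figs. 2 to 8; the dictionary entries, the identifiers and the `vocabulary_for` helper are illustrative assumptions, since the contents of the figures are not reproduced in this text.

```python
from typing import Dict, FrozenSet, Tuple

# Entries are illustrative placeholders, not the contents of Figs. 2 to 8.
BENCHMARK_DICTIONARY: Dict[str, FrozenSet[str]] = {
    "general_address": frozenset({"路", "号", "单元"}),
    "place_name": frozenset({"洪山", "洪山街道"}),
    "community_name": frozenset({"双河社区"}),
    "administrative_division": frozenset({"长沙", "开福区"}),
    "region": frozenset({"青岛"}),
    "industry": frozenset({"制造", "服务"}),
    "company_organization": frozenset({"有限公司"}),
}

# Which sub-dictionaries each data type calls, as stated in the text.
SUB_DICTIONARIES_BY_TYPE: Dict[str, Tuple[str, ...]] = {
    "electricity_address": (
        "general_address",
        "place_name",
        "community_name",
        "administrative_division",
    ),
    "enterprise_account_name": ("region", "industry", "company_organization"),
}


def vocabulary_for(data_type: str) -> FrozenSet[str]:
    """Merge the sub-dictionaries that apply to one data type into a single word set."""
    names = SUB_DICTIONARIES_BY_TYPE[data_type]
    return frozenset().union(*(BENCHMARK_DICTIONARY[name] for name in names))
```

Keeping the sub-dictionaries separate also preserves the category of each matched word (for example region versus industry), which the later desensitization steps rely on.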

As shown in Fig. 9, electricity-consumption address data is segmented using the forward maximum matching Chinese word segmentation algorithm. The algorithm is as follows:

Consecutive characters of the target data are matched against the vocabulary from left to right; when a match is found, a word is cut off. There is one catch: to achieve a maximum match, the text cannot simply be cut at the first match. Take the following text to be segmented, in which each element is one Chinese character (shown here in pinyin):

content[] = {"Hong", "Shan", "Jie", "Dao", "Shuang", "He", "She", "Qu", ...}

Vocabulary: dict[] = {"Changsha", "Kaifu District", "Hong Shan", "Hong Shan Jiedao" (Hong Shan Street), ...}

(1) Starting from content[1], when the scan reaches content[2], "Hong Shan" is found in the vocabulary dict[]. It cannot be cut off yet, because the following characters may form a longer word (maximum match);

(2) Scanning continues to content[3]; "Hong Shan Jie" is not a word in dict[]. It still cannot be concluded that the previously found "Hong Shan" is the longest word, because "Hong Shan Jie" is a prefix of a word in dict[];

(3) Scanning content[4] finds that "Hong Shan Jiedao" is a word in dict[]. Scanning continues;

(4) When content[5] is scanned, "Hong Shan Jiedao Shuang" is neither a word in the vocabulary nor a prefix of a word. The longest word found so far, "Hong Shan Jiedao", can therefore be cut off.
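The scan in steps (1) to (4) can be sketched as follows. This is a minimal sketch, not the patent's implementation: the prefix set, the function names and the single-character fallback for unknown text are assumptions, and the Chinese strings in the usage example are a back-translation of the example above.

```python
from typing import List, Set


def build_prefixes(vocabulary: Set[str]) -> Set[str]:
    """Every leading substring of every word, so a scan knows when it may still grow."""
    return {word[:i] for word in vocabulary for i in range(1, len(word) + 1)}


def forward_max_match(text: str, vocabulary: Set[str]) -> List[str]:
    """Left-to-right maximum matching with prefix look-ahead."""
    prefixes = build_prefixes(vocabulary)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        longest = i + 1  # assumption: an unmatched character becomes its own token
        j = i + 1
        # Extend while the candidate is still a word or a prefix of a word.
        while j <= len(text) and text[i:j] in prefixes:
            if text[i:j] in vocabulary:
                longest = j  # remember the longest complete word seen so far
            j += 1
        tokens.append(text[i:longest])
        i = longest
    return tokens
```

For example, with the vocabulary `{"长沙", "开福区", "洪山", "洪山街道"}` the call `forward_max_match("洪山街道双河社区", ...)` cuts off "洪山街道" first rather than stopping at "洪山", and the remaining characters, which are not in this toy vocabulary, come out one by one.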

It can be seen that a maximum match can only be concluded once the next scanned string is neither a word in the vocabulary nor a prefix of a word. The forward maximum matching algorithm continues to loop in this way until the rest of the text is segmented. For example, the final segmentation result for the address "Changsha, Kaifu District, Hong Shan Street, Shuang He Community, Fu Yuan West Road No. 199, Ten Thousand State City Contemporary Phase 3, 10, Unit 2, 1706" is as follows:

Enterprise-type account name data is segmented using the bidirectional maximum matching Chinese word segmentation method. This method first performs forward maximum matching and reverse maximum matching segmentation separately, then compares the two results and applies a different strategy depending on the comparison. For example, one of the two results can be chosen for output on the principle that more coarse-grained words are better, and fewer out-of-dictionary words and fewer single-character words are better.

The forward maximum matching algorithm has been described above. The reverse maximum matching algorithm is similar; the difference is the scanning direction: substrings are taken for matching from right to left. The algorithm can be described as:

(1) Input the preprocessed sentence content to be segmented, and initialize index = content.length;

(2) Obtain the word length of each sub-dictionary in the dictionary database;

(3) Obtain the length of the text still to be segmented and compare it with the longest word length in the dictionary database. If the longest word length is greater than the length of the remaining text, take the length of the remaining text as the maximum length; otherwise segment with the longest word length as the maximum length;

(4) Use binary search to find the sub-dictionary whose word length equals the current maximum matching length. If it is found, go to (5); otherwise decrease the maximum length by one and go to (4);

(5) Obtain the candidate string SubStr and look it up in the dictionary. If it is found, add it to List. If it is not found, judge whether the length of SubStr is greater than 1: if so, delete the last character of SubStr and go to (5); otherwise set the segmentation mark and go to (6);

(6) Judge from index whether unsegmented text remains. If it does, go to (3); otherwise save List and exit.
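A compact sketch of reverse maximum matching is given below. It follows the common textbook form, in which the candidate is the rightmost window and is shortened from its left end until it matches; it replaces the length-indexed sub-dictionaries and binary search of steps (2) to (4) with a plain set lookup. These simplifications and the function name are assumptions, not details confirmed by the text.

```python
from typing import List, Set


def reverse_max_match(text: str, vocabulary: Set[str]) -> List[str]:
    """Right-to-left maximum matching over a flat word set."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens: List[str] = []
    end = len(text)  # plays the role of `index` in the steps above
    while end > 0:
        size = min(max_len, end)  # never look past the start of the text
        # Shrink the window until it is a dictionary word or a single character.
        while size > 1 and text[end - size:end] not in vocabulary:
            size -= 1
        tokens.append(text[end - size:end])
        end -= size
    tokens.reverse()  # tokens were collected from right to left
    return tokens
```

With the toy vocabulary `{"ab", "abc", "cd", "d"}`, the input `"abcd"` is split as `["ab", "cd"]`, whereas a forward scan would take `"abc"` first; this difference is what the bidirectional method exploits.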

The bidirectional maximum matching method combines the forward and reverse matching algorithms. The string to be segmented is first segmented separately with the forward maximum matching and reverse maximum matching algorithms, and the two results are compared. When the results in both directions agree, that result is returned; when they differ, the result with fewer words is returned; when the word counts are equal, the reverse result is returned. The steps of the bidirectional maximum matching Chinese word segmentation algorithm are as follows:

(1) Input the sentence content to be segmented;

(2) After preprocessing content, segment it with the forward maximum matching algorithm and with the reverse maximum matching algorithm, and compare the two results. If the results are identical, go to (3); if they differ, go to (4);

(3) Select either result, output it, and end the algorithm;

(4) Compare the numbers of words in the two results. If they are equal, choose the reverse result, output it, and end the algorithm; otherwise output the result with fewer words, and end the algorithm.
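The selection logic of steps (1) to (4) can be sketched as below. To keep the block self-contained it includes simple window-shrinking versions of forward and reverse maximum matching; these helpers, the function names and the toy vocabularies in the usage note are assumptions for illustration.

```python
from typing import List, Set


def _forward(text: str, vocabulary: Set[str], max_len: int) -> List[str]:
    tokens, start = [], 0
    while start < len(text):
        size = min(max_len, len(text) - start)
        # Shrink from the right until a word (or one character) remains.
        while size > 1 and text[start:start + size] not in vocabulary:
            size -= 1
        tokens.append(text[start:start + size])
        start += size
    return tokens


def _reverse(text: str, vocabulary: Set[str], max_len: int) -> List[str]:
    tokens, end = [], len(text)
    while end > 0:
        size = min(max_len, end)
        # Shrink from the left until a word (or one character) remains.
        while size > 1 and text[end - size:end] not in vocabulary:
            size -= 1
        tokens.append(text[end - size:end])
        end -= size
    return tokens[::-1]


def bidirectional_max_match(text: str, vocabulary: Set[str]) -> List[str]:
    """Run both directions and pick a result by the rules in steps (2) to (4)."""
    max_len = max((len(word) for word in vocabulary), default=1)
    fwd = _forward(text, vocabulary, max_len)
    rev = _reverse(text, vocabulary, max_len)
    if fwd == rev:
        return fwd  # identical results: either one will do
    if len(fwd) == len(rev):
        return rev  # same word count: prefer the reverse result
    return fwd if len(fwd) < len(rev) else rev  # otherwise fewer words wins
```

For `"abcd"` with vocabulary `{"a", "ab", "bcd"}` the forward pass yields three tokens and the reverse pass two, so the reverse result `["a", "bcd"]` is returned; with `{"abc", "cd"}` the forward result `["abc", "d"]` has fewer tokens and is returned instead.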

S103: Determine the desensitization method for the target data according to the type of the target data and the length of the target data, and use that desensitization method to desensitize the sensitive data obtained after the target data is segmented.

Referring to Fig. 10, when the type of the target data is electricity-consumption address, S103 is executed as follows:

S201: Judge whether the length of the target data is greater than the second preset value; if so, execute S202; if not, execute S203;

S202: Determine that the desensitization method for the target data is the first electricity-consumption address data desensitization method;

S204: Using the first electricity-consumption address data desensitization method, extract from the segmentation result of the target data the last 5 characters of the door number data and the province, city and district/county data, and obtain the remaining data;

S205: Retain the last 5 characters of the door number and the province, city and district/county data, and mask the remaining data of the target data, to obtain the desensitized target data;

S203: Determine that the desensitization method for the target data is the second electricity-consumption address data desensitization method;

S206: Using the second electricity-consumption address data desensitization method, extract the retained part of the target data by the first stepped retention rule according to the length of the target data, and mask the remaining part of the target data, to obtain the desensitized target data.

For example, electricity-consumption address data with a length of 10 characters or fewer is desensitized by the second electricity-consumption address data desensitization method, which retains characters in steps by length: for a length of 5 characters or fewer, the 1st character and the last 2 characters are retained; for a length of 6 to 9 characters, the last 5 characters are retained.

Electricity-consumption address data longer than 10 characters is desensitized by the first electricity-consumption address data desensitization method. An electricity-consumption address generally consists of a province, city, district/county, street or township, neighbourhood committee or village, road, community and door number. For the door number part, the last 5 characters are retained; the province, city and district/county are retained; all other parts are replaced with *. For example:

Shandong Province, Jinan City, Shizhong District, Mountains-and-Rivers Street, Overline Bridge North Neighbourhood Committee, Latitude Three Road, Shandong Ankang Garden Community 2-1-101 -> Shandong Province, Jinan City, Shizhong District **********************1-101.
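The first electricity-consumption address desensitization method operates on segmented, categorized tokens rather than on raw character positions. The sketch below assumes that segmentation has already produced `(text, label)` pairs; the label names, the function name and the Latin-letter sample address are hypothetical stand-ins for the Chinese data.

```python
from typing import Iterable, Tuple

KEEP_WHOLE = {"province", "city", "district"}  # assumed labels for retained parts


def desensitize_long_address(
    tokens: Iterable[Tuple[str, str]], mask: str = "*", door_keep: int = 5
) -> str:
    """Keep province/city/district and the last 5 door-number characters; mask the rest."""
    parts = []
    for text, label in tokens:
        if label in KEEP_WHOLE:
            parts.append(text)
        elif label == "door_number":
            hidden = max(len(text) - door_keep, 0)
            parts.append(mask * hidden + text[hidden:])
        else:
            # Streets, committees, roads, communities and so on are fully masked.
            parts.append(mask * len(text))
    return "".join(parts)
```

Masking character by character keeps the overall length unchanged, in line with the mask desensitization approach described in the background.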

Referring to Fig. 11, when the type of the target data is enterprise-type account name, S103 is executed as follows:

S301: Judge whether the length of the target data is greater than the third preset value; if so, execute S302; if not, execute S303;

S302: Determine that the desensitization method for the target data is the first enterprise-type account name data desensitization method;

S304: Using the first enterprise-type account name data desensitization method, extract from the segmentation result of the target data the first character of the trade name data and the last character of the industry data, and obtain the remaining data of the trade name data and the remaining data of the industry data;

S305: Mask the remaining data of the trade name data and the remaining data of the industry data, and retain the other data of the target data, to obtain the desensitized target data;

S303: Determine that the desensitization method for the target data is the second enterprise-type account name data desensitization method;

S306: Using the second enterprise-type account name data desensitization method, extract the retained part of the target data by the second stepped retention rule according to the length of the target data, and mask the remaining part of the target data, to obtain the desensitized target data.

For example, enterprise-class account name data with a length of 6 characters or fewer is desensitized with the second enterprise-class account name data desensitization method, which retains characters by length tier: for a length of 4 characters or fewer, 1 character is retained at each of the beginning and the end; for a length of 5-6 characters, 2 characters are retained at each of the beginning and the end.
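The short-name tiers can be sketched in the same way. The function name `mask_short_account_name` is an assumption, and the choice to return very short names unchanged (when the retained head and tail already cover the whole string) is this sketch's own reading of the rule.

```python
def mask_short_account_name(name: str, mask: str = "*") -> str:
    """Tiered head/tail retention for short account names (hypothetical helper)."""
    n = len(name)
    if n > 6:
        raise ValueError("names longer than 6 characters use the other method")
    keep = 1 if n <= 4 else 2  # characters kept at each end
    if n <= 2 * keep:
        return name  # head and tail overlap, nothing is left to mask
    return name[:keep] + mask * (n - 2 * keep) + name[-keep:]
```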

Enterprise-class account name data with a length of more than 6 characters is desensitized with the first enterprise-class account name data desensitization method. An enterprise-class account name generally consists of four parts: region, trade name, industry, and company organization form. The leading region part and the trailing organization part are kept unchanged, and the trade name and industry are masked. In the trade name part, the first character is retained and all other characters are replaced with *; in the industry part, the last character is retained and all other characters are replaced with *. For example：

Qingdao Hui Feng Motor Manufacturing Co., Ltd. -> Qingdao Hui **** [final character of "Motor Manufacturing"] Co., Ltd.；

Qingdao 2020 Business Services Co., Ltd. -> Qingdao [first character of "2020"] ****** [final character of "Business Services"] Co., Ltd.
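A minimal sketch of the four-part rule, assuming the segmentation step has already split the name into region, trade name, industry, and organization. The function name and argument names are hypothetical, and the Chinese strings in the usage line are an assumed rendering of the second example above, used only to show per-character masking.

```python
def mask_long_account_name(region, trade_name, industry, organization, mask="*"):
    """Keep region and organization; keep only the first character of the
    trade name and only the last character of the industry."""
    masked_trade = trade_name[:1] + mask * max(len(trade_name) - 1, 0)
    masked_industry = mask * max(len(industry) - 1, 0) + industry[-1:]
    return region + masked_trade + masked_industry + organization


# Assumed rendering of "Qingdao 2020 Business Services Co., Ltd."
print(mask_long_account_name("青岛", "二零二零", "商务服务", "有限公司"))
```

With a four-character trade name and a four-character industry, three characters of each are masked, which matches the six asterisks in the second example.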

The data desensitization processing method disclosed in this embodiment calls the word segmentation benchmark dictionary to segment the target data before desensitization, obtaining data with a definite structure, and then desensitizes the parts that carry sensitive source information. Masking all or most of the sensitive information improves the effectiveness of data desensitization. The corresponding sub-dictionary in the word segmentation benchmark dictionary is called according to the type of the target data, and segmentation uses the word segmentation method corresponding to that type, which improves segmentation accuracy. The desensitization method of the target data is determined according to its type and length, so that data of different types and lengths are desensitized differently, which further improves the effectiveness of data desensitization.

Referring to Fig. 12, this embodiment discloses another data desensitization processing method, which specifically includes the following steps：

S401：Determine the type of target data；

S402：Call the corresponding sub-dictionary in the word segmentation benchmark dictionary according to the type of the target data, and perform word segmentation using the word segmentation method corresponding to the type of the target data；

S403：Calculate the accuracy of the word segmentation result of the target data；

S404：Judge whether the accuracy of the word segmentation result of the target data is greater than a first preset value; if so, execute S405; if not, execute S406；

S405：Determine the desensitization method of the target data according to the type of the target data and the length of the target data, and use the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data；

S406：Segment the target data based on a hidden Markov model, and execute S405.
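The control flow of S402 to S406 can be sketched as below. All names are hypothetical: `dictionary_segmenter`, `hmm_segmenter`, and `accuracy_of` stand in for components that this passage names but does not specify, in particular how the accuracy of a segmentation result is computed.

```python
from typing import Callable, List

Segmenter = Callable[[str], List[str]]


def segment_with_fallback(
    text: str,
    dictionary_segmenter: Segmenter,
    hmm_segmenter: Segmenter,
    accuracy_of: Callable[[str, List[str]], float],
    first_preset_value: float,
) -> List[str]:
    """Return the word list that is handed on to the desensitization step (S405)."""
    words = dictionary_segmenter(text)  # S402: dictionary-based segmentation
    if accuracy_of(text, words) > first_preset_value:  # S403 / S404
        return words
    return hmm_segmenter(text)  # S406: fall back to the hidden Markov model
```

Passing the segmenters in as callables keeps the sketch independent of any particular dictionary or model.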

A hidden Markov model (HMM) is used to perform Chinese word segmentation on the two data types, enterprise-class account names and electricity consumption addresses. When the training corpus is sufficiently large and covers enough domains, the HMM algorithm can achieve high segmentation accuracy. This kind of segmentation algorithm models Chinese on the basis of manually annotated parts of speech and statistical features, and trains the model by estimating its parameters from the observed data (the annotated corpus). In the segmentation stage, the model is used to calculate the probability of each candidate segmentation, and the segmentation with the highest probability is taken as the final result. The HMM is one of the commonly used sequence labeling models; it handles ambiguity and out-of-vocabulary words well and performs better than string-matching methods.

A hidden Markov model is a doubly stochastic process: the specific state sequence is unknown, and only the state transition probabilities are known. That is, the state transition process of the model is not observable (hidden), and the stochastic process of the observable events is a random function of the hidden state transition process.

An HMM consists of：

The number of states in the model, N；

The number of distinct symbols that each state may output, M；

The state transition probability matrix A = {a_ij}, where a_ij is the probability of transferring from state C_i to state C_j；

The probability distribution matrix for observing a particular symbol O_k from state C_j, B = {b_j(k)}; the probability of observing a symbol is also called the symbol emission probability；

The initial state probability distribution π = {π_i}.

Generally, an HMM is written as a five-tuple μ = (C, O, A, B, π), where C is the set of states, O is the set of output symbols, and π, A, and B are the initial state probability distribution, the state transition probabilities, and the symbol emission probabilities, respectively.

For Chinese word segmentation, the HMM is trained on a corpus. Using the classical character tagging model, the set C of four tag classes is C = {B, E, M, S}, with the following meanings：

B：The beginning of a word

E：The end of a word

M：The middle of a word

S：A single character that forms a word by itself

After the corpus is tagged with the four tag classes, an HMM can be built by statistical methods, under the assumption that the tag of each character is influenced only by the tag of the previous character. The state transition matrix A and the symbol emission probabilities B of the HMM are then estimated, where：
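The estimation formulas do not survive in this text. As an assumption rather than the patent's exact expressions, the usual relative-frequency estimates written with the symbols C, O, and Count explained in the next paragraph would be:

```latex
A_{ij} = P(C_j \mid C_i) = \frac{\mathrm{Count}(C_i, C_j)}{\mathrm{Count}(C_i)},
\qquad
B_{ij} = P(O_j \mid C_i) = \frac{\mathrm{Count}(O_j, C_i)}{\mathrm{Count}(C_i)}
```

Here Count(C_i, C_j) is the number of times tag C_j immediately follows tag C_i in the tagged corpus, and Count(O_j, C_i) is the number of times character O_j carries tag C_i.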

C={ B, E, M, S } in formula, O={ character set }, Count represents frequency.Calculating B_ijWhen, due to data Sparsity, many characters do not appear in training set, this causes probability to be that 0 result appears in B, is asked to repair this Topic, using the data smoothing technology for adding 1, i.e.,：

The initial vector is set to π = {0.5, 0.0, 0.0, 0.5}, because M and E cannot appear at the start of a sentence. At this point the HMM is fully built. Based on this HMM, the Viterbi algorithm is used to obtain, for an observation sequence, a hidden sequence over {B, E, M, S}.
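The training procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the corpus is assumed to be given as lists of non-empty, already-segmented words, and the emission smoothing uses the Laplace form with the vocabulary size in the denominator, which may differ from the patent's formula.

```python
from collections import Counter

STATES = "BEMS"  # begin, end, middle, single-character word


def tag_word(word):
    """Return the B/M/E/S tag string for one non-empty segmented word."""
    return "S" if len(word) == 1 else "B" + "M" * (len(word) - 2) + "E"


def train_hmm(segmented_sentences):
    """Estimate (pi, transition, emission) from sentences given as word lists."""
    trans, emit, state_count, vocab = Counter(), Counter(), Counter(), set()
    for words in segmented_sentences:
        chars = "".join(words)
        tags = "".join(tag_word(w) for w in words)
        for ch, tag in zip(chars, tags):
            emit[(tag, ch)] += 1
            state_count[tag] += 1
            vocab.add(ch)
        for prev, cur in zip(tags, tags[1:]):
            trans[(prev, cur)] += 1
    outgoing = Counter()
    for (prev, _), n in trans.items():
        outgoing[prev] += n
    # Transition probabilities: relative frequency of tag bigrams.
    transition = {
        p: {c: trans[(p, c)] / outgoing[p] if outgoing[p] else 0.0 for c in STATES}
        for p in STATES
    }
    # Emission probabilities with add-one smoothing over the seen vocabulary.
    emission = {
        s: {ch: (emit[(s, ch)] + 1) / (state_count[s] + len(vocab)) for ch in vocab}
        for s in STATES
    }
    # Initial vector as stated in the text: a sentence starts with B or S.
    pi = {"B": 0.5, "E": 0.0, "M": 0.0, "S": 0.5}
    return pi, transition, emission
```

The sketch only smooths characters seen somewhere in the training set; handling characters that never appear in training would need an extra convention that the text does not describe.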

The Viterbi search algorithm is：

1. Initialization：δ_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N,

The path variable of maximum probability：

2. Recursive calculation：

3. Recording the backtracking path：

4. Termination：

The path (state sequence) is obtained by backtracking：
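Most of the step formulas are missing from this text. For orientation, the textbook Viterbi recursions are given below in the notation of the initialization step; the symbols ψ and q̂ are assumed here and may not match the patent's original figures.

```latex
\begin{aligned}
\text{1. Initialization:}\quad & \delta_1(i) = \pi_i\, b_i(O_1), \quad \psi_1(i) = 0, \quad 1 \le i \le N \\
\text{2. Recursion:}\quad & \delta_t(j) = \max_{1 \le i \le N}\bigl[\delta_{t-1}(i)\, a_{ij}\bigr]\, b_j(O_t), \quad 2 \le t \le T,\ 1 \le j \le N \\
\text{3. Backtracking record:}\quad & \psi_t(j) = \arg\max_{1 \le i \le N}\bigl[\delta_{t-1}(i)\, a_{ij}\bigr] \\
\text{4. Termination:}\quad & \hat{q}_T = \arg\max_{1 \le i \le N} \delta_T(i), \quad \hat{P} = \max_{1 \le i \le N} \delta_T(i) \\
\text{Path by backtracking:}\quad & \hat{q}_t = \psi_{t+1}(\hat{q}_{t+1}), \quad t = T-1, \dots, 1
\end{aligned}
```

Each of the T time steps evaluates N predecessor states for each of N current states, which gives the O(N²T) cost.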

The time complexity of the Viterbi algorithm is O(N²T). For example, for the address "Changsha City, Kaifu District, Hongshan Street, Shuanghe Community, Fuyuan West Road No. 199, Wanguocheng Dangdai Phase III, Building 10, Unit 2, 1706", the output state sequence is：

“BMEBMEBMMEBMMEBMMEBMMEBMMMMMEBMEBMEBMME”

According to this state sequence, the Chinese word segmentation is：

The final Chinese word segmentation result is as follows：
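The decoding step and the conversion from a state sequence to words can be sketched together. This is an illustrative sketch, not the patent's code: the function names are hypothetical, probabilities are multiplied directly (a practical implementation would normally work with log-probabilities to avoid underflow), and `trans` and `emit` are assumed to be nested dictionaries keyed by state.

```python
def viterbi(obs, states, pi, trans, emit):
    """Return the most probable state string for the observed characters."""
    if not obs:
        return ""
    # delta[t][s]: best probability of any path that ends in state s at time t
    delta = [{s: pi[s] * emit[s].get(obs[0], 0.0) for s in states}]
    psi = [{}]  # psi[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in states:
            best_i = max(states, key=lambda i: delta[t - 1][i] * trans[i][j])
            psi[t][j] = best_i
            delta[t][j] = (
                delta[t - 1][best_i] * trans[best_i][j] * emit[j].get(obs[t], 0.0)
            )
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):  # follow the recorded predecessors
        path.append(psi[t][path[-1]])
    return "".join(reversed(path))


def states_to_words(text, tags):
    """Cut text after every character tagged E (word end) or S (single)."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])  # keep a trailing unfinished word
    return words
```

Applied to the 39-state sequence quoted above, `states_to_words` cuts the address into ten words of 3, 3, 4, 4, 4, 4, 7, 3, 3, and 4 characters.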

The data desensitization processing method disclosed in this embodiment first segments the target data with the forward maximum matching method or the bidirectional maximum matching Chinese word segmentation method, whose algorithmic complexity is low, which ensures the processing speed of word segmentation. The accuracy of the word segmentation result is then calculated; when the accuracy is below the threshold, the target data is segmented with the hidden Markov model, which has higher algorithmic complexity but also higher segmentation accuracy, which ensures the accuracy of the word segmentation result.
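For reference, forward maximum matching, the cheaper first-pass method named above, can be sketched as follows. The function name, the `max_word_len` parameter, and the single-character fallback for text not covered by the dictionary are assumptions of this sketch; the patent's sub-dictionaries are represented here by a plain set of strings.

```python
def forward_maximum_matching(text, dictionary, max_word_len=None):
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at the current position, else emit a single character."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

Because it only scans the text once with bounded lookahead, this method is fast, which is consistent with using it before the more expensive HMM pass.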

Based on the data desensitization processing method disclosed in the above embodiments, and referring to Fig. 13, this embodiment correspondingly discloses a data desensitization processing device, including：

A type determining unit 501, configured to determine the type of target data；

A first word segmentation processing unit 502, configured to call the corresponding sub-dictionary in the word segmentation benchmark dictionary according to the type of the target data, and to perform word segmentation using the word segmentation method corresponding to the type of the target data；

A desensitization processing unit 503, configured to determine the desensitization method of the target data according to the type of the target data and the length of the target data, and to use the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data.

Optionally, the device further includes：

Optionally, when the type of the target data is an electricity consumption address, the first word segmentation processing unit 502 is specifically configured to：

Optionally, when the type of the target data is an enterprise-class account name, the first word segmentation processing unit 502 is specifically configured to：

Optionally, the device further includes：

If so, trigger the desensitization processing unit；

Optionally, when the type of the target data is an electricity consumption address, the desensitization processing unit 503 includes：

Optionally, when the type of the target data is an enterprise-class account name, the desensitization processing unit 503 includes：

The data desensitization processing device disclosed in this embodiment calls the word segmentation benchmark dictionary to segment the target data before desensitization, obtaining data with a definite structure, and then desensitizes the parts that carry sensitive source information. Masking all or most of the sensitive information improves the effectiveness of data desensitization. The corresponding sub-dictionary in the word segmentation benchmark dictionary is called according to the type of the target data, and segmentation uses the word segmentation method corresponding to that type, which improves segmentation accuracy. The desensitization method of the target data is determined according to its type and length, so that data of different types and lengths are desensitized differently, which further improves the effectiveness of data desensitization.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data desensitization processing method, characterized by comprising：

Determining the type of target data；

Calling the corresponding sub-dictionary in a word segmentation benchmark dictionary according to the type of the target data, and performing word segmentation using the word segmentation method corresponding to the type of the target data；

Determining the desensitization method of the target data according to the type of the target data and the length of the target data, and using the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data.

2. The method according to claim 1, characterized in that the method further comprises：

Building the word segmentation benchmark dictionary, wherein the word segmentation benchmark dictionary includes multiple sub-dictionaries, and each sub-dictionary includes one type of sensitive word.

3. The method according to claim 1, characterized in that, when the type of the target data is an electricity consumption address, the calling of the corresponding sub-dictionary in the word segmentation benchmark dictionary according to the type of the target data and the performing of word segmentation using the word segmentation method corresponding to the type of the target data comprise：

Calling the general address sub-dictionary, the name dictionary sub-dictionary, and the residential community name and administrative division set sub-dictionary, and segmenting the target data using the forward maximum matching Chinese word segmentation method.

4. The method according to claim 1, characterized in that, when the type of the target data is an enterprise-class account name, the calling of the corresponding sub-dictionary in the word segmentation benchmark dictionary according to the type of the target data and the performing of word segmentation using the word segmentation method corresponding to the type of the target data comprise：

Calling the region set sub-dictionary, the industry set sub-dictionary, and the company organization set sub-dictionary, and performing word segmentation using the bidirectional maximum matching Chinese word segmentation method.

5. The method according to claim 1, characterized in that, before the determining of the desensitization method of the target data according to the type of the target data and the length of the target data, the method further comprises：

Calculating the accuracy of the word segmentation result of the target data；

If so, executing the determining of the desensitization method of the target data according to the type of the target data and the length of the target data；

If not, segmenting the target data based on a hidden Markov model, and executing the determining of the desensitization method of the target data according to the type of the target data and the length of the target data.

6. The method according to claim 1, characterized in that, when the type of the target data is an electricity consumption address, the determining of the desensitization method of the target data according to the type of the target data and the length of the target data, and the using of the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data, comprise：

Using the first electricity consumption address data desensitization method, extracting the last 5 characters of the house number and the province, city, and district data from the word segmentation result of the target data, to obtain the remaining data；

Retaining the last 5 characters of the house number and the province, city, and district data, and masking the remaining data of the target data, to obtain the desensitized target data；

When the length of the target data is not greater than the second preset value, determining that the desensitization method of the target data is the second electricity consumption address data desensitization method；

Using the second electricity consumption address data desensitization method, extracting the retained portion of the target data according to the first length-tiered retention rule, and masking the remainder of the target data, to obtain the desensitized target data.

7. The method according to claim 1, characterized in that, when the type of the target data is an enterprise-class account name, the determining of the desensitization method of the target data according to the type of the target data and the length of the target data, and the using of the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data, comprise：

Judging whether the length of the target data is greater than a third preset value；

Using the first enterprise-class account name data desensitization method, extracting the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, to obtain the remaining data of the trade name data and the remaining data of the industry data；

Masking the remaining data of the trade name data and the remaining data of the industry data, and retaining the other data of the target data, to obtain the desensitized target data；

When the length of the target data is not greater than the third preset value, determining that the desensitization method of the target data is the second enterprise-class account name data desensitization method；

Using the second enterprise-class account name data desensitization method, extracting the retained portion of the target data according to the second length-tiered retention rule, and masking the remainder of the target data, to obtain the desensitized target data.

8. A data desensitization processing device, characterized by comprising：

A type determining unit, configured to determine the type of target data；

A first word segmentation processing unit, configured to call the corresponding sub-dictionary in a word segmentation benchmark dictionary according to the type of the target data, and to perform word segmentation using the word segmentation method corresponding to the type of the target data；

A desensitization processing unit, configured to determine the desensitization method of the target data according to the type of the target data and the length of the target data, and to use the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data.

9. The device according to claim 8, characterized in that the device further comprises：

A dictionary building unit, configured to build the word segmentation benchmark dictionary, wherein the word segmentation benchmark dictionary includes multiple sub-dictionaries, and each sub-dictionary includes one type of sensitive word.

10. The device according to claim 8, characterized in that, when the type of the target data is an electricity consumption address, the first word segmentation processing unit is specifically configured to：

11. The device according to claim 8, characterized in that, when the type of the target data is an enterprise-class account name, the first word segmentation processing unit is specifically configured to：

12. The device according to claim 8, characterized in that the device further comprises：

If so, triggering the desensitization processing unit；

If not, triggering a second word segmentation processing unit, the second word segmentation processing unit being configured to segment the target data based on a hidden Markov model and to trigger the desensitization processing unit.

13. The device according to claim 8, characterized in that, when the type of the target data is an electricity consumption address, the desensitization processing unit comprises：

A first determining subunit, configured to determine, when the length of the target data is greater than the second preset value, that the desensitization method of the target data is the first electricity consumption address data desensitization method；

A first extraction subunit, configured to use the first electricity consumption address data desensitization method to extract the last 5 characters of the house number and the province, city, and district data from the word segmentation result of the target data, to obtain the remaining data；

A first desensitization processing subunit, configured to retain the last 5 characters of the house number and the province, city, and district data, and to mask the remaining data of the target data, to obtain the desensitized target data；

A second determining subunit, configured to determine, when the length of the target data is not greater than the second preset value, that the desensitization method of the target data is the second electricity consumption address data desensitization method；

A second desensitization processing subunit, configured to use the second electricity consumption address data desensitization method to extract the retained portion of the target data according to the first length-tiered retention rule, and to mask the remainder of the target data, to obtain the desensitized target data.

14. The device according to claim 8, characterized in that, when the type of the target data is an enterprise-class account name, the desensitization processing unit comprises：

A third determining subunit, configured to determine, when the length of the target data is greater than the third preset value, that the desensitization method of the target data is the first enterprise-class account name data desensitization method；

A second extraction subunit, configured to use the first enterprise-class account name data desensitization method to extract the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, to obtain the remaining data of the trade name data and the remaining data of the industry data；

A third desensitization processing subunit, configured to mask the remaining data of the trade name data and the remaining data of the industry data, and to retain the other data of the target data, to obtain the desensitized target data；

A fourth determining subunit, configured to determine, when the length of the target data is not greater than the third preset value, that the desensitization method of the target data is the second enterprise-class account name data desensitization method；

A fourth desensitization processing subunit, configured to use the second enterprise-class account name data desensitization method to extract the retained portion of the target data according to the second length-tiered retention rule, and to mask the remainder of the target data, to obtain the desensitized target data.