CN108776762A - Processing method and device for data desensitization
- Publication number: CN108776762A (CN201810586230.9A)
- Authority: CN (China)
- Prior art keywords: data, target data, desensitization, dictionary, type
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
This application provides a processing method and device for data desensitization. The method determines the type of target data; calls the corresponding sub-dictionary in a word segmentation reference dictionary according to the type of the target data, and segments the target data using the segmentation method corresponding to that type; and determines the desensitization method for the target data according to its type and length, then uses that method to desensitize the sensitive data obtained after segmentation. Segmenting the target data yields data with a certain structure, so that the parts containing sensitive information can be desensitized and all or most of the sensitive information is masked. This improves the effectiveness of data desensitization, safeguards data assets, protects customer information to the greatest extent, and avoids leaks of customer information through improper querying, export and similar channels.
Description
Technical field
The present invention relates to the technical field of data processing, and more particularly to a processing method and device for data desensitization.
Background art
To implement the requirements of the national Cybersecurity Law on protecting sensitive customer information, to safeguard the data assets of power marketing customers, and to protect their legitimate rights and interests, the sensitive information of power marketing customers needs to be desensitized. The aim is to protect electricity customer information to the greatest extent while still meeting normal business needs, and to avoid leaks of that information through improper querying, export and similar channels.
At present, power marketing data is desensitized mainly by mask desensitization, which retains part of the information and keeps its length unchanged. The main rules are as follows:
(1) Contact addresses
Format: not fixed; a character string of arbitrary length.
Desensitization rule: retention is tiered by length. For a length of 5 characters or fewer, the 1st character and the last 2 characters are retained; for a length of 6 to 9 characters, the last 5 characters are retained; for a length of 10 characters or more, the 4 characters before the last 5 characters are hidden. Hidden characters are replaced with *.
(2) Enterprise account names
Format: the enterprise account name is the company name as it appears on the business license, and consists of several Chinese characters.
Desensitization rule: retention is tiered by length. For a length of 4 characters or fewer, 1 character is retained at each of the head and tail; for a length of 5 to 6 characters, 2 characters are retained at each of the head and tail; for an odd length of 7 characters or more, the middle 3 characters are hidden; for an even length of 8 characters or more, the middle 4 characters are hidden. Hidden characters are replaced with *.
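The two existing tiered rules can be sketched in Python as below. This is a minimal illustration rather than the utility's actual implementation: the function names are hypothetical, each Python character stands for one character of the address or name, and inputs too short to have anything to hide are returned unchanged, which is an assumption the rules do not spell out.

```python
def mask_address_tiered(text: str, mask: str = "*") -> str:
    """Existing rule (1) for contact addresses, tiered by length."""
    n = len(text)
    if n <= 3:
        # Assumption: the 1st character plus the last 2 already cover the string.
        return text
    if n <= 5:
        # Keep the 1st character and the last 2 characters.
        return text[0] + mask * (n - 3) + text[-2:]
    if n <= 9:
        # Keep only the last 5 characters.
        return mask * (n - 5) + text[-5:]
    # 10 or more: hide the 4 characters that precede the last 5.
    return text[:-9] + mask * 4 + text[-5:]


def mask_enterprise_name_tiered(name: str, mask: str = "*") -> str:
    """Existing rule (2) for enterprise account names, tiered by length."""
    n = len(name)
    if n <= 2:
        # Assumption: head and tail characters already cover the string.
        return name
    if n <= 4:
        return name[0] + mask * (n - 2) + name[-1]
    if n <= 6:
        return name[:2] + mask * (n - 4) + name[-2:]
    hidden = 3 if n % 2 else 4          # odd length: 3, even length: 4
    start = (n - hidden) // 2           # centre the hidden run
    return name[:start] + mask * hidden + name[start + hidden:]
```

Because these rules look only at character positions, whatever happens to sit at the retained positions survives, regardless of how sensitive it is.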
The main defect of the existing power marketing data desensitization rules is as follows:
After electricity consumption addresses and enterprise account names are desensitized under the current rules, non-key words are masked while key words are retained. For example, under the rule for enterprise account names, the desensitized account name may still contain sensitive information, and some key words of an address are retained, so the desensitization has little effect. For example: Qingdao Hui Feng Motor Manufacturing Co., Ltd. -> Qingdao Hui Feng **** Co., Ltd.; Qingdao 2020 Commerce Services Co., Ltd. -> Qingdao Two ****** Business Co., Ltd.
The rule for contact addresses has a similar problem, for example: Shandong Province Jinan City Shizhong District Mountains and Rivers Street Overline Bridge North Neighbourhood Committee Latitude Three Road Shandong Ankang Garden Community 2-1-101 -> Shandong Province Jinan City Shizhong District Mountains and Rivers Street Overline Bridge North Neighbourhood Committee Latitude Three Road Shandong Ankang Garden ****1-101.
Summary of the invention
In view of this, the invention discloses a processing method and device for data desensitization, which segments the target data by calling a word segmentation reference dictionary before desensitization and thereby achieves more effective data desensitization.
To achieve the above object, the invention provides the following technical solutions:
A processing method for data desensitization, comprising:
determining the type of target data;
calling the corresponding sub-dictionary in a word segmentation reference dictionary according to the type of the target data, and segmenting the target data using the segmentation method corresponding to the type of the target data;
determining the desensitization method for the target data according to the type and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained after segmenting the target data.
Optionally, the method further comprises:
building the word segmentation reference dictionary, which comprises multiple sub-dictionaries, each sub-dictionary containing one type of sensitive word.
Optionally, when the type of the target data is an electricity consumption address, calling the corresponding sub-dictionary in the word segmentation reference dictionary according to the type of the target data, and segmenting using the segmentation method corresponding to the type of the target data, comprises:
calling the general address sub-dictionary, the place name sub-dictionary, the residential community name sub-dictionary and the administrative division set sub-dictionary, and segmenting the target data using the forward maximum matching Chinese word segmentation method.
Optionally, when the type of the target data is an enterprise account name, calling the corresponding sub-dictionary in the word segmentation reference dictionary according to the type of the target data, and segmenting using the segmentation method corresponding to the type of the target data, comprises:
calling the region set sub-dictionary, the industry set sub-dictionary and the company organization set sub-dictionary, and segmenting the target data using the bidirectional maximum matching Chinese word segmentation method.
Optionally, before determining the desensitization method for the target data according to the type and the length of the target data, the method further comprises:
calculating the accuracy of the segmentation result of the target data;
judging whether the accuracy of the segmentation result of the target data is greater than a first preset value;
if so, performing the step of determining the desensitization method for the target data according to the type and the length of the target data;
if not, segmenting the target data based on a hidden Markov model, and then performing the step of determining the desensitization method for the target data according to the type and the length of the target data.
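The accuracy check and fallback can be sketched as follows. The text does not say how accuracy is calculated, so `dictionary_hit_ratio` is a purely hypothetical stand-in (the share of characters covered by dictionary words), and the threshold of 0.8 is an illustrative value for the first preset value. Both segmenters are passed in as callables so the control flow stays independent of any particular hidden Markov model implementation.

```python
from typing import Callable, List, Set

Segmenter = Callable[[str], List[str]]


def dictionary_hit_ratio(tokens: List[str], vocabulary: Set[str]) -> float:
    """Hypothetical accuracy measure: share of characters in dictionary words."""
    total = sum(len(t) for t in tokens)
    if total == 0:
        return 0.0
    covered = sum(len(t) for t in tokens if t in vocabulary)
    return covered / total


def segment_with_fallback(
    text: str,
    vocabulary: Set[str],
    dictionary_segmenter: Segmenter,
    hmm_segmenter: Segmenter,
    first_preset_value: float = 0.8,  # illustrative threshold, not from the text
) -> List[str]:
    """Keep the dictionary-based result if it is accurate enough, else use the HMM."""
    tokens = dictionary_segmenter(text)
    if dictionary_hit_ratio(tokens, vocabulary) > first_preset_value:
        return tokens
    return hmm_segmenter(text)
```

Either way, the resulting tokens are what the desensitization step then works on.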
Optionally, when the type of the target data is an electricity consumption address, determining the desensitization method for the target data according to the type and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained after segmenting the target data, comprises:
judging whether the length of the target data is greater than a second preset value;
when the length of the target data is greater than the second preset value, determining that the desensitization method for the target data is a first electricity consumption address data desensitization method;
using the first electricity consumption address data desensitization method, extracting the last 5 characters of the house number data and the province, city and district data from the segmentation result of the target data, to obtain the remaining data;
retaining the last 5 characters of the house number data and the province, city and district data, and masking the remaining data of the target data, to obtain the desensitized target data;
when the length of the target data is not greater than the second preset value, determining that the desensitization method for the target data is a second electricity consumption address data desensitization method;
using the second electricity consumption address data desensitization method, extracting the retained part of the target data according to a first tiered retention rule based on the length of the target data, and masking the remaining part of the target data, to obtain the desensitized target data.
Optionally, when the type of the target data is an enterprise account name, determining the desensitization method for the target data according to the type and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained after segmenting the target data, comprises:
judging whether the length of the target data is greater than a third preset value;
when the length of the target data is greater than the third preset value, determining that the desensitization method for the target data is a first enterprise account name data desensitization method;
using the first enterprise account name data desensitization method, extracting the first character of the trade name data and the last character of the industry data from the segmentation result of the target data, to obtain the remaining trade name data and the remaining industry data;
masking the remaining trade name data and the remaining industry data, and retaining the other data of the target data, to obtain the desensitized target data;
when the length of the target data is not greater than the third preset value, determining that the desensitization method for the target data is a second enterprise account name data desensitization method;
using the second enterprise account name data desensitization method, extracting the retained part of the target data according to a second tiered retention rule based on the length of the target data, and masking the remaining part of the target data, to obtain the desensitized target data.
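The first enterprise account name data desensitization method can be illustrated with a short Python sketch. It assumes the segmentation step yields (text, label) pairs; the label names `trade_name` and `industry`, like the function name, are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, label); the label names are hypothetical


def mask_long_enterprise_name(tokens: List[Token], mask: str = "*") -> str:
    """Mask the trade name except its first character and the industry
    except its last character; keep every other segment unchanged."""
    parts = []
    for text, label in tokens:
        if label == "trade_name":
            parts.append(text[:1] + mask * (len(text) - 1))
        elif label == "industry":
            parts.append(mask * (len(text) - 1) + text[-1:])
        else:
            parts.append(text)  # region, company organization, etc.
    return "".join(parts)
```

Here the region and the company organization form remain readable, while the trade name and industry segments, the parts that identify the company, are largely hidden.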
A processing device for data desensitization, comprising:
a type determination unit, configured to determine the type of target data;
a first word segmentation processing unit, configured to call the corresponding sub-dictionary in the word segmentation reference dictionary according to the type of the target data, and to segment the target data using the segmentation method corresponding to the type of the target data;
a desensitization processing unit, configured to determine the desensitization method for the target data according to the type and the length of the target data, and to use that desensitization method to desensitize the sensitive data obtained after segmenting the target data.
Optionally, the device further comprises:
a dictionary construction unit, configured to build the word segmentation reference dictionary, which comprises multiple sub-dictionaries, each sub-dictionary containing one type of sensitive word.
Optionally, when the type of the target data is an electricity consumption address, the first word segmentation processing unit is specifically configured to:
call the general address sub-dictionary, the place name sub-dictionary, the residential community name sub-dictionary and the administrative division set sub-dictionary, and segment the target data using the forward maximum matching Chinese word segmentation method.
Optionally, when the type of the target data is an enterprise account name, the first word segmentation processing unit is specifically configured to:
call the region set sub-dictionary, the industry set sub-dictionary and the company organization set sub-dictionary, and segment the target data using the bidirectional maximum matching Chinese word segmentation method.
Optionally, the device further comprises:
a calculation unit, configured to calculate the accuracy of the segmentation result of the target data;
a judging unit, configured to judge whether the accuracy of the segmentation result of the target data is greater than the first preset value;
if so, to trigger the desensitization processing unit;
if not, to trigger a second word segmentation processing unit, which is configured to segment the target data based on a hidden Markov model and then trigger the desensitization processing unit.
Optionally, when the type of the target data is an electricity consumption address, the desensitization processing unit comprises:
a first judging subunit, configured to judge whether the length of the target data is greater than the second preset value;
a first determining subunit, configured to determine, when the length of the target data is greater than the second preset value, that the desensitization method for the target data is the first electricity consumption address data desensitization method;
a first extraction subunit, configured to use the first electricity consumption address data desensitization method to extract the last 5 characters of the house number data and the province, city and district data from the segmentation result of the target data, to obtain the remaining data;
a first desensitization processing subunit, configured to retain the last 5 characters of the house number data and the province, city and district data, and to mask the remaining data of the target data, to obtain the desensitized target data;
a second determining subunit, configured to determine, when the length of the target data is not greater than the second preset value, that the desensitization method for the target data is the second electricity consumption address data desensitization method;
a second desensitization processing subunit, configured to use the second electricity consumption address data desensitization method to extract the retained part of the target data according to the first tiered retention rule based on the length of the target data, and to mask the remaining part of the target data, to obtain the desensitized target data.
Optionally, when the type of the target data is an enterprise account name, the desensitization processing unit comprises:
a second judging subunit, configured to judge whether the length of the target data is greater than the third preset value;
a third determining subunit, configured to determine, when the length of the target data is greater than the third preset value, that the desensitization method for the target data is the first enterprise account name data desensitization method;
a second extraction subunit, configured to use the first enterprise account name data desensitization method to extract the first character of the trade name data and the last character of the industry data from the segmentation result of the target data, to obtain the remaining trade name data and the remaining industry data;
a third desensitization processing subunit, configured to mask the remaining trade name data and the remaining industry data, and to retain the other data of the target data, to obtain the desensitized target data;
a fourth determining subunit, configured to determine, when the length of the target data is not greater than the third preset value, that the desensitization method for the target data is the second enterprise account name data desensitization method;
a fourth desensitization processing subunit, configured to use the second enterprise account name data desensitization method to extract the retained part of the target data according to the second tiered retention rule based on the length of the target data, and to mask the remaining part of the target data, to obtain the desensitized target data.
Compared with the prior art, the invention has the following beneficial effects:
In the processing method and device for data desensitization provided by the invention, the target data is segmented by calling the word segmentation reference dictionary before desensitization, which yields data with a certain structure. The parts containing sensitive information are then desensitized, so that all or most of the sensitive information is masked and the effectiveness of data desensitization is improved. Calling the corresponding sub-dictionary in the word segmentation reference dictionary according to the type of the target data, and segmenting with the method corresponding to that type, improves segmentation accuracy. Determining the desensitization method according to the type and length of the target data allows data of different types and lengths to be desensitized differently, which further improves the effectiveness of data desensitization.
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.
Fig. 1 is a flow chart of a processing method for data desensitization disclosed in an embodiment of the invention;
Fig. 2 is a schematic diagram of the general address sub-dictionary disclosed in an embodiment of the invention;
Fig. 3 is a schematic diagram of the place name sub-dictionary disclosed in an embodiment of the invention;
Fig. 4 is a schematic diagram of the residential community name sub-dictionary disclosed in an embodiment of the invention;
Fig. 5 is a schematic diagram of the administrative division set sub-dictionary disclosed in an embodiment of the invention;
Fig. 6 is a schematic diagram of the region set sub-dictionary disclosed in an embodiment of the invention;
Fig. 7 is a schematic diagram of the industry set sub-dictionary disclosed in an embodiment of the invention;
Fig. 8 is a schematic diagram of the company organization set sub-dictionary disclosed in an embodiment of the invention;
Fig. 9 is a schematic diagram of the forward maximum matching Chinese word segmentation method disclosed in an embodiment of the invention;
Fig. 10 is a flow chart of the desensitization method for electricity consumption address data disclosed in an embodiment of the invention;
Fig. 11 is a flow chart of the desensitization method for enterprise account name data disclosed in an embodiment of the invention;
Fig. 12 is a flow chart of another processing method for data desensitization disclosed in an embodiment of the invention;
Fig. 13 is a schematic structural diagram of a processing device for data desensitization disclosed in an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
Referring to Fig. 1, this embodiment discloses a processing method for data desensitization, which specifically includes the following steps:
S101: Determine the type of target data;
The target data is the data that needs to be desensitized. The types of target data may include telephone data, address data, user name data, bank account data, and so on.
S102: Call the corresponding sub-dictionary in the word segmentation reference dictionary according to the type of the target data, and segment the target data using the segmentation method corresponding to the type of the target data;
Word segmentation cuts a sequence of Chinese characters into individual words. It is the process of recombining a continuous character sequence into a word sequence according to a given specification.
To segment the target data more accurately, the corresponding sub-dictionary in the word segmentation reference dictionary is called according to the type of the target data.
It should be noted that the processing method for data desensitization further includes:
building the word segmentation reference dictionary.
The word segmentation reference dictionary comprises multiple sub-dictionaries, each containing one type of sensitive word.
Figs. 2 to 8 respectively show the general address sub-dictionary, place name sub-dictionary, residential community name sub-dictionary, administrative division set sub-dictionary, region set sub-dictionary, industry set sub-dictionary and company organization set sub-dictionary in the word segmentation reference dictionary.
To segment the target data more accurately, the corresponding sub-dictionary in the word segmentation reference dictionary is called according to the type of the target data, and the segmentation method corresponding to that type is used. For example, when the type of the target data is an electricity consumption address, the general address sub-dictionary, place name sub-dictionary, residential community name sub-dictionary and administrative division set sub-dictionary are called, and the target data is segmented with the forward maximum matching Chinese word segmentation method. When the type of the target data is an enterprise account name, the region set sub-dictionary, industry set sub-dictionary and company organization set sub-dictionary are called, and the target data is segmented with the bidirectional maximum matching Chinese word segmentation method.
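One way to picture the reference dictionary and the type-based lookup is the Python sketch below. The key names, the sample entries and the `load_vocabulary` helper are all hypothetical; real sub-dictionaries would hold far more entries and would contain Chinese words.

```python
from typing import Dict, List, Set

# Hypothetical sample entries, one set per sub-dictionary (Figs. 2 to 8).
REFERENCE_DICTIONARY: Dict[str, Set[str]] = {
    "general_address": {"Road", "Street", "No."},
    "place_name": {"Hongshan"},
    "community_name": {"Shuanghe Community"},
    "administrative_division": {"Changsha", "Kaifu District"},
    "region": {"Qingdao"},
    "industry": {"Motor Manufacturing"},
    "company_organization": {"Co., Ltd."},
}

# Which sub-dictionaries each data type calls, as described in the text.
SUB_DICTIONARIES_BY_TYPE: Dict[str, List[str]] = {
    "electricity_address": [
        "general_address", "place_name",
        "community_name", "administrative_division",
    ],
    "enterprise_account_name": [
        "region", "industry", "company_organization",
    ],
}


def load_vocabulary(data_type: str) -> Set[str]:
    """Merge the sub-dictionaries that correspond to one data type."""
    vocabulary: Set[str] = set()
    for name in SUB_DICTIONARIES_BY_TYPE[data_type]:
        vocabulary |= REFERENCE_DICTIONARY[name]
    return vocabulary
```

Restricting the vocabulary to the relevant sub-dictionaries keeps, for instance, industry terms from influencing how an address is cut.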
As shown in Fig. 9, electricity consumption address data is segmented with the forward maximum matching Chinese word segmentation algorithm, which works as follows:
Consecutive characters in the target data are matched against the vocabulary from left to right, and when a match is found a word is cut off. There is one catch: to achieve a maximum match, the text cannot simply be cut at the first match. Take the following text to be segmented, in which each element is one Chinese character, given here in pinyin ("Hongshan Jiedao" is Hongshan Subdistrict):
Content[] = {"Hong", "Shan", "Jie", "Dao", "Shuang", "He", "She", "Qu", ...}
Vocabulary: Dict[] = {"Changsha", "Kaifu Qu", "Hongshan", "Hongshan Jiedao", ...}
(1) Scanning starts at content[1]. On reaching content[2], "Hongshan" is found in the vocabulary dict[]. It cannot be cut off yet, because it is not known whether the following characters will form a longer word (the maximum match);
(2) Scanning continues to content[3]. "Hongshan Jie" is not a word in dict[], but it still cannot be concluded that the "Hongshan" found earlier is the longest word, because "Hongshan Jie" is a prefix of an entry in dict[];
(3) content[4] is scanned, and "Hongshan Jiedao" is found to be a word in dict[]. Scanning continues;
(4) When content[5] is scanned, "Hongshan Jiedao Shuang" is neither a word in the vocabulary nor a prefix of a word. The longest word found so far, "Hongshan Jiedao", can therefore be cut off.
It follows that the scan for a maximum match can stop only when the next scanned string is neither a word in the vocabulary nor a prefix of one. The forward maximum matching algorithm continues in a loop until the remaining text is segmented. For example, the address "Changsha Kaifu District Hongshan Subdistrict Shuanghe Community Fuyuan West Road No. 199 Contemporary Wanguo City Phase 3 Building 10 Unit 2 1706" is finally segmented as follows:
"Changsha | Kaifu District | Hongshan Subdistrict | Shuanghe Community | Fuyuan West Road | 199 | No. | Contemporary Wanguo City Phase 3 | 10 | Building | 2 | Unit | 1706".
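The scan-until-no-prefix logic described above can be sketched in Python as follows. The function name is hypothetical, and in the usage example Latin letters stand in for Chinese characters; a production implementation would more likely use a trie than a precomputed prefix set.

```python
from typing import Iterable, List


def forward_max_match(text: str, vocabulary: Iterable[str]) -> List[str]:
    """Forward maximum matching: keep extending the scan while the current
    string is still a word or a prefix of a word, and cut at the longest word."""
    words = set(vocabulary)
    # Every prefix of every word, so "is a prefix" is a set lookup.
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        longest = 1  # fall back to a single character if nothing matches
        end = pos + 1
        while end <= len(text) and text[pos:end] in prefixes:
            if text[pos:end] in words:
                longest = end - pos  # remember the longest word so far
            end += 1
        tokens.append(text[pos:pos + longest])
        pos += longest
    return tokens
```

With the vocabulary {"ab", "abcd", "e"}, the text "abcxabcde" is cut into "ab", "c", "x", "abcd", "e": the first "ab" is cut because "abcx" is no longer a prefix of any word, while the second scan continues past "ab" and "abc" to reach the longer word "abcd".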
Enterprise account name data is segmented with the bidirectional maximum matching Chinese word segmentation method. This method first performs forward maximum matching and reverse maximum matching segmentation separately, then compares the two results and applies a selection strategy depending on the outcome. For example, one of the two results can be chosen for output on the principle that the more coarse-grained words the better, and the fewer non-dictionary words and single-character words the better.
The forward maximum matching Chinese word segmentation algorithm has been described above. The reverse maximum matching algorithm is similar; the difference is the scan direction: substrings are taken from right to left for matching. The algorithm flow can be described as:
(1) Input the preprocessed sentence content to be segmented, and initialize index = content.length;
(2) Obtain the length of each sub-dictionary in the dictionary database;
(3) Obtain the length of the string to be segmented and compare it with the longest sub-dictionary in the dictionary database: if the longest sub-dictionary length is greater than the length of the string still to be segmented, take the remaining length of the string as the maximum length; otherwise segment with the maximum sub-dictionary length;
(4) Use binary search to find the sub-dictionary whose length equals the current maximum matching length. If that sub-dictionary is found, go to (5); otherwise reduce the maximum length by one and go to (4);
(5) Obtain the substring SubStr to be matched and look it up in the dictionary. If it is found, add it to List. If it is not found, check whether the length of SubStr is greater than 1: if so, remove the last-scanned character of SubStr (its leftmost character, since the scan runs right to left) and go to (5); otherwise set the cut mark and go to (6);
(6) Check whether index is greater than 1: if so, go to (3); otherwise save List and exit.
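A compact Python sketch of reverse maximum matching follows. It is a simplification of the steps above rather than a transcription: a single hash set replaces the length-grouped sub-dictionaries and the binary search, on the assumption that the "length" of a sub-dictionary means the length of the words it holds. The function name is hypothetical.

```python
from typing import Iterable, List


def reverse_max_match(text: str, vocabulary: Iterable[str]) -> List[str]:
    """Reverse maximum matching: take the longest candidate ending at the
    current position and shrink it from the left until it is a known word."""
    words = set(vocabulary)
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    end = len(text)  # plays the role of `index` in step (1)
    while end > 0:
        size = min(max_len, end)  # step (3): cap by the text that remains
        while size > 1 and text[end - size:end] not in words:
            size -= 1  # drop the leftmost character of the candidate
        tokens.append(text[end - size:end])  # single characters are kept as-is
        end -= size
    tokens.reverse()  # tokens were collected right to left
    return tokens
```

With the vocabulary {"ab", "bcd"}, the text "abcd" is cut into "a", "bcd", whereas a left-to-right scan would commit to "ab" first. This difference is what the bidirectional method exploits.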
The bidirectional maximum matching method combines the forward and reverse matching algorithms. The string to be segmented is first segmented separately with the forward maximum matching and reverse maximum matching algorithms, and the two results are compared. When the results of the two directions are the same, that result is returned; when they differ, the result with fewer words is returned; when the numbers of words are equal, the reverse result is returned. The steps of the bidirectional maximum matching Chinese word segmentation algorithm are as follows:
(1) Input the sentence content to be segmented;
(2) After preprocessing content, segment it separately with the forward maximum matching algorithm and the reverse maximum matching algorithm, and compare the results. If the results are the same, go to (3); if they differ, go to (4);
(3) Select either result, output it, and end the algorithm;
(4) Compare whether the numbers of words are the same. If they are, choose the reverse result, output it, and end the algorithm; otherwise choose the result with the smaller number of words, output it, and end the algorithm.
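The selection in steps (3) and (4) amounts to a few comparisons. The sketch below takes the two segmentation results as plain token lists, so it does not depend on how they were produced; the function name is hypothetical.

```python
from typing import List


def choose_segmentation(forward: List[str], reverse: List[str]) -> List[str]:
    """Pick one result following steps (3) and (4) of the bidirectional method."""
    if forward == reverse:
        return forward  # identical results: either one will do
    if len(forward) == len(reverse):
        return reverse  # same number of words: prefer the reverse result
    # Different number of words: prefer the result with fewer words.
    return forward if len(forward) < len(reverse) else reverse
```

For example, given the forward result "ab", "c", "d" and the reverse result "a", "bcd", the reverse result is chosen because it contains fewer words.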
S103: Determine the desensitization method for the target data according to the type and the length of the target data, and use that desensitization method to desensitize the sensitive data obtained after segmenting the target data.
Referring to Fig. 10, when the type of the target data is an electricity consumption address, S103 is performed as follows:
S201: Judge whether the length of the target data is greater than the second preset value; if so, execute S202; if not, execute S203;
S202: Determine that the desensitization method for the target data is the first electricity consumption address data desensitization method;
S204: Using the first electricity consumption address data desensitization method, extract the last 5 characters of the house number data and the province, city and district data from the segmentation result of the target data, to obtain the remaining data;
S205: Retain the last 5 characters of the house number data and the province, city and district data, and mask the remaining data of the target data, to obtain the desensitized target data;
S203: Determine that the desensitization method for the target data is the second electricity consumption address data desensitization method;
S206: Using the second electricity consumption address data desensitization method, extract the retained part of the target data according to the first tiered retention rule based on the length of the target data, and mask the remaining part of the target data, to obtain the desensitized target data.
For example, electricity consumption address data with a length of 10 characters or fewer is desensitized by the second electricity consumption address data desensitization method, with retention tiered by length: for a length of 5 characters or fewer, the first character and the last 2 characters are retained; for a length of 6 to 9 characters, the last 5 characters are retained.
Electricity consumption address data with a length of 10 characters or more is desensitized by the first electricity consumption address data desensitization method. An electricity consumption address generally consists of province, city, district/county, street/township, neighbourhood committee/village, road, residential community, and house number parts. The last 5 characters of the house number part are retained, the province, city, and district/county are retained, and all other parts are replaced with *. For example:
Shandong Province, Jinan City, Shizhong District, Mountains and Rivers Street, Overline Bridge North Neighbourhood Committee, Latitude 3 Road, Shandong Ankang Garden Community, 2-1-101 -> Shandong Province, Jinan City, Shizhong District, **********************1-101 (each * masks one character of the original Chinese address).
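The two address desensitization methods can be sketched as below. This is only an illustration under stated assumptions: the segmentation result is assumed to arrive as `(text, kind)` pairs, the kind labels (`"province"`, `"city"`, `"district"`, `"house_number"`) are hypothetical, the default threshold of 10 is taken from the example rather than from a stated value of the second preset value, and lengths of 6 or more under the threshold all use the "last 5 characters" tier.

```python
def mask_short_address(address):
    """Second method: tiered retention by length, using the tiers in the example."""
    n = len(address)
    if n <= 5:
        # Keep the first character and the last 2 characters.
        keep = {0} | set(range(max(n - 2, 0), n))
    else:
        # Keep the last 5 characters.
        keep = set(range(n - 5, n))
    return "".join(ch if i in keep else "*" for i, ch in enumerate(address))


def mask_long_address(segments):
    """First method: keep province/city/district and the house number's last 5 characters."""
    out = []
    for text, kind in segments:
        if kind in ("province", "city", "district"):
            out.append(text)
        elif kind == "house_number":
            out.append("*" * max(len(text) - 5, 0) + text[-5:])
        else:
            out.append("*" * len(text))
    return "".join(out)


def desensitize_address(segments, threshold=10):
    """S201: choose the method by comparing the length with the preset value."""
    address = "".join(text for text, _ in segments)
    if len(address) > threshold:
        return mask_long_address(segments)   # S202, S204, S205
    return mask_short_address(address)       # S203, S206
```

Masking one `*` per character keeps the output the same length as the input, which matches the example above.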
Referring to Fig. 11, when the type of the target data is enterprise-class account name, S103 is executed as follows:
S301: Judge whether the length of the target data is greater than the third preset value; if so, execute S302; if not, execute S303;
S302: Determine that the desensitization method of the target data is the first enterprise-class account name data desensitization method;
S304: Using the first enterprise-class account name data desensitization method, extract the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, and obtain the remaining data of the trade name data and the remaining data of the industry data;
S305: Mask the remaining data of the trade name data and the remaining data of the industry data, and retain the other data of the target data, to obtain the desensitized target data;
S303: Determine that the desensitization method of the target data is the second enterprise-class account name data desensitization method;
S306: Using the second enterprise-class account name data desensitization method, extract the retained part of the target data according to the second tiered retention rule, based on the length of the target data, and mask the remaining part of the target data to obtain the desensitized target data.
For example, enterprise-class account name data with a length of 6 characters or fewer is desensitized by the second enterprise-class account name data desensitization method, with retention tiered by length: for a length of 4 characters or fewer, 1 character is retained at each end; for a length of 5 to 6 characters, 2 characters are retained at each end.
Enterprise-class account name data with a length of 6 characters or more is desensitized by the first enterprise-class account name data desensitization method. An enterprise-class account name generally consists of four parts: region, trade name, industry, and company organization form. The region at the front and the organization form at the back are kept unchanged, and the trade name and the industry are masked. For the trade name part, the first character is retained and the rest is replaced with *; for the industry part, the last character is retained and the rest is replaced with *. For example:
Qingdao Huifeng Motor Manufacturing Co., Ltd. -> Qingdao Hui****zao Co., Ltd.;
Qingdao 2020 Business Services Co., Ltd. -> Qingdao Er******wu Co., Ltd.
Here "Hui" and "Er" ("two") are the first characters of the trade names, "zao" and "wu" are the last characters of the industry parts, and each * masks one Chinese character.
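The enterprise-class account name rules can be sketched in the same way. The `(text, kind)` input shape, the kind labels `"trade_name"` and `"industry"`, and the default threshold of 6 (read off the example, not a stated value of the third preset value) are assumptions made for illustration.

```python
def mask_short_name(name):
    """Second method: keep 1 character at each end (length <= 4) or 2 (longer)."""
    n = len(name)
    k = 1 if n <= 4 else 2
    if n <= 2 * k:
        return name  # nothing left to mask
    return name[:k] + "*" * (n - 2 * k) + name[-k:]


def mask_long_name(segments):
    """First method: mask trade name and industry, keep region and organization form."""
    out = []
    for text, kind in segments:
        if kind == "trade_name":
            out.append(text[:1] + "*" * (len(text) - 1))   # keep first character
        elif kind == "industry":
            out.append("*" * (len(text) - 1) + text[-1:])  # keep last character
        else:
            out.append(text)
    return "".join(out)


def desensitize_name(segments, threshold=6):
    """S301: choose the method by comparing the length with the preset value."""
    name = "".join(text for text, _ in segments)
    if len(name) > threshold:
        return mask_long_name(segments)   # S302, S304, S305
    return mask_short_name(name)          # S303, S306
```

With a two-character trade name and a four-character industry part, this produces four `*` characters between the retained first and last characters, as in the first example above.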
In the data desensitization processing method disclosed in this embodiment, the word segmentation benchmark dictionary is called before desensitization to segment the target data and obtain data with a certain structure, so that the parts carrying sensitive information are desensitized and all or most of the sensitive information is masked, which improves the effectiveness of data desensitization. The corresponding sub-dictionary in the word segmentation benchmark dictionary is called according to the type of the target data, and word segmentation uses the method corresponding to that type, which improves segmentation accuracy. The desensitization method is determined according to the type and the length of the target data, so that data of different types and lengths are desensitized differently, which improves the effectiveness of data desensitization.
Referring to Fig. 12, this embodiment discloses another data desensitization processing method, which specifically includes the following steps:
S401: Determine the type of the target data;
S402: Call the corresponding sub-dictionary in the word segmentation benchmark dictionary according to the type of the target data, and perform word segmentation using the word segmentation method corresponding to the type of the target data;
S403: Calculate the accuracy of the word segmentation result of the target data;
S404: Judge whether the accuracy of the word segmentation result of the target data is greater than the first preset value; if so, execute S405; if not, execute S406;
S405: Determine the desensitization method of the target data according to the type and the length of the target data, and use that desensitization method to desensitize the sensitive data obtained by segmenting the target data;
S406: Segment the target data based on a hidden Markov model, and execute S405.
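The control flow of S402 to S406 amounts to a simple fallback, sketched below. How the accuracy is calculated is not specified in this passage, so `accuracy` is a hypothetical callable supplied by the caller, as are the two segmenters.

```python
def segment_with_fallback(text, dict_segment, hmm_segment, accuracy, first_preset):
    """Dictionary-based segmentation first; HMM segmentation only when needed."""
    words = dict_segment(text)                 # S402
    if accuracy(text, words) > first_preset:   # S403, S404
        return words                           # continue to S405 with this result
    return hmm_segment(text)                   # S406, then S405
```

Because the comparison is strictly "greater than", an accuracy exactly equal to the first preset value falls through to the HMM branch.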
A hidden Markov model (HMM) is used to perform Chinese word segmentation on the two types of data, enterprise-class account names and electricity consumption addresses. Given a training corpus that is large enough and covers enough domains, the HMM algorithm can achieve high segmentation accuracy. This kind of segmentation algorithm models Chinese on the basis of manually annotated parts of speech and statistical features, and the model parameters are estimated by training on the observed data (the annotated corpus). In the segmentation stage, the model is used to calculate the probability of each candidate segmentation, and the segmentation with the highest probability is taken as the final result. HMM is a common sequence labeling model; it handles ambiguity and out-of-vocabulary words well, and it performs better than methods based on string matching.
A hidden Markov model is a doubly stochastic process. The specific state sequence is unknown and only the state transition probabilities are known; that is, the state transition process of the model is not observable (hidden), while the random process of the observable events is a random function of the hidden state transition process.
An HMM consists of:
the number of states in the model, N;
the number of distinct symbols that each state may output, M;
the state transition probability matrix A = {a_ij}, where a_ij is the probability of transferring from state C_i to state C_j;
the probability distribution matrix B = {b_j(k)} of observing a particular symbol O_k in state C_j, where the probability of observing a symbol is also called the symbol emission probability;
the probability distribution of the initial state, π = {π_i}.
In general, an HMM is written as a five-tuple μ = (C, O, A, B, π), where C is the set of states, O is the set of output symbols, and π, A, and B are the initial state probability distribution, the state transition probabilities, and the symbol emission probabilities, respectively.
For Chinese word segmentation, the HMM is trained on a corpus. Using the classical character labeling model, the set of four class labels is C = {B, E, M, S}, with the following meanings:
B: the beginning of a word
E: the end of a word
M: the middle of a word
S: a single character that forms a word by itself
After the corpus is labeled with the four class labels, an HMM can be built by statistical methods, in which the label of each character is influenced only by the label of the previous character. The state transition matrix A and the symbol emission probability B of the HMM are obtained from frequency counts, where C = {B, E, M, S}, O = {character set}, and Count denotes frequency. When B_ij is calculated, because of data sparsity many characters do not appear in the training set, which causes probabilities of 0 to appear in B; to fix this problem, add-1 data smoothing is used.
The initial vector is set to π = {0.5, 0.0, 0.0, 0.5}, because M and E cannot appear at the beginning of a sentence. This completes the construction of the HMM. Based on this HMM, for a given observation sequence, a hidden sequence over {B, E, M, S} is obtained with the Viterbi algorithm.
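The counting-based construction can be sketched as follows. The exact estimation formulas are not reproduced in this text, so the sketch assumes conventional relative-frequency estimates: each row of A is normalized by the outgoing transition counts of the state, and B uses add-one (Laplace) smoothing with the vocabulary size added to the denominator so that each row sums to 1. The function names and the input format (a list of already segmented sentences) are hypothetical.

```python
from collections import Counter

STATES = "BEMS"


def tag_word(word):
    """Label the characters of one word: S for a single character, else B, M..., E."""
    if len(word) == 1:
        return "S"
    return "B" + "M" * (len(word) - 2) + "E"


def train_hmm(segmented_sentences):
    """Estimate pi, A, and B from sentences given as lists of words."""
    trans, emit, state_count = Counter(), Counter(), Counter()
    vocab = set()
    for words in segmented_sentences:
        chars = "".join(words)
        tags = "".join(tag_word(w) for w in words)
        vocab.update(chars)
        for ch, tag in zip(chars, tags):
            emit[(tag, ch)] += 1
            state_count[tag] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[(prev, cur)] += 1

    A = {}
    for s in STATES:
        total = sum(trans[(s, t)] for t in STATES)
        A[s] = {t: (trans[(s, t)] / total if total else 0.0) for t in STATES}

    # Add-one smoothing (assumed form): no emission probability is exactly 0.
    v = len(vocab)
    B = {s: {ch: (emit[(s, ch)] + 1) / (state_count[s] + v) for ch in vocab}
         for s in STATES}

    # Initial distribution as given in the text: only B and S can start a sentence.
    pi = {"B": 0.5, "E": 0.0, "M": 0.0, "S": 0.5}
    return pi, A, B
```

Characters never seen in training are still absent from `B`; a decoder needs its own fallback value for them.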
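The Viterbi search algorithm is:
1. Initialization: $\delta_1(i) = \pi_i b_i(O_1)$, $1 \le i \le N$,
with the path variable recording the maximum-probability path: $\psi_1(i) = 0$;
2. Recursion: $\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(O_t)$, $2 \le t \le T$, $1 \le j \le N$;
3. Recording the backtracking path: $\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]$, $2 \le t \le T$, $1 \le j \le N$;
4. Termination: $\hat{q}_T = \arg\max_{1 \le i \le N} \delta_T(i)$, $\hat{P} = \max_{1 \le i \le N} \delta_T(i)$;
The path (state sequence) is obtained by backtracking: $\hat{q}_t = \psi_{t+1}(\hat{q}_{t+1})$, $t = T-1, T-2, \dots, 1$.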
The time complexity of the Viterbi algorithm is O(N²T). For example, for the address "Changsha City, Kaifu District, Hongshan Street, Shuanghe Community, Fuyuan West Road No. 199, Dangdai Wanguocheng Phase 3, Building 10, Unit 2, 1706", the output state sequence is:
"BMEBMEBMMEBMMEBMMEBMMEBMMMMMEBMEBMEBMME"
According to this state sequence, the text is cut as follows:
"BME|BME|BMME|BMME|BMME|BMME|BMMMMME|BME|BME|BMME"
The final Chinese word segmentation result is:
"Changsha City | Kaifu District | Hongshan Street | Shuanghe Community | Fuyuan West Road | No. 199 | Dangdai Wanguocheng Phase 3 | Building 10 | Unit 2 | 1706".
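Decoding and cutting can be sketched as below. This is a plain-probability Viterbi decoder for illustration only: the `unseen` fallback for characters missing from B is an assumption, and a practical version would likely work with log probabilities to avoid underflow on long inputs. States are assumed to be single-character labels such as B, E, M, S.

```python
def viterbi(obs, states, pi, A, B, unseen=1e-8):
    """Return the most probable state sequence for the observed characters."""
    if not obs:
        return ""
    # delta[t][s]: highest probability of any path that ends in state s at time t.
    delta = [{s: pi[s] * B[s].get(obs[0], unseen) for s in states}]
    psi = [{}]  # psi[t][s]: best predecessor of state s at time t
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for s in states:
            prev = max(states, key=lambda r: delta[t - 1][r] * A[r][s])
            delta[t][s] = delta[t - 1][prev] * A[prev][s] * B[s].get(obs[t], unseen)
            psi[t][s] = prev
    # Termination: best final state, then follow the back-pointers.
    path = [max(states, key=lambda s: delta[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return "".join(path)


def tags_to_words(text, tags):
    """Cut the text after every E (end of word) or S (single-character word)."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])  # trailing characters without a closing E or S
    return words
```

The two nested loops over states inside the loop over positions give the O(N²T) cost stated above.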
The data desensitization processing method disclosed in this embodiment first segments the target data with the forward maximum matching method or the bidirectional maximum matching Chinese word segmentation method, which have lower algorithmic complexity, ensuring the processing speed of word segmentation. The accuracy of the word segmentation result is then calculated; when it is below the threshold, the target data is segmented with a hidden Markov model, which has higher algorithmic complexity but also higher segmentation accuracy, ensuring the accuracy of the word segmentation result.
Based on the data desensitization processing method disclosed in the above embodiments, and referring to Fig. 13, this embodiment correspondingly discloses a data desensitization processing device, including:
a type determining unit 501, configured to determine the type of target data;
a first word segmentation processing unit 502, configured to call the corresponding sub-dictionary in the word segmentation benchmark dictionary according to the type of the target data, and to perform word segmentation using the word segmentation method corresponding to the type of the target data;
a desensitization processing unit 503, configured to determine the desensitization method of the target data according to the type and the length of the target data, and to use that desensitization method to desensitize the sensitive data obtained by segmenting the target data.
Optionally, the device further includes:
a dictionary construction unit, configured to build the word segmentation benchmark dictionary, where the word segmentation benchmark dictionary includes multiple sub-dictionaries and each sub-dictionary contains one type of sensitive word.
Optionally, when the type of the target data is electricity consumption address, the first word segmentation processing unit 502 is specifically configured to:
call the general address sub-dictionary, the name sub-dictionary, the residential community name sub-dictionary, and the administrative division set sub-dictionary, and segment the target data by forward maximum matching Chinese word segmentation.
Optionally, when the type of the target data is enterprise-class account name, the first word segmentation processing unit 502 is specifically configured to:
call the region set sub-dictionary, the industry set sub-dictionary, and the company organization form set sub-dictionary, and perform word segmentation by the bidirectional maximum matching Chinese word segmentation method.
Optionally, the device further includes:
a computing unit, configured to calculate the accuracy of the word segmentation result of the target data;
a judging unit, configured to judge whether the accuracy of the word segmentation result of the target data is greater than the first preset value;
if so, the desensitization processing unit is triggered;
if not, a second word segmentation processing unit is triggered, the second word segmentation processing unit being configured to segment the target data based on a hidden Markov model and to trigger the desensitization processing unit.
Optionally, when the type of the target data is electricity consumption address, the desensitization processing unit 503 includes:
a first judging subunit, configured to judge whether the length of the target data is greater than the second preset value;
a first determining subunit, configured to determine, when the length of the target data is greater than the second preset value, that the desensitization method of the target data is the first electricity consumption address data desensitization method;
a first extraction subunit, configured to use the first electricity consumption address data desensitization method to extract the last 5 characters of the house number and the province, city, and district/county data from the word segmentation result of the target data, and to obtain the remaining data;
a first desensitization processing subunit, configured to retain the last 5 characters of the house number and the province, city, and district/county data, and to mask the remaining data of the target data to obtain the desensitized target data;
a second determining subunit, configured to determine, when the length of the target data is not greater than the second preset value, that the desensitization method of the target data is the second electricity consumption address data desensitization method;
a second desensitization processing subunit, configured to use the second electricity consumption address data desensitization method to extract the retained part of the target data according to the first tiered retention rule, based on the length of the target data, and to mask the remaining part of the target data to obtain the desensitized target data.
Optionally, when the type of the target data is enterprise-class account name, the desensitization processing unit 503 includes:
a second judging subunit, configured to judge whether the length of the target data is greater than the third preset value;
a third determining subunit, configured to determine, when the length of the target data is greater than the third preset value, that the desensitization method of the target data is the first enterprise-class account name data desensitization method;
a second extraction subunit, configured to use the first enterprise-class account name data desensitization method to extract the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, and to obtain the remaining data of the trade name data and the remaining data of the industry data;
a third desensitization processing subunit, configured to mask the remaining data of the trade name data and the remaining data of the industry data, and to retain the other data of the target data, to obtain the desensitized target data;
a fourth determining subunit, configured to determine, when the length of the target data is not greater than the third preset value, that the desensitization method of the target data is the second enterprise-class account name data desensitization method;
a fourth desensitization processing subunit, configured to use the second enterprise-class account name data desensitization method to extract the retained part of the target data according to the second tiered retention rule, based on the length of the target data, and to mask the remaining part of the target data to obtain the desensitized target data.
In the data desensitization processing device disclosed in this embodiment, the word segmentation benchmark dictionary is called before desensitization to segment the target data and obtain data with a certain structure, so that the parts carrying sensitive information are desensitized and all or most of the sensitive information is masked, which improves the effectiveness of data desensitization. The corresponding sub-dictionary in the word segmentation benchmark dictionary is called according to the type of the target data, and word segmentation uses the method corresponding to that type, which improves segmentation accuracy. The desensitization method is determined according to the type and the length of the target data, so that data of different types and lengths are desensitized differently, which improves the effectiveness of data desensitization.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (14)
1. A data desensitization processing method, characterized by comprising:
determining the type of target data;
calling the corresponding sub-dictionary in a word segmentation benchmark dictionary according to the type of the target data, and performing word segmentation using the word segmentation method corresponding to the type of the target data;
determining the desensitization method of the target data according to the type and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained by segmenting the target data.
2. The method according to claim 1, characterized in that the method further comprises:
building the word segmentation benchmark dictionary, where the word segmentation benchmark dictionary includes multiple sub-dictionaries and each sub-dictionary contains one type of sensitive word.
3. The method according to claim 1, characterized in that, when the type of the target data is electricity consumption address, calling the corresponding sub-dictionary in the word segmentation benchmark dictionary according to the type of the target data and performing word segmentation using the word segmentation method corresponding to the type of the target data comprises:
calling the general address sub-dictionary, the name sub-dictionary, the residential community name sub-dictionary, and the administrative division set sub-dictionary, and segmenting the target data by forward maximum matching Chinese word segmentation.
4. The method according to claim 1, characterized in that, when the type of the target data is enterprise-class account name, calling the corresponding sub-dictionary in the word segmentation benchmark dictionary according to the type of the target data and performing word segmentation using the word segmentation method corresponding to the type of the target data comprises:
calling the region set sub-dictionary, the industry set sub-dictionary, and the company organization form set sub-dictionary, and performing word segmentation by the bidirectional maximum matching Chinese word segmentation method.
5. The method according to claim 1, characterized in that, before determining the desensitization method of the target data according to the type and the length of the target data, the method further comprises:
calculating the accuracy of the word segmentation result of the target data;
judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value;
if so, executing the step of determining the desensitization method of the target data according to the type and the length of the target data;
if not, segmenting the target data based on a hidden Markov model, and executing the step of determining the desensitization method of the target data according to the type and the length of the target data.
6. The method according to claim 1, characterized in that, when the type of the target data is electricity consumption address, determining the desensitization method of the target data according to the type and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained by segmenting the target data, comprises:
judging whether the length of the target data is greater than a second preset value;
when the length of the target data is greater than the second preset value, determining that the desensitization method of the target data is a first electricity consumption address data desensitization method;
using the first electricity consumption address data desensitization method, extracting the last 5 characters of the house number and the province, city, and district/county data from the word segmentation result of the target data, and obtaining the remaining data;
retaining the last 5 characters of the house number and the province, city, and district/county data, and masking the remaining data of the target data to obtain the desensitized target data;
when the length of the target data is not greater than the second preset value, determining that the desensitization method of the target data is a second electricity consumption address data desensitization method;
using the second electricity consumption address data desensitization method, extracting the retained part of the target data according to a first tiered retention rule, based on the length of the target data, and masking the remaining part of the target data to obtain the desensitized target data.
7. The method according to claim 1, characterized in that, when the type of the target data is enterprise-class account name, determining the desensitization method of the target data according to the type and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained by segmenting the target data, comprises:
judging whether the length of the target data is greater than a third preset value;
when the length of the target data is greater than the third preset value, determining that the desensitization method of the target data is a first enterprise-class account name data desensitization method;
using the first enterprise-class account name data desensitization method, extracting the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, and obtaining the remaining data of the trade name data and the remaining data of the industry data;
masking the remaining data of the trade name data and the remaining data of the industry data, and retaining the other data of the target data, to obtain the desensitized target data;
when the length of the target data is not greater than the third preset value, determining that the desensitization method of the target data is a second enterprise-class account name data desensitization method;
using the second enterprise-class account name data desensitization method, extracting the retained part of the target data according to a second tiered retention rule, based on the length of the target data, and masking the remaining part of the target data to obtain the desensitized target data.
8. A data desensitization processing device, characterized by comprising:
a type determining unit, configured to determine the type of target data;
a first word segmentation processing unit, configured to call the corresponding sub-dictionary in a word segmentation benchmark dictionary according to the type of the target data, and to perform word segmentation using the word segmentation method corresponding to the type of the target data;
a desensitization processing unit, configured to determine the desensitization method of the target data according to the type and the length of the target data, and to use that desensitization method to desensitize the sensitive data obtained by segmenting the target data.
9. The device according to claim 8, characterized in that the device further comprises:
a dictionary construction unit, configured to build the word segmentation benchmark dictionary, where the word segmentation benchmark dictionary includes multiple sub-dictionaries and each sub-dictionary contains one type of sensitive word.
10. The device according to claim 8, characterized in that, when the type of the target data is electricity consumption address, the first word segmentation processing unit is specifically configured to:
call the general address sub-dictionary, the name sub-dictionary, the residential community name sub-dictionary, and the administrative division set sub-dictionary, and segment the target data by forward maximum matching Chinese word segmentation.
11. The device according to claim 8, characterized in that, when the type of the target data is enterprise-class account name, the first word segmentation processing unit is specifically configured to:
call the region set sub-dictionary, the industry set sub-dictionary, and the company organization form set sub-dictionary, and perform word segmentation by the bidirectional maximum matching Chinese word segmentation method.
12. The device according to claim 8, characterized in that the device further comprises:
a computing unit, configured to calculate the accuracy of the word segmentation result of the target data;
a judging unit, configured to judge whether the accuracy of the word segmentation result of the target data is greater than a first preset value;
if so, the desensitization processing unit is triggered;
if not, a second word segmentation processing unit is triggered, the second word segmentation processing unit being configured to segment the target data based on a hidden Markov model and to trigger the desensitization processing unit.
13. The device according to claim 8, characterized in that, when the type of the target data is electricity consumption address, the desensitization processing unit comprises:
a first judging subunit, configured to judge whether the length of the target data is greater than a second preset value;
a first determining subunit, configured to determine, when the length of the target data is greater than the second preset value, that the desensitization method of the target data is a first electricity consumption address data desensitization method;
a first extraction subunit, configured to use the first electricity consumption address data desensitization method to extract the last 5 characters of the house number and the province, city, and district/county data from the word segmentation result of the target data, and to obtain the remaining data;
a first desensitization processing subunit, configured to retain the last 5 characters of the house number and the province, city, and district/county data, and to mask the remaining data of the target data to obtain the desensitized target data;
a second determining subunit, configured to determine, when the length of the target data is not greater than the second preset value, that the desensitization method of the target data is a second electricity consumption address data desensitization method;
a second desensitization processing subunit, configured to use the second electricity consumption address data desensitization method to extract the retained part of the target data according to a first tiered retention rule, based on the length of the target data, and to mask the remaining part of the target data to obtain the desensitized target data.
14. The device according to claim 8, characterized in that, when the type of the target data is enterprise-class account name, the desensitization processing unit comprises:
a second judging subunit, configured to judge whether the length of the target data is greater than a third preset value;
a third determining subunit, configured to determine, when the length of the target data is greater than the third preset value, that the desensitization method of the target data is a first enterprise-class account name data desensitization method;
a second extraction subunit, configured to use the first enterprise-class account name data desensitization method to extract the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, and to obtain the remaining data of the trade name data and the remaining data of the industry data;
a third desensitization processing subunit, configured to mask the remaining data of the trade name data and the remaining data of the industry data, and to retain the other data of the target data, to obtain the desensitized target data;
a fourth determining subunit, configured to determine, when the length of the target data is not greater than the third preset value, that the desensitization method of the target data is a second enterprise-class account name data desensitization method;
a fourth desensitization processing subunit, configured to use the second enterprise-class account name data desensitization method to extract the retained part of the target data according to a second tiered retention rule, based on the length of the target data, and to mask the remaining part of the target data to obtain the desensitized target data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810586230.9A CN108776762B (en) | 2018-06-08 | 2018-06-08 | Data desensitization processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810586230.9A CN108776762B (en) | 2018-06-08 | 2018-06-08 | Data desensitization processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108776762A true CN108776762A (en) | 2018-11-09 |
CN108776762B CN108776762B (en) | 2022-01-28 |
Family
ID=64025970
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810586230.9A Active CN108776762B (en) | 2018-06-08 | 2018-06-08 | Data desensitization processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108776762B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104731976A (en) * | 2015-04-14 | 2015-06-24 | 海量云图(北京)数据技术有限公司 | Method for finding and sorting private data in data table |
CN104750852A (en) * | 2015-04-14 | 2015-07-01 | 海量云图(北京)数据技术有限公司 | Method for finding and classifying Chinese address data |
EP2998903A1 (en) * | 2014-09-18 | 2016-03-23 | Kaspersky Lab, ZAO | System and method for robust full-drive encryption |
CN106909630A (en) * | 2017-01-26 | 2017-06-30 | 武汉奇米网络科技有限公司 | Filtering sensitive words method and system based on dynamic dictionary |
CN107145799A (en) * | 2017-05-04 | 2017-09-08 | 山东浪潮云服务信息科技有限公司 | A kind of data desensitization method and device |
CN107480549A (en) * | 2017-06-28 | 2017-12-15 | 银江股份有限公司 | A kind of shared sensitive information desensitization method of data-oriented and system |
CN107609418A (en) * | 2017-08-31 | 2018-01-19 | 深圳市牛鼎丰科技有限公司 | Desensitization method, device, storage device and the computer equipment of text data |
CN107885876A (en) * | 2017-11-29 | 2018-04-06 | 北京安华金和科技有限公司 | A kind of dynamic desensitization method rewritten based on SQL statement |
CN107992771A (en) * | 2017-12-20 | 2018-05-04 | 北京明朝万达科技股份有限公司 | A kind of data desensitization method and device |
Non-Patent Citations (1)
Title |
---|
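Chen Tianying et al.: "Intelligent Data Desensitization System in a Big Data Environment" (大数据环境下的智能数据脱敏系统), Communications Technology (《通信技术》) * |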
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111382457A (en) * | 2018-12-28 | 2020-07-07 | 神州数码医疗科技股份有限公司 | Data risk assessment method and device |
CN111382457B (en) * | 2018-12-28 | 2023-08-18 | 神州数码医疗科技股份有限公司 | Data risk assessment method and device |
CN111767565A (en) * | 2019-03-15 | 2020-10-13 | 北京京东尚科信息技术有限公司 | Data desensitization processing method, processing device and storage medium |
CN111767565B (en) * | 2019-03-15 | 2024-04-12 | 北京京东尚科信息技术有限公司 | Data desensitization processing method, processing device and storage medium |
CN110610196A (en) * | 2019-08-14 | 2019-12-24 | 平安科技(深圳)有限公司 | Desensitization method, system, computer device and computer-readable storage medium |
CN110532805A (en) * | 2019-09-05 | 2019-12-03 | 国网山西省电力公司阳泉供电公司 | Data desensitization method and device |
CN110532805B (en) * | 2019-09-05 | 2023-01-24 | 国网山西省电力公司阳泉供电公司 | Data desensitization method and device |
CN110750984A (en) * | 2019-10-24 | 2020-02-04 | 深圳前海微众银行股份有限公司 | Command line character string processing method, terminal, device and readable storage medium |
CN110750984B (en) * | 2019-10-24 | 2023-11-21 | 深圳前海微众银行股份有限公司 | Command line character string processing method, terminal, device and readable storage medium |
CN110851864A (en) * | 2019-11-08 | 2020-02-28 | 国网浙江省电力有限公司信息通信分公司 | Sensitive data automatic identification and processing method and system |
CN110928931A (en) * | 2020-02-17 | 2020-03-27 | 深圳市琦迹技术服务有限公司 | Sensitive data processing method and device, electronic equipment and storage medium |
CN112132238A (en) * | 2020-11-23 | 2020-12-25 | 支付宝(杭州)信息技术有限公司 | Method, device, equipment and readable medium for identifying private data |
CN116719907A (en) * | 2023-06-26 | 2023-09-08 | 阿波罗智联(北京)科技有限公司 | Data processing method, device, equipment and storage medium |
CN117272996A (en) * | 2023-11-23 | 2023-12-22 | 山东网安安全技术有限公司 | Data desensitization system |
CN117272996B (en) * | 2023-11-23 | 2024-02-27 | 山东网安安全技术有限公司 | Data desensitization system |
Also Published As
Publication number | Publication date |
---|---|
CN108776762B (en) | 2022-01-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||