CN110188571A - Desensitization method and system based on sensitive data - Google Patents
Desensitization method and system based on sensitive data
- Publication number
- CN110188571A (CN201910486536.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- desensitization
- desensitized
- character string
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Medical Informatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a desensitization method based on sensitive data. The method obtains data to be desensitized. When the volume of the data to be desensitized meets a k-means algorithm threshold, an improved k-means algorithm divides the data into groups, and Laplace noise is added to desensitize the grouped data. When the volume of the data to be desensitized meets a matching-replacement threshold, the data is desensitized with a matching-replacement method, which replaces the data to be desensitized according to a preset variable type, on a keyword basis. The method is flexibly configurable and supports multiple data sources and multiple data desensitization algorithms.
Description
Technical field
The invention belongs to the technical field of data desensitization, and in particular relates to a desensitization method and system based on sensitive data.
Background art
The protection of personal privacy has attracted widespread public concern. Not only China but also the European Union and the United States have enacted new legislation to protect personal information. If the leakage of personal privacy caused by publishing or sharing big data cannot be resolved, data publishers and users face serious legal risk, which in turn hinders the application and development of big data technology.
To address privacy protection, Samarati and Sweeney first proposed the concept of anonymization in 1998. To achieve anonymization during data sharing, traditional desensitization algorithms were used at first; these protect user privacy by applying operations such as shuffling, masking and uniform generalization directly to the data.
Generalization anonymizes certain quasi-identifier attributes in the data by replacing a specific value with a value range that describes the attribute. Generalization operations include value generalization and domain generalization. Domain generalization is also called global recoding. Taking a telephone number as an example, 88888888 is generalized to 8888888*, which expresses a larger range. It can then be generalized further to 888888**, and so on up to ********. The hierarchy formed by generalizing the value domain of an attribute multiple times is called the domain generalization hierarchy. The higher the generalization level, the greater the information loss. Value generalization is also called local recoding; it generalizes each value in the original attribute domain directly to some value in the general domain. Value generalization can likewise be represented by a generalization hierarchy. Compared with domain generalization, value generalization is more flexible and can effectively reduce the information loss caused by generalization.
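The telephone-number example above can be sketched in a few lines of Python. This is only an illustration of a domain generalization hierarchy; the function name `generalize` and the `level` parameter are hypothetical and do not come from the patent.

```python
def generalize(value: str, level: int, mask: str = "*") -> str:
    """Replace the last `level` characters of `value` with `mask`.

    Level 0 returns the value unchanged; a level equal to or greater than
    the length of the value returns a fully masked string.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    level = min(level, len(value))
    return value[: len(value) - level] + mask * level


# Build the whole hierarchy for the example number, from level 0 to level 8.
hierarchy = [generalize("88888888", k) for k in range(9)]
```

Each step up the hierarchy hides one more trailing digit, which is why the information loss grows with the generalization level.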
Suppression can be regarded as the highest level of generalization: the original value is replaced with the most general value, for example by replacing all attribute values of a column with one fixed attribute value. During data anonymization, suppression is generally applied when some tuples cannot satisfy the anonymity rules. The record containing a suppressed attribute value can be deleted from the data table directly, or the corresponding attribute value can be replaced with a uniform attribute value so as to preserve statistical properties.
However, existing desensitization methods still struggle to meet privacy protection requirements in the big data setting.
Summary of the invention
To address the defects of the prior art, the invention provides a desensitization method and system based on sensitive data that is flexibly configurable and supports multiple data sources and multiple data desensitization algorithms.
In a first aspect, a desensitization method based on sensitive data comprises:
Obtain data to be desensitized;
When the volume of the data to be desensitized meets the k-means algorithm threshold, divide the data to be desensitized into groups using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
When the volume of the data to be desensitized meets the matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method comprises replacing the data to be desensitized according to a preset variable type, on a keyword basis.
Preferably, the variable types include numeric value, character string, time and regular expression.
Preferably, when the variable type is numeric value, the matching-replacement method comprises:
Extract the numeric characters of the data to be desensitized;
Calculate the numeric length of the numeric characters and perform out-of-range handling to obtain an initial value;
Convert the initial value into a character string;
According to the numeric length and a preset numeric desensitization range, convert the designated characters in the character string into mask characters;
Digitize the converted character string and output it.
Preferably, when the variable type is character string, the matching-replacement method comprises:
Extract the character string of the data to be desensitized;
When part of the character string falls within a preset string desensitization range, replace that part with a mask and output the result.
Preferably, when the variable type is time, the matching-replacement method comprises:
Extract the time information of the data to be desensitized;
If the time information is valid, convert it into a preset seconds-based time format to obtain initial time information;
When part of the initial time information falls within a preset time desensitization range, replace that part with 0;
Convert the replaced time information into a preset standard time and output it.
Preferably, when the variable type is regular expression, the matching-replacement method comprises:
Start identifying from the first position of the data to be desensitized and continue until the last position of the data has been examined;
When data matching the regular expression is found, replace that data with a preset desensitization symbol;
Output the replaced data.
Preferably, dividing the data to be desensitized into groups using the improved k-means algorithm, and adding Laplace noise to desensitize the grouped data, specifically comprises:
Set the cluster centers;
Obtain a data set containing the statistical classification results and calculate the mean vector of the data set;
Calculate the distance between each data vector in the data set and the mean vector, and define it as the current distance;
If the current distance is less than a preset minimum distance, update the minimum distance to the current distance;
Regroup the data to be desensitized, save the newly obtained mean vectors, and count the number of mean vectors;
Add Laplace noise according to each group, and apply the calculation to that group;
Output the data to be desensitized after the calculation.
In a second aspect, a desensitization system based on sensitive data comprises:
An acquisition unit, configured to obtain data to be desensitized;
A k-means analysis unit, configured to, when the volume of the data to be desensitized meets the k-means algorithm threshold, divide the data to be desensitized into groups using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
A matching-replacement unit, configured to, when the volume of the data to be desensitized meets the matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method comprises replacing the data to be desensitized according to a preset variable type, on a keyword basis.
Preferably, the variable types include numeric value, character string, time and regular expression;
The matching-replacement unit is specifically configured as follows:
When the variable type is numeric value, the matching-replacement method comprises:
Extract the numeric characters of the data to be desensitized;
Calculate the numeric length of the numeric characters and perform out-of-range handling to obtain an initial value;
Convert the initial value into a character string;
According to the numeric length and a preset numeric desensitization range, convert the designated characters in the character string into mask characters;
Digitize the converted character string and output it;
When the variable type is character string, the matching-replacement method comprises:
Extract the character string of the data to be desensitized;
When part of the character string falls within a preset string desensitization range, replace that part with a mask and output the result;
When the variable type is time, the matching-replacement method comprises:
Extract the time information of the data to be desensitized;
If the time information is valid, convert it into a preset seconds-based time format to obtain initial time information;
When part of the initial time information falls within a preset time desensitization range, replace that part with 0;
Convert the replaced time information into a preset standard time and output it;
When the variable type is regular expression, the matching-replacement method comprises:
Start identifying from the first position of the data to be desensitized and continue until the last position of the data has been examined;
When data matching the regular expression is found, replace that data with a preset desensitization symbol;
Output the replaced data.
Preferably, the k-means analysis unit is specifically configured to:
Set the cluster centers;
Obtain a data set containing the statistical classification results and calculate the mean vector of the data set;
Calculate the distance between each data vector in the data set and the mean vector, and define it as the current distance;
If the current distance is less than a preset minimum distance, update the minimum distance to the current distance;
Regroup the data to be desensitized, save the newly obtained mean vectors, and count the number of mean vectors;
Add Laplace noise according to each group, and apply the calculation to that group;
Output the data to be desensitized after the calculation.
As the above technical solution shows, the desensitization method and system based on sensitive data provided by the invention are flexibly configurable and support multiple data sources and multiple data desensitization algorithms.
Description of the drawings
To illustrate the specific embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. Throughout the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, elements or parts are not necessarily drawn to scale.
Fig. 1 shows the desensitization framework to which Embodiment 1 of the invention applies.
Fig. 2 is a flow chart of the desensitization method based on sensitive data provided by Embodiment 1 of the invention.
Fig. 3 is a flow chart of the matching-replacement method for numeric values provided by Embodiment 2 of the invention.
Fig. 4 is a flow chart of the matching-replacement method for character strings provided by Embodiment 2 of the invention.
Fig. 5 is a flow chart of the matching-replacement method for time provided by Embodiment 2 of the invention.
Fig. 6 is a flow chart of the matching-replacement method for regular expressions provided by Embodiment 2 of the invention.
Fig. 7 is a flow chart of desensitization using the improved k-means algorithm provided by Embodiment 3 of the invention.
Fig. 8 is a module block diagram of the desensitization system based on sensitive data provided by Embodiment 4 of the invention.
Detailed description of the embodiments
The embodiments of the technical solution of the invention are described in detail below with reference to the drawings. The following embodiments serve only to illustrate the technical solution more clearly; they are examples only and cannot be used to limit the scope of protection of the invention. Unless otherwise stated, technical or scientific terms used in this application have the ordinary meaning understood by those skilled in the art to which the invention belongs.
It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprises" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this description and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on context, as "once it is determined" or "in response to determining" or "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
The desensitization method provided by the invention is suited to the desensitization framework of Fig. 1, which is based on Spark parallel computing.
Embodiment 1:
A desensitization method based on sensitive data, referring to Fig. 2, comprises:
Obtain data to be desensitized;
When the volume of the data to be desensitized meets the k-means algorithm threshold, divide the data to be desensitized into groups using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
Specifically, for scenarios with a large data volume, the time and resources consumed by the matching-replacement approach are both uncontrollable. Therefore, based on the Spark framework, the data is grouped using the improved k-means algorithm and Laplace noise is then added, which desensitizes the data more quickly. This scheme processes each attribute in two parts: statistical classification and desensitization. Statistical classification counts the attribute values by category. Desensitization clusters and groups the data according to the statistical results and adds Laplace noise.
When the volume of the data to be desensitized meets the matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method comprises replacing the data to be desensitized according to a preset variable type, on a keyword basis.
Specifically, the matching-replacement method is suited to desensitizing smaller volumes of data. Its effect is relatively stable and highly controllable. Matching replacement can desensitize both fixed values and regular expressions.
This method proposes the desensitization algorithms that can be used within the desensitization framework of Fig. 1: the matching-replacement algorithm works together with the approach of grouping by the improved k-means algorithm and then adding Laplace noise to realize data desensitization. By combining the two approaches, the method is flexibly configurable and supports multiple data sources and multiple data desensitization algorithms.
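The choice between the two approaches can be sketched as a simple dispatch on data volume. The patent does not state the threshold values or how "meets the threshold" is evaluated, so the parameters `kmeans_min` and `match_max`, the comparison directions, and the function name below are all assumptions made for illustration.

```python
from typing import Optional


def choose_strategy(n_records: int, kmeans_min: int, match_max: int) -> Optional[str]:
    """Pick a desensitization approach from the number of records.

    Assumption: the k-means threshold is a lower bound on data volume and
    the matching-replacement threshold is an upper bound.
    """
    if n_records >= kmeans_min:
        return "kmeans_laplace"   # large data: group, then add Laplace noise
    if n_records <= match_max:
        return "match_replace"    # small data: keyword-based replacement
    return None                   # neither threshold is met


small = choose_strategy(500, kmeans_min=100_000, match_max=99_999)
large = choose_strategy(2_000_000, kmeans_min=100_000, match_max=99_999)
```

Keeping the two thresholds as separate parameters mirrors the wording of the method, which names a k-means algorithm threshold and a matching-replacement threshold independently.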
Embodiment 2:
On the basis of Embodiment 1, Embodiment 2 adds the following:
The variable types include numeric value, character string, time and regular expression.
Referring to Fig. 3, when the variable type is numeric value, the matching-replacement method comprises:
Extract the numeric characters of the data to be desensitized;
Calculate the numeric length of the numeric characters and perform out-of-range handling to obtain an initial value;
Convert the initial value into a character string;
According to the numeric length and a preset numeric desensitization range, convert the designated characters in the character string into mask characters;
Digitize the converted character string and output it.
Specifically, for the numeric type, the approach is to extract the numeric characters and then, according to their numeric length (i.e. the number of digits) and the numeric desensitization range, convert the designated characters (i.e. digits), starting from the middle, into mask characters.
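A minimal sketch of the numeric case follows. It assumes that the "numeric desensitization range" is a count of digits to hide, that the masked run is centered in the number, and that out-of-range handling means clamping that count to the number of digits; none of these details is specified in the patent, and the name `mask_number` is hypothetical.

```python
def mask_number(value: int, mask_count: int, mask: str = "*") -> str:
    """Mask `mask_count` digits in the middle of `value`."""
    digits = str(abs(value))                   # numeric characters only
    length = len(digits)                       # numeric length (digit count)
    count = max(0, min(mask_count, length))    # out-of-range handling: clamp
    start = (length - count) // 2              # begin masking from the middle
    masked = digits[:start] + mask * count + digits[start + count:]
    return ("-" if value < 0 else "") + masked


masked_phone = mask_number(13812345678, 4)
```

The sketch returns a string because the default mask character is not a digit. If the result must be converted back into a number, as the final step of the method describes, a digit such as "0" could be passed as `mask` and the result wrapped in `int()`.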
Referring to Fig. 4, when the variable type is character string, the matching-replacement method comprises:
Extract the character string of the data to be desensitized;
When part of the character string falls within a preset string desensitization range, replace that part with a mask and output the result.
Specifically, for the character-string type, the approach is to first match the string against the string desensitization range and, when key characters that need desensitizing are found, replace them.
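The string case can be sketched as follows, assuming the "string desensitization range" is a collection of sensitive keywords. The function name `mask_string` and the choice to keep the masked text the same length as the original are illustrative assumptions.

```python
from typing import Iterable


def mask_string(text: str, sensitive_terms: Iterable[str], mask: str = "*") -> str:
    """Replace every occurrence of each sensitive term with mask characters."""
    # Longer terms first, so a short term cannot split a longer match.
    for term in sorted(sensitive_terms, key=len, reverse=True):
        if term:  # skip empty strings, which would match everywhere
            text = text.replace(term, mask * len(term))
    return text


masked_text = mask_string("Zhang San lives in Beijing", ["Zhang San", "Beijing"])
```

Text that contains none of the sensitive terms passes through unchanged, which matches the condition in the method that replacement happens only when part of the string falls within the range.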
Referring to Fig. 5, when the variable type is time, the matching-replacement method comprises:
Extract the time information of the data to be desensitized;
If the time information is valid, convert it into a preset seconds-based time format to obtain initial time information;
When part of the initial time information falls within a preset time desensitization range, replace that part with 0;
Convert the replaced time information into a preset standard time and output it.
Specifically, for the time type, the time information is first converted into the system's standard seconds-based time; part of the data is then replaced with 0 according to the time desensitization range; finally, system functions convert it back into standard time, achieving the desensitization effect.
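The time case can be sketched with the Python standard library. The sketch assumes the seconds-based format is a UTC epoch timestamp, that the "time desensitization range" is everything finer than a chosen granularity (one day by default), and that the input follows the format string shown; the name `mask_time` and its parameters are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def mask_time(text: str, granularity_seconds: int = 86400,
              fmt: str = "%Y-%m-%d %H:%M:%S") -> Optional[str]:
    """Zero out the part of a timestamp finer than `granularity_seconds`."""
    try:
        parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        return None  # invalid time information is not processed
    seconds = int(parsed.timestamp())                     # seconds-based time
    truncated = seconds - seconds % granularity_seconds   # replace detail with 0
    return datetime.fromtimestamp(truncated, tz=timezone.utc).strftime(fmt)


masked_day = mask_time("2019-06-05 14:23:59")
masked_hour = mask_time("2019-06-05 14:23:59", granularity_seconds=3600)
```

Working in seconds makes the replacement a single arithmetic step, after which the value is formatted back into a standard time string.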
Referring to Fig. 6, when the variable type is regular expression, the matching-replacement method comprises:
Start identifying from the first position of the data to be desensitized and continue until the last position of the data has been examined;
When data matching the regular expression is found, replace that data with a preset desensitization symbol;
Output the replaced data.
Specifically, handling a regular expression is also a form of desensitization, but because the target is not a fixed value, matching and replacement must proceed step by step according to the regular expression. Desensitization therefore starts identifying from the head of the data to be desensitized and, when it encounters data that matches the regular expression, starts replacing. After the replacement it continues traversing the subsequent data until the last position of the data to be desensitized has been examined.
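The scan-and-replace traversal described above corresponds closely to what Python's `re.sub` does: it walks the input from left to right and replaces each non-overlapping match. The sketch below assumes the desensitization symbol is repeated to the length of each match; the function name `mask_regex` is hypothetical.

```python
import re


def mask_regex(text: str, pattern: str, symbol: str = "*") -> str:
    """Replace every match of `pattern` with `symbol`, preserving length."""
    return re.sub(pattern, lambda m: symbol * len(m.group(0)), text)


masked_numbers = mask_regex("call 13812345678 or 13987654321", r"\d{11}")
```

Data that does not match the pattern is left untouched, so the output keeps the original layout with only the matched spans hidden.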
For brevity, where this embodiment of the method does not mention a detail, refer to the corresponding content in the preceding method embodiments.
Embodiment 3:
On the basis of the other embodiments, Embodiment 3 adds the following:
Referring to Fig. 7, dividing the data to be desensitized into groups using the improved k-means algorithm, and adding Laplace noise to desensitize the grouped data, specifically comprises:
Set the cluster centers; for example, call Kmeans.setMax(k) to set k;
Obtain a data set containing the statistical classification results and calculate the mean vector uj of the data set; for example, let Sdata = Kmeans.loadData(hashmap) to perform statistical classification;
Calculate the distance between each data vector in the data set and the mean vector, and define it as the current distance cj;
If the current distance is less than a preset minimum distance, update the minimum distance to the current distance;
Regroup the data to be desensitized, save the newly obtained mean vectors, and count the number of mean vectors;
Add Laplace noise according to each group, and apply the calculation to that group;
Output the data to be desensitized after the calculation.
Specifically, the method first sets the cluster centers and then reads the data set of statistical classification results. After obtaining the data set, it calculates the distance between each data item and the mean vector. After the calculation, it updates the mean vectors and regroups the data. The mean of each resulting group and the number of members in the group are saved; each such pair is called a record entry. Each record entry is then processed by adding Laplace noise to the data entry.
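The grouping and noise steps can be sketched with the standard library alone. The patent does not describe what its improvement to k-means consists of, so the sketch substitutes plain Lloyd-style k-means on one-dimensional values; it also assumes that noise is added to each group mean and that the noise scale is a free parameter. The names `kmeans_1d`, `laplace_noise` and `desensitize_groups` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], k: int, iterations: int = 20,
              seed: int = 0) -> Tuple[List[float], List[List[float]]]:
    """Plain k-means on scalars; returns group means and group members."""
    rng = random.Random(seed)
    centers = rng.sample(list(values), k)      # initial cluster centers
    groups: List[List[float]] = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            # Assign each value to the center at minimum distance.
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Update each mean; an empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups


def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two exponential draws is Laplace(0, scale)."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def desensitize_groups(values: Sequence[float], k: int, scale: float,
                       seed: int = 0) -> List[Tuple[float, int]]:
    """Return one record entry (noisy mean, member count) per group."""
    rng = random.Random(seed)
    centers, groups = kmeans_1d(values, k, seed=seed)
    return [(c + laplace_noise(scale, rng), len(g))
            for c, g in zip(centers, groups) if g]


records = desensitize_groups([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2, scale=0.1)
```

Each record entry pairs a perturbed group mean with the group size, following the description of a record entry above. A production version on Spark would distribute the assignment step, which this single-process sketch does not attempt.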
For brevity, where this embodiment of the method does not mention a detail, refer to the corresponding content in the preceding method embodiments.
Embodiment 4:
A desensitization system based on sensitive data, referring to Fig. 8, comprises:
An acquisition unit, configured to obtain data to be desensitized;
A k-means analysis unit, configured to, when the volume of the data to be desensitized meets the k-means algorithm threshold, divide the data to be desensitized into groups using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
A matching-replacement unit, configured to, when the volume of the data to be desensitized meets the matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method comprises replacing the data to be desensitized according to a preset variable type, on a keyword basis.
Preferably, the variable types include numeric value, character string, time and regular expression;
The matching-replacement unit is specifically configured as follows:
When the variable type is numeric value, the matching-replacement method comprises:
Extract the numeric characters of the data to be desensitized;
Calculate the numeric length of the numeric characters and perform out-of-range handling to obtain an initial value;
Convert the initial value into a character string;
According to the numeric length and a preset numeric desensitization range, convert the designated characters in the character string into mask characters;
Digitize the converted character string and output it;
When the variable type is character string, the matching-replacement method comprises:
Extract the character string of the data to be desensitized;
When part of the character string falls within a preset string desensitization range, replace that part with a mask and output the result;
When the variable type is time, the matching-replacement method comprises:
Extract the time information of the data to be desensitized;
If the time information is valid, convert it into a preset seconds-based time format to obtain initial time information;
When part of the initial time information falls within a preset time desensitization range, replace that part with 0;
Convert the replaced time information into a preset standard time and output it;
When the variable type is regular expression, the matching-replacement method comprises:
Start identifying from the first position of the data to be desensitized and continue until the last position of the data has been examined;
When data matching the regular expression is found, replace that data with a preset desensitization symbol;
Output the replaced data.
Preferably, the k-means analysis unit is specifically configured to:
Set the cluster centers;
Obtain a data set containing the statistical classification results and calculate the mean vector of the data set;
Calculate the distance between each data vector in the data set and the mean vector, and define it as the current distance;
If the current distance is less than a preset minimum distance, update the minimum distance to the current distance;
Regroup the data to be desensitized, save the newly obtained mean vectors, and count the number of mean vectors;
Add Laplace noise according to each group, and apply the calculation to that group;
Output the data to be desensitized after the calculation.
The system realizes data desensitization by having the matching-replacement algorithm work together with the approach of grouping by the improved k-means algorithm and then adding Laplace noise. By combining the two approaches, the system is flexibly configurable and supports multiple data sources and multiple data desensitization algorithms.
For brevity, where this embodiment of the system does not mention a detail, refer to the corresponding content in the preceding method embodiments.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not take the essence of the corresponding technical solutions outside the scope of the technical solutions of the embodiments of the invention, and should all be covered by the claims and the description of the invention.
Claims (10)
1. A desensitization method based on sensitive data, characterized by:
Obtaining data to be desensitized;
When the volume of the data to be desensitized meets the k-means algorithm threshold, dividing the data to be desensitized into groups using an improved k-means algorithm, and adding Laplace noise to desensitize the grouped data;
When the volume of the data to be desensitized meets the matching-replacement threshold, desensitizing the data using a matching-replacement method, wherein the matching-replacement method comprises replacing the data to be desensitized according to a preset variable type, on a keyword basis.
2. according to claim 1 based on the desensitization method of sensitive data, which is characterized in that
The types of variables includes numerical value, character string, time and regular expression.
3. according to claim 2 based on the desensitization method of sensitive data, which is characterized in that
When types of variables is numerical value, the matching replacement method includes:
Extract the numerical chracter of the data to be desensitized;
Calculate the numerical value length of the numerical chracter, line overrun of going forward side by side processing, to obtain initial value;
Character string is converted by the initial value;
According to the numerical value length and preset numerical value desensitization range, mask is converted by designated character in character string;
After character string after conversion is digitized, output.
4. The desensitization method based on sensitive data according to claim 2, characterized in that
when the variable type is a character string, the matching replacement method comprises:
extracting the character string of the data to be desensitized;
when part of the character string falls within a preset string desensitization range, replacing the content that falls within the string desensitization range with a mask, and outputting the result.
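A minimal sketch of the character string case in claim 4, assuming the "preset string desensitization range" is a 0-based `[start, end)` character range and the mask is `*`. The name `mask_string` and both of those conventions are hypothetical, since the claim leaves them open.

```python
def mask_string(text: str, start: int, end: int, mask_char: str = "*") -> str:
    """Replace the characters of text in [start, end) with mask_char.

    If no part of the string falls inside the range, the string is
    returned unchanged.
    """
    lo = max(0, min(start, len(text)))   # clamp the range to the string
    hi = max(lo, min(end, len(text)))
    if lo == hi:                         # nothing falls within the range
        return text
    return text[:lo] + mask_char * (hi - lo) + text[hi:]


print(mask_string("Zhang San", 1, 5))
```

Clamping the range means a short string is masked only where it overlaps the configured range, which matches the "part of the character string" wording of the claim.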
5. The desensitization method based on sensitive data according to claim 2, characterized in that
when the variable type is time, the matching replacement method comprises:
extracting the time information of the data to be desensitized;
if the time information is valid, converting the time information into a preset hour-to-second format to obtain initial time information;
when part of the initial time information falls within a preset time desensitization range, replacing the content that falls within the time desensitization range with 0;
converting the replaced time information into a preset standard time and outputting it.
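The time case of claim 5 might look like the sketch below. The format string `STANDARD_FORMAT`, the function name `mask_time`, the choice to express the desensitization range as field names (`hour`, `minute`, `second`), and returning `None` for invalid input are all assumptions made for illustration; the claim only says that the in-range content is replaced with 0 and the result is output as a preset standard time.

```python
from datetime import datetime
from typing import Iterable, Optional

# Assumed "preset standard time" format; the claim does not name one.
STANDARD_FORMAT = "%Y-%m-%d %H:%M:%S"


def mask_time(raw: str, fields: Iterable[str] = ("minute", "second"),
              fmt: str = STANDARD_FORMAT) -> Optional[str]:
    """Zero out the selected time fields of a timestamp string."""
    try:
        parsed = datetime.strptime(raw, fmt)   # validity check
    except ValueError:
        return None                            # invalid time information
    allowed = {"hour", "minute", "second"}
    zeroed = {name: 0 for name in fields if name in allowed}
    # Replace the in-range fields with 0 and emit the standard format.
    return parsed.replace(**zeroed).strftime(fmt)


print(mask_time("2019-06-05 14:37:52"))
```

Restricting the maskable fields to hour, minute and second keeps the result a valid date, because year, month and day cannot legally be set to 0 in `datetime`.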
6. The desensitization method based on sensitive data according to claim 2, characterized in that
when the variable type is a regular expression, the matching replacement method comprises:
scanning the data to be desensitized from its first character until its last character has been recognized;
when data matching the regular expression is found, replacing that data with a preset desensitization symbol;
outputting the replaced data to be desensitized.
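The regular expression case of claim 6 maps naturally onto Python's `re.sub`, which scans the input from start to end and replaces every match. In this sketch the function name `mask_by_regex`, the default symbol `***`, and the example pattern for an 11-digit number are hypothetical.

```python
import re


def mask_by_regex(text: str, pattern: str, symbol: str = "***") -> str:
    """Replace every substring of text that matches pattern with symbol."""
    # A callable replacement keeps symbol literal, so backslashes or
    # group references inside it are not interpreted by re.sub.
    return re.sub(pattern, lambda match: symbol, text)


print(mask_by_regex("call 13812345678 now", r"\d{11}"))
```

If nothing in the text matches the pattern, the input is returned unchanged.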
7. The desensitization method based on sensitive data according to any one of claims 1-6, characterized in that
grouping the data to be desensitized using the improved k-means algorithm and adding Laplace noise to desensitize the grouped data specifically comprises:
setting cluster centers;
obtaining a data set containing statistical classification results, and calculating the mean vector of the data set;
calculating the distance between each data vector in the data set and the mean vector, defined as the current distance;
if the current distance is less than a preset minimum distance, updating the minimum distance to the current distance;
regrouping the data to be desensitized, saving the newly obtained mean vectors, and counting the number of mean vectors;
adding Laplace noise to each group and performing the calculation on that group;
outputting the calculated data to be desensitized.
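The grouping and noise steps of claim 7 can be illustrated with the sketch below. It uses plain Lloyd-style k-means on one-dimensional values, not the improved variant the claim refers to, and it publishes each record as its group mean plus Laplace noise. The function names, the `epsilon` and `sensitivity` parameters, and the noise scale `sensitivity / epsilon` are assumptions borrowed from the usual differential privacy convention; none of them is stated in the claims.

```python
import random
from statistics import fmean


def kmeans_1d(values, k, iterations=20, rng=None):
    """Standard k-means on scalars; returns (labels, centers)."""
    rng = rng or random.Random(0)
    centers = rng.sample(list(values), k)          # set cluster centers

    def assign():
        # Each value joins the center at the minimum distance.
        return [min(range(k), key=lambda j: abs(v - centers[j]))
                for v in values]

    for _ in range(iterations):
        labels = assign()
        for j in range(k):                         # recompute group means
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = fmean(members)
    return assign(), centers


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def desensitize(values, k, epsilon, sensitivity=1.0, rng=None):
    """Replace every value by its noisy group mean (assumed output form)."""
    rng = rng or random.Random(0)
    labels, centers = kmeans_1d(values, k, rng=rng)
    scale = sensitivity / epsilon                  # assumed noise scale
    noisy = [c + laplace_noise(scale, rng) for c in centers]
    return [noisy[lab] for lab in labels]


data = [1.0, 1.2, 0.9, 10.0, 10.5, 9.8]
print(desensitize(data, k=2, epsilon=1.0))
```

A smaller `epsilon` gives a larger noise scale and therefore stronger masking at the cost of accuracy; records in the same group share one published value, so individual values inside a group are not disclosed.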
8. A desensitization system based on sensitive data, characterized by comprising:
an acquisition unit, configured to obtain data to be desensitized;
a k-means analysis unit, configured to, when the data volume of the data to be desensitized meets a k-means algorithm threshold, group the data to be desensitized using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
a matching replacement unit, configured to, when the data volume of the data to be desensitized meets a matching replacement threshold, desensitize the data to be desensitized using a matching replacement method; wherein the matching replacement method comprises replacing the data to be desensitized according to a preset variable type, on a keyword basis.
9. The desensitization system based on sensitive data according to claim 8, characterized in that
the variable types include numeric value, character string, time and regular expression;
the matching replacement unit is specifically configured as follows:
when the variable type is a numeric value, the matching replacement method comprises:
extracting the numeric characters of the data to be desensitized;
calculating the numeric length of the numeric characters and performing out-of-range processing to obtain an initial value;
converting the initial value into a character string;
converting designated characters in the character string into a mask according to the numeric length and a preset numeric desensitization range;
converting the resulting character string back into a number and outputting it;
when the variable type is a character string, the matching replacement method comprises:
extracting the character string of the data to be desensitized;
when part of the character string falls within a preset string desensitization range, replacing the content that falls within the string desensitization range with a mask, and outputting the result;
when the variable type is time, the matching replacement method comprises:
extracting the time information of the data to be desensitized;
if the time information is valid, converting the time information into a preset hour-to-second format to obtain initial time information;
when part of the initial time information falls within a preset time desensitization range, replacing the content that falls within the time desensitization range with 0;
converting the replaced time information into a preset standard time and outputting it;
when the variable type is a regular expression, the matching replacement method comprises:
scanning the data to be desensitized from its first character until its last character has been recognized;
when data matching the regular expression is found, replacing that data with a preset desensitization symbol;
outputting the replaced data to be desensitized.
10. The desensitization system based on sensitive data according to claim 8 or 9, characterized in that
the k-means analysis unit is specifically configured for:
setting cluster centers;
obtaining a data set containing statistical classification results, and calculating the mean vector of the data set;
calculating the distance between each data vector in the data set and the mean vector, defined as the current distance;
if the current distance is less than a preset minimum distance, updating the minimum distance to the current distance;
regrouping the data to be desensitized, saving the newly obtained mean vectors, and counting the number of mean vectors;
adding Laplace noise to each group and performing the calculation on that group;
outputting the calculated data to be desensitized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910486536.1A CN110188571A (en) | 2019-06-05 | 2019-06-05 | Desensitization method and system based on sensitive data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910486536.1A CN110188571A (en) | 2019-06-05 | 2019-06-05 | Desensitization method and system based on sensitive data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110188571A true CN110188571A (en) | 2019-08-30 |
Family
ID=67720500
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910486536.1A Pending CN110188571A (en) | 2019-06-05 | 2019-06-05 | Desensitization method and system based on sensitive data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188571A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110795751A (en) * | 2019-10-30 | 2020-02-14 | 浪潮云信息技术有限公司 | Method for carrying out safety protection on sensitive data through natural language analysis |
CN111563272A (en) * | 2020-04-30 | 2020-08-21 | 支付宝实验室(新加坡)有限公司 | Information statistical method and device |
WO2021042918A1 (en) * | 2019-09-02 | 2021-03-11 | 深圳壹账通智能科技有限公司 | Safe desensitization method and apparatus based on time and date data and computer device |
CN112632606A (en) * | 2020-12-23 | 2021-04-09 | 天津理工大学 | SNOMED-CT-based medical text document desensitization method and system |
CN115422594A (en) * | 2022-09-20 | 2022-12-02 | 成都比特信安科技有限公司 | Method for realizing data desensitization by using matrix replacement |
CN116205236A (en) * | 2023-05-06 | 2023-06-02 | 四川三合力通科技发展集团有限公司 | Data rapid desensitization system and method based on entity naming identification |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107480549A (en) * | 2017-06-28 | 2017-12-15 | 银江股份有限公司 | Data-sharing-oriented sensitive information desensitization method and system |
CN107766577A (en) * | 2017-11-15 | 2018-03-06 | 北京百度网讯科技有限公司 | Public opinion monitoring method, device, equipment and storage medium |
CN108512807A (en) * | 2017-02-24 | 2018-09-07 | 中国移动通信集团公司 | Data desensitization method and data desensitization server for data transmission |
CN108984588A (en) * | 2018-05-28 | 2018-12-11 | 国政通科技股份有限公司 | Data processing method and device |
-
2019
- 2019-06-05 CN CN201910486536.1A patent/CN110188571A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108512807A (en) * | 2017-02-24 | 2018-09-07 | 中国移动通信集团公司 | Data desensitization method and data desensitization server for data transmission |
CN107480549A (en) * | 2017-06-28 | 2017-12-15 | 银江股份有限公司 | Data-sharing-oriented sensitive information desensitization method and system |
CN107766577A (en) * | 2017-11-15 | 2018-03-06 | 北京百度网讯科技有限公司 | Public opinion monitoring method, device, equipment and storage medium |
CN108984588A (en) * | 2018-05-28 | 2018-12-11 | 国政通科技股份有限公司 | Data processing method and device |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021042918A1 (en) * | 2019-09-02 | 2021-03-11 | 深圳壹账通智能科技有限公司 | Safe desensitization method and apparatus based on time and date data and computer device |
CN110795751A (en) * | 2019-10-30 | 2020-02-14 | 浪潮云信息技术有限公司 | Method for carrying out safety protection on sensitive data through natural language analysis |
CN111563272A (en) * | 2020-04-30 | 2020-08-21 | 支付宝实验室(新加坡)有限公司 | Information statistical method and device |
WO2021218660A1 (en) * | 2020-04-30 | 2021-11-04 | 支付宝实验室(新加坡)有限公司 | Information statistics |
CN111563272B (en) * | 2020-04-30 | 2021-11-09 | 支付宝实验室(新加坡)有限公司 | Information statistical method and device |
CN112632606A (en) * | 2020-12-23 | 2021-04-09 | 天津理工大学 | SNOMED-CT-based medical text document desensitization method and system |
CN112632606B (en) * | 2020-12-23 | 2022-12-09 | 天津理工大学 | SNOMED-CT-based medical text document desensitization method and system |
CN115422594A (en) * | 2022-09-20 | 2022-12-02 | 成都比特信安科技有限公司 | Method for realizing data desensitization by using matrix replacement |
CN116205236A (en) * | 2023-05-06 | 2023-06-02 | 四川三合力通科技发展集团有限公司 | Data rapid desensitization system and method based on entity naming identification |
CN116205236B (en) * | 2023-05-06 | 2023-08-18 | 四川三合力通科技发展集团有限公司 | Data rapid desensitization system and method based on entity naming identification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188571A (en) | Desensitization method and system based on sensitive data | |
CN107145799A (en) | Data desensitization method and device | |
WO2017084586A1 (en) | Method, system, and device for inferring malicious code rule based on deep learning method | |
CN105917327A (en) | System and method for inputting text into electronic devices | |
CN110188565A (en) | Data desensitization method, device, computer equipment and storage medium | |
CN104077420B (en) | Method and device for importing data into HBase database | |
CN112765991B (en) | Knowledge enhancement-based deep dialogue semantic role labeling method and system | |
CN109389212A (en) | Reconfigurable activation quantization pooling system for low-bit-width convolutional neural networks | |
CN114491525B (en) | Android malicious software detection feature extraction method based on deep reinforcement learning | |
CN110059129A (en) | Data storage method, device and electronic equipment | |
CN110489997A (en) | Sensitive information desensitization method based on pattern matching algorithm | |
CN108829740A (en) | Data storage method and device | |
CN106209366A (en) | Data protection method for a fail-safe computer | |
CN111191008A (en) | Password guessing method based on numerical factor reverse order | |
CN105608197B (en) | Method and system for acquiring Memcache data under high concurrency | |
CN110362343A (en) | N-Gram-based method for detecting bytecode similarity | |
CN102915344A (en) | SQL (structured query language) statement processing method and device | |
CN109600520A (en) | Harassing call number identification method, device and equipment | |
CN110135184A (en) | Static data desensitization method, apparatus, equipment and storage medium | |
CN109800337A (en) | Multi-pattern regular expression matching algorithm suitable for large alphabets | |
CN103336761B (en) | Interference-filtering matching algorithm based on dynamic partitioning and semantic weighting | |
CN108875390A (en) | Community sharing-economy data processing method | |
CN108985759B (en) | Address generating method, system, equipment and storage medium for cryptocurrency | |
CN112765330A (en) | Text data processing method and device, electronic equipment and storage medium | |
CN110457940B (en) | Differential privacy measurement method based on graph theory and mutual information quantity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190830 |