CN110188571A - Desensitization method and system based on sensitive data - Google Patents

Publication number
CN110188571A
Authority
CN
China
Prior art keywords: data, desensitization, desensitized, character string, preset
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910486536.1A
Other languages
Chinese (zh)
Inventor
李适季
周莅涛
施全立
白林
陈天立
张宏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN YOUWANG TECHNOLOGY Co Ltd
Original Assignee
SHENZHEN YOUWANG TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN YOUWANG TECHNOLOGY Co Ltd
Priority to CN201910486536.1A
Publication of CN110188571A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a desensitization method based on sensitive data. Data to be desensitized is obtained. When the data volume of the data to be desensitized meets a k-means algorithm threshold, the data is grouped using an improved k-means algorithm, and Laplace noise is added to desensitize the grouped data. When the data volume of the data to be desensitized meets a matching-replacement threshold, the data is desensitized using a matching-replacement method, wherein the matching-replacement method replaces the data to be desensitized according to a preset variable type, on a keyword basis. The method can be flexibly configured and supports multiple data sources and multiple data desensitization algorithms.

Description

Desensitization method and system based on sensitive data
Technical field
The invention belongs to the technical field of data desensitization, and in particular relates to a desensitization method and system based on sensitive data.
Background
At present, the protection of personal privacy has attracted wide public concern. Not only China but also the European Union and the United States have enacted new legislation to protect personal information. It follows that if the leakage of personal privacy caused by publishing or sharing big data cannot be resolved, data publishers and users will face serious legal risk, which in turn hinders the application and development of big data technology.
To address privacy protection, Samarati and Sweeney first proposed the concept of anonymization in 1998. To achieve anonymization during data sharing, traditional desensitization algorithms were used initially; these protect user privacy by directly applying operations such as shuffling, masking, and uniform generalization to the data.
Generalization anonymizes certain quasi-identifier attributes in the data by replacing a specific value with a value range that describes the attribute. Generalization operations include value generalization and domain generalization. Domain generalization is also called global recoding. Taking a phone number as an example, 88888888 is generalized to 8888888*, which expresses a larger range. It can then be generalized further to 888888**, and so on up to ********. The hierarchy formed by repeatedly generalizing the value domain of an attribute is called the domain generalization hierarchy. The higher the generalization level, the greater the information loss. Value generalization, also called local recoding, generalizes each value in the original attribute domain directly to a value in the generalized domain. Value generalization can likewise be represented by a generalization hierarchy. Compared with domain generalization, value generalization is more flexible and can effectively reduce the information loss that generalization brings.
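The phone-number example above can be sketched in a few lines of Python. This is only an illustration of the generalization hierarchy described in the background; the function names generalize and generalization_hierarchy, and the choice of masking from the right-hand end, are assumptions made for the sketch and do not come from the patent.

```python
from typing import List


def generalize(value: str, level: int, mask: str = "*") -> str:
    """Replace the last `level` characters of `value` with `mask`."""
    if level < 0:
        raise ValueError("level must be non-negative")
    level = min(level, len(value))  # cap at the fully masked value
    return value[: len(value) - level] + mask * level


def generalization_hierarchy(value: str) -> List[str]:
    """Return every level, from the original value to the fully masked one."""
    return [generalize(value, level) for level in range(len(value) + 1)]
```

Each step up the returned list widens the range that the value stands for, which is why information loss grows with the generalization level.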
Suppression can be regarded as the highest level of generalization: the original value is replaced with the most general value, for example by replacing all values in a column with one fixed attribute value. During data anonymization, if some tuples cannot satisfy the anonymity rule, suppression is generally applied. The record containing the suppressed attribute value can be deleted directly from the data table, or the corresponding attribute value can be replaced with a uniform value so that statistical properties are preserved.
However, existing desensitization methods still struggle to meet privacy protection needs in the big data context.
Summary of the invention
In view of the defects in the prior art, the present invention provides a desensitization method and system based on sensitive data that can be flexibly configured and supports multiple data sources and multiple data desensitization algorithms.
In a first aspect, a desensitization method based on sensitive data comprises:
Obtain data to be desensitized;
When the data volume of the data to be desensitized meets a k-means algorithm threshold, group the data to be desensitized using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
When the data volume of the data to be desensitized meets a matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method replaces the data to be desensitized according to a preset variable type, on a keyword basis.
Preferably, the variable types include numeric value, character string, time, and regular expression.
Preferably, when the variable type is a numeric value, the matching-replacement method comprises:
Extract the numeric characters of the data to be desensitized;
Calculate the numeric length of the numeric characters and perform out-of-range handling to obtain an initial value;
Convert the initial value to a character string;
According to the numeric length and a preset numeric desensitization range, convert the designated characters in the character string to a mask;
Digitize the converted character string and output it.
Preferably, when the variable type is a character string, the matching-replacement method comprises:
Extract the character string of the data to be desensitized;
When part of the content of the character string falls within a preset string desensitization range, replace the content that falls within the string desensitization range with a mask and output it.
Preferably, when the variable type is time, the matching-replacement method comprises:
Extract the time information of the data to be desensitized;
If the time information is valid, convert it to a preset seconds-based time format to obtain initial time information;
When part of the content of the initial time information falls within a preset time desensitization range, replace the content that falls within the time desensitization range with 0;
Convert the replaced time information to a preset standard time and output it.
Preferably, when the variable type is a regular expression, the matching-replacement method comprises:
Scan the data to be desensitized from its first position until its last position is reached;
When data matching the regular expression is found, replace that data with a preset desensitization symbol;
Output the replaced data.
Preferably, grouping the data to be desensitized using the improved k-means algorithm and adding Laplace noise to desensitize the grouped data specifically comprises:
Set the cluster centers;
Obtain the data set containing the statistical classification results and calculate the mean vector of the data set;
Calculate the distance between each data vector in the data set and the mean vector, and define it as the current distance;
If the current distance is less than the preset minimum distance, update the minimum distance to the current distance;
Regroup the data to be desensitized, save the newly obtained mean vectors, and count the number of mean vectors;
Add Laplace noise for each group and compute the result for that group;
Output the computed data.
In a second aspect, a desensitization system based on sensitive data comprises:
An acquisition unit, configured to obtain data to be desensitized;
A k-means analysis unit, configured to, when the data volume of the data to be desensitized meets a k-means algorithm threshold, group the data to be desensitized using an improved k-means algorithm and add Laplace noise to desensitize the grouped data;
A matching-replacement unit, configured to, when the data volume of the data to be desensitized meets a matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method replaces the data to be desensitized according to a preset variable type, on a keyword basis.
Preferably, the variable types include numeric value, character string, time, and regular expression;
The matching-replacement unit is specifically configured as follows:
When the variable type is a numeric value, the matching-replacement method comprises:
Extract the numeric characters of the data to be desensitized;
Calculate the numeric length of the numeric characters and perform out-of-range handling to obtain an initial value;
Convert the initial value to a character string;
According to the numeric length and a preset numeric desensitization range, convert the designated characters in the character string to a mask;
Digitize the converted character string and output it;
When the variable type is a character string, the matching-replacement method comprises:
Extract the character string of the data to be desensitized;
When part of the content of the character string falls within a preset string desensitization range, replace the content that falls within the string desensitization range with a mask and output it;
When the variable type is time, the matching-replacement method comprises:
Extract the time information of the data to be desensitized;
If the time information is valid, convert it to a preset seconds-based time format to obtain initial time information;
When part of the content of the initial time information falls within a preset time desensitization range, replace the content that falls within the time desensitization range with 0;
Convert the replaced time information to a preset standard time and output it;
When the variable type is a regular expression, the matching-replacement method comprises:
Scan the data to be desensitized from its first position until its last position is reached;
When data matching the regular expression is found, replace that data with a preset desensitization symbol;
Output the replaced data.
Preferably, the k-means analysis unit is specifically configured to:
Set the cluster centers;
Obtain the data set containing the statistical classification results and calculate the mean vector of the data set;
Calculate the distance between each data vector in the data set and the mean vector, and define it as the current distance;
If the current distance is less than the preset minimum distance, update the minimum distance to the current distance;
Regroup the data to be desensitized, save the newly obtained mean vectors, and count the number of mean vectors;
Add Laplace noise for each group and compute the result for that group;
Output the computed data.
As can be seen from the above technical solution, the desensitization method and system based on sensitive data provided by the invention can be flexibly configured and support multiple data sources and multiple data desensitization algorithms.
Brief description of the drawings
To illustrate the specific embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. In all drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, elements or parts are not necessarily drawn to scale.
Fig. 1 shows the desensitization framework to which Embodiment 1 of the invention applies.
Fig. 2 is a flowchart of the desensitization method based on sensitive data provided by Embodiment 1 of the invention.
Fig. 3 is a flowchart of the matching-replacement method for numeric values provided by Embodiment 2 of the invention.
Fig. 4 is a flowchart of the matching-replacement method for character strings provided by Embodiment 2 of the invention.
Fig. 5 is a flowchart of the matching-replacement method for time provided by Embodiment 2 of the invention.
Fig. 6 is a flowchart of the matching-replacement method for regular expressions provided by Embodiment 2 of the invention.
Fig. 7 is a flowchart of the desensitization method using the improved k-means algorithm provided by Embodiment 3 of the invention.
Fig. 8 is a module block diagram of the desensitization system based on sensitive data provided by Embodiment 4 of the invention.
Detailed description of the embodiments
Embodiments of the technical solution of the invention are described in detail below with reference to the drawings. The following embodiments are only intended to illustrate the technical solution more clearly; they are examples only and cannot be used to limit the scope of protection of the invention. It should be noted that, unless otherwise stated, technical or scientific terms used in this application have the ordinary meaning understood by those skilled in the art to which the invention belongs.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
The desensitization method provided by the invention is applicable to the desensitization framework shown in Fig. 1, which is based on Spark parallel computing.
Embodiment 1:
A desensitization method based on sensitive data, referring to Fig. 2, comprises:
Obtain data to be desensitized;
When the data volume of the data to be desensitized meets a k-means algorithm threshold, group the data to be desensitized using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
Specifically, for scenarios with large data volumes, both the time and the resources consumed by the matching-replacement approach are uncontrollable. Therefore, the data is grouped with the improved k-means algorithm on the Spark framework, and Laplace noise is then added, which achieves faster desensitization. This scheme processes each attribute in two parts: statistical classification and desensitization. Statistical classification counts the attribute values by category. Desensitization then clusters and groups the data according to the statistical results and adds Laplace noise.
When the data volume of the data to be desensitized meets a matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method replaces the data to be desensitized according to a preset variable type, on a keyword basis.
Specifically, the matching-replacement method is suited to desensitizing smaller data volumes. Its effect is relatively stable and highly controllable. Matching replacement can include steps for desensitizing fixed values and regular expressions.
The method provides the desensitization algorithms used inside the desensitization framework of Fig. 1: the matching-replacement algorithm works together with the approach of grouping by the improved k-means algorithm and then adding Laplace noise, and the two jointly implement data desensitization. By combining these two approaches, the method can be flexibly configured and supports multiple data sources and multiple data desensitization algorithms.
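The choice between the two paths in Embodiment 1 can be sketched as a simple dispatch on data volume. The source does not give concrete threshold values or say how the two thresholds relate, so the single cut-off KMEANS_THRESHOLD, the function name choose_strategy, and the returned labels below are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical cut-off; the source does not state concrete threshold values.
KMEANS_THRESHOLD = 100_000


def choose_strategy(record_count: int, kmeans_threshold: int = KMEANS_THRESHOLD) -> str:
    """Pick the desensitization path from the data volume."""
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    if record_count >= kmeans_threshold:
        # Large data: group with k-means, then add Laplace noise.
        return "kmeans_laplace"
    # Small data: keyword-based matching replacement.
    return "matching_replacement"
```

Treating the two thresholds as one cut-off is the simplest reading; a real configuration might leave a gap or an overlap between them.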
Embodiment 2:
Embodiment 2 adds the following to Embodiment 1:
The variable types include numeric value, character string, time, and regular expression.
Referring to Fig. 3, when the variable type is a numeric value, the matching-replacement method comprises:
Extract the numeric characters of the data to be desensitized;
Calculate the numeric length of the numeric characters and perform out-of-range handling to obtain an initial value;
Convert the initial value to a character string;
According to the numeric length and a preset numeric desensitization range, convert the designated characters in the character string to a mask;
Digitize the converted character string and output it.
Specifically, for the numeric type, after the numeric characters are extracted, the designated characters (that is, digits), starting from the middle, are converted to mask characters according to the numeric length (that is, the number of digits) and the numeric desensitization range.
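A minimal sketch of the numeric case follows. It assumes the numeric desensitization range is given as a count of digits to hide, centred on the middle of the number; the name mask_number and its parameters are illustrative and not taken from the patent. Because a masked string is no longer a pure number, the sketch returns a string and leaves the final "digitize" step out.

```python
def mask_number(value: int, mask_count: int, mask_char: str = "*") -> str:
    """Mask `mask_count` digits around the middle of an integer."""
    digits = str(abs(value))                       # numeric characters only
    length = len(digits)                           # numeric length (digit count)
    mask_count = max(0, min(mask_count, length))   # out-of-range handling
    start = (length - mask_count) // 2             # begin masking near the middle
    masked = digits[:start] + mask_char * mask_count + digits[start + mask_count:]
    return ("-" if value < 0 else "") + masked
```

Clamping mask_count to the digit count is one possible reading of the out-of-range handling step; the source does not describe that step in detail.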
Referring to Fig. 4, when the variable type is a character string, the matching-replacement method comprises:
Extract the character string of the data to be desensitized;
When part of the content of the character string falls within a preset string desensitization range, replace the content that falls within the string desensitization range with a mask and output it.
Specifically, for the character string type, the string is first matched against the string desensitization range, and when key characters that need desensitization are found, they are replaced.
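The character string case might look like the sketch below, which assumes the string desensitization range is a preset list of keywords and that each hit is replaced by a mask of the same length. The name mask_keywords and both of those choices are assumptions for illustration.

```python
from typing import Iterable


def mask_keywords(text: str, keywords: Iterable[str], mask_char: str = "*") -> str:
    """Replace every occurrence of each preset keyword with a same-length mask."""
    # Longest keywords first, so a short keyword cannot split a longer one.
    for keyword in sorted(keywords, key=len, reverse=True):
        if keyword:  # skip empty strings, which would match everywhere
            text = text.replace(keyword, mask_char * len(keyword))
    return text
```

Keeping the mask the same length as the keyword preserves the layout of the text; a fixed-length mask would hide the keyword length as well.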
Referring to Fig. 5, when the variable type is time, the matching-replacement method comprises:
Extract the time information of the data to be desensitized;
If the time information is valid, convert it to a preset seconds-based time format to obtain initial time information;
When part of the content of the initial time information falls within a preset time desensitization range, replace the content that falls within the time desensitization range with 0;
Convert the replaced time information to a preset standard time and output it.
Specifically, for the time type, the time information is first converted to the system's standard seconds-based time; part of the data is then replaced with 0 according to the time desensitization range; finally, the result is converted back to standard time by the relevant system function, thereby achieving desensitization.
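The three time steps (convert to seconds, zero out part, convert back) can be sketched with the Python standard library. The input format "%Y-%m-%d %H:%M:%S", the use of UTC, the granularity_seconds parameter, and the choice to return invalid input unchanged are all assumptions of this sketch; the source does not specify them.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed "standard time" format


def desensitize_time(text: str, granularity_seconds: int = 86400) -> str:
    """Zero out the part of a timestamp finer than `granularity_seconds`."""
    if granularity_seconds <= 0:
        raise ValueError("granularity_seconds must be positive")
    try:
        parsed = datetime.strptime(text, TIME_FORMAT).replace(tzinfo=timezone.utc)
    except ValueError:
        return text  # invalid time information: left unchanged (assumption)
    seconds = int(parsed.timestamp())                     # seconds-based time
    truncated = seconds - seconds % granularity_seconds   # set the fine part to 0
    return datetime.fromtimestamp(truncated, tz=timezone.utc).strftime(TIME_FORMAT)
```

With a granularity of 3600 the minutes and seconds become 0; with the default of 86400 the whole time of day becomes 0.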
Referring to Fig. 6, when the variable type is a regular expression, the matching-replacement method comprises:
Scan the data to be desensitized from its first position until its last position is reached;
When data matching the regular expression is found, replace that data with a preset desensitization symbol;
Output the replaced data.
Specifically, handling regular expressions is also a form of desensitization, but because the target is not a fixed value, matching and replacement must proceed step by step according to the regular expression. Desensitization therefore scans from the head of the data to be desensitized and starts replacing when it encounters data that matches the regular expression. After that replacement it continues traversing the subsequent data until the last position of the data to be desensitized is reached.
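Python's re.sub already scans a string from left to right and replaces each non-overlapping match, so the regular expression case can be sketched very compactly. The name mask_pattern and the decision to repeat the desensitization symbol once per matched character are assumptions; the source only says that a preset desensitization symbol replaces the match.

```python
import re


def mask_pattern(text: str, pattern: str, symbol: str = "*") -> str:
    """Replace every regex match in `text` with a same-length run of `symbol`."""
    # re.sub walks the string from the head to the last position,
    # calling the lambda for each non-overlapping match it finds.
    return re.sub(pattern, lambda match: symbol * len(match.group(0)), text)
```

For instance, the pattern r"\d{11}" would mask every run of eleven digits while leaving the surrounding text untouched.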
For brevity, where this embodiment does not mention a detail of the method, refer to the corresponding content in the foregoing method embodiments.
Embodiment 3:
Embodiment 3 adds the following to the other embodiments:
Referring to Fig. 7, grouping the data to be desensitized using the improved k-means algorithm and adding Laplace noise to desensitize the grouped data specifically comprises:
Set the cluster centers; for example, call Kmeans.setMax(k) to set the number k;
Obtain the data set containing the statistical classification results and calculate the mean vector uj of the data set; for example, let Sdata = Kmeans.loadData(hashmap) to perform statistical classification.
Calculate the distance between each data vector in the data set and the mean vector, and define it as the current distance cj;
If the current distance is less than the preset minimum distance, update the minimum distance to the current distance;
Regroup the data to be desensitized, save the newly obtained mean vectors, and count the number of mean vectors;
Add Laplace noise for each group and compute the result for that group;
Output the computed data.
Specifically, the method first sets the cluster centers and then reads the data set of statistical classification results. After obtaining the data set, it calculates the distance between each data vector and the mean vector. Once the calculation is complete, the mean vectors are updated and the data is regrouped. The mean within each resulting group and the count within the group are saved; each such item is called a record entry. Each record entry is then computed by adding Laplace noise to the entry.
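The grouping and noise steps can be sketched in pure Python on one-dimensional data. This is not the patent's improved k-means algorithm, whose details the text does not give; it is plain Lloyd-style k-means with a deterministic initialisation, followed by Laplace noise added to each (mean, count) record entry. The names kmeans_1d, laplace_noise and desensitize_groups, the scale parameter, and the absence of any privacy calibration are all assumptions of the sketch.

```python
import random
from typing import List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], k: int, iterations: int = 20) -> List[List[float]]:
    """Plain Lloyd's k-means on 1-D data; returns the non-empty groups."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    # Deterministic initial centres spread across the sorted data (assumption).
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    groups: List[List[float]] = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Update each centre to its group mean; keep the old centre if empty.
        centres = [sum(g) / len(g) if g else centres[j] for j, g in enumerate(groups)]
    return [g for g in groups if g]


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def desensitize_groups(values: Sequence[float], k: int, scale: float,
                       seed: int = 0) -> List[Tuple[float, float]]:
    """Return one noisy (mean, count) record entry per group."""
    rng = random.Random(seed)
    records = []
    for group in kmeans_1d(values, k):
        mean = sum(group) / len(group)
        records.append((mean + laplace_noise(scale, rng),
                        len(group) + laplace_noise(scale, rng)))
    return records
```

A larger scale hides more about each group at the cost of accuracy; how to choose it for a given privacy requirement is outside what the source describes.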
For brevity, where this embodiment does not mention a detail of the method, refer to the corresponding content in the foregoing method embodiments.
Embodiment 4:
A desensitization system based on sensitive data, referring to Fig. 8, comprises:
An acquisition unit, configured to obtain data to be desensitized;
A k-means analysis unit, configured to, when the data volume of the data to be desensitized meets a k-means algorithm threshold, group the data to be desensitized using an improved k-means algorithm and add Laplace noise to desensitize the grouped data;
A matching-replacement unit, configured to, when the data volume of the data to be desensitized meets a matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method replaces the data to be desensitized according to a preset variable type, on a keyword basis.
Preferably, the variable types include numeric value, character string, time, and regular expression;
The matching-replacement unit is specifically configured as follows:
When the variable type is a numeric value, the matching-replacement method comprises:
Extract the numeric characters of the data to be desensitized;
Calculate the numeric length of the numeric characters and perform out-of-range handling to obtain an initial value;
Convert the initial value to a character string;
According to the numeric length and a preset numeric desensitization range, convert the designated characters in the character string to a mask;
Digitize the converted character string and output it;
When the variable type is a character string, the matching-replacement method comprises:
Extract the character string of the data to be desensitized;
When part of the content of the character string falls within a preset string desensitization range, replace the content that falls within the string desensitization range with a mask and output it;
When the variable type is time, the matching-replacement method comprises:
Extract the time information of the data to be desensitized;
If the time information is valid, convert it to a preset seconds-based time format to obtain initial time information;
When part of the content of the initial time information falls within a preset time desensitization range, replace the content that falls within the time desensitization range with 0;
Convert the replaced time information to a preset standard time and output it;
When the variable type is a regular expression, the matching-replacement method comprises:
Scan the data to be desensitized from its first position until its last position is reached;
When data matching the regular expression is found, replace that data with a preset desensitization symbol;
Output the replaced data.
Preferably, the k-means analysis unit is specifically configured to:
Set the cluster centers;
Obtain the data set containing the statistical classification results and calculate the mean vector of the data set;
Calculate the distance between each data vector in the data set and the mean vector, and define it as the current distance;
If the current distance is less than the preset minimum distance, update the minimum distance to the current distance;
Regroup the data to be desensitized, save the newly obtained mean vectors, and count the number of mean vectors;
Add Laplace noise for each group and compute the result for that group;
Output the computed data.
The system uses the matching-replacement algorithm together with the approach of grouping by the improved k-means algorithm and then adding Laplace noise to jointly implement data desensitization. By combining these two approaches, the system can be flexibly configured and supports multiple data sources and multiple data desensitization algorithms.
For brevity, where this embodiment does not mention a detail of the system, refer to the corresponding content in the foregoing method embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the invention, and all of them should be covered by the scope of the claims and the specification of the invention.

Claims (10)

1. A desensitization method based on sensitive data, characterized in that:
Obtain data to be desensitized;
When the data volume of the data to be desensitized meets a k-means algorithm threshold, group the data to be desensitized using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
When the data volume of the data to be desensitized meets a matching-replacement threshold, desensitize the data using a matching-replacement method, wherein the matching-replacement method replaces the data to be desensitized according to a preset variable type, on a keyword basis.
2. The desensitization method based on sensitive data according to claim 1, characterized in that:
The variable types include numeric value, character string, time, and regular expression.
3. The desensitization method based on sensitive data according to claim 2, characterized in that
when the variable type is numeric value, the matching-replacement method comprises:
extracting the numeric characters of the data to be desensitized;
calculating the numeric length of the numeric characters and performing out-of-range handling to obtain an initial value;
converting the initial value into a character string;
converting designated characters of the character string into a mask according to the numeric length and a preset numeric desensitization range;
converting the resulting character string back into numeric form and outputting it.
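The numeric steps of claim 3 might look like the following minimal sketch. Several points are assumptions rather than details from the claim: the desensitization range is modeled as a half-open index range `[start, end)`, the mask is the digit `0` so the result can be converted back to a number, and out-of-range handling is modeled as truncating to a hypothetical `max_len` digits.

```python
def mask_number(value: int, start: int, end: int,
                mask_digit: str = "0", max_len: int = 18) -> int:
    """Mask the digits of `value` at positions [start, end)."""
    digits = str(abs(value))            # extract the numeric characters
    if len(digits) > max_len:           # out-of-range handling (assumed: truncate)
        digits = digits[:max_len]
    length = len(digits)                # numeric length
    lo = min(max(0, start), length)     # clamp the range to the digit string
    hi = min(max(lo, end), length)
    masked = digits[:lo] + mask_digit * (hi - lo) + digits[hi:]
    result = int(masked)                # convert the string back to a number
    return -result if value < 0 else result


print(mask_number(13812345678, 3, 7))
```

Note that converting back to a number drops any leading zeros produced by the mask, which is inherent in returning a numeric type.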
4. The desensitization method based on sensitive data according to claim 2, characterized in that
when the variable type is character string, the matching-replacement method comprises:
extracting the character string of the data to be desensitized;
when part of the character string falls within a preset string desensitization range, replacing the content that falls within the string desensitization range with a mask, and outputting the result.
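A minimal sketch of the string case in claim 4, assuming the preset string desensitization range is a half-open index range `[start, end)` and the mask is `*`. The function name `mask_string` and both parameters are hypothetical.

```python
def mask_string(text: str, start: int, end: int, mask: str = "*") -> str:
    """Replace the characters of `text` at positions [start, end) with `mask`."""
    lo = min(max(0, start), len(text))   # clamp the range to the string
    hi = min(max(lo, end), len(text))
    # If no part of the string falls in the range, hi == lo and text is unchanged.
    return text[:lo] + mask * (hi - lo) + text[hi:]


print(mask_string("6222020012345678", 4, 12))
```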
5. The desensitization method based on sensitive data according to claim 2, characterized in that
when the variable type is time, the matching-replacement method comprises:
extracting the time information of the data to be desensitized;
if the time information is valid, converting the time information into a preset seconds-based time format to obtain initial time information;
when part of the initial time information falls within a preset time desensitization range, replacing the content that falls within the time desensitization range with 0;
converting the replaced time information into a preset standard time format and outputting it.
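The time steps of claim 5 can be illustrated as below. The sketch assumes that the seconds-based format is POSIX epoch seconds in UTC, that the desensitization range is expressed as a granularity in seconds (everything finer than the granularity is zeroed), and that the standard output format is `%Y-%m-%d %H:%M:%S`. None of these specifics come from the claim.

```python
from datetime import datetime, timezone
from typing import Optional


def mask_time(text: str, granularity_s: int = 3600,
              fmt: str = "%Y-%m-%d %H:%M:%S") -> Optional[str]:
    """Zero out the part of a timestamp finer than `granularity_s` seconds."""
    try:
        parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        return None                       # invalid time information is not processed
    seconds = int(parsed.timestamp())     # seconds-based representation
    seconds -= seconds % granularity_s    # replace the desensitized part with 0
    # Convert back to the standard time format for output.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


print(mask_time("2019-06-05 14:37:52"))         # minutes and seconds zeroed
print(mask_time("2019-06-05 14:37:52", 86400))  # whole time of day zeroed
```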
6. The desensitization method based on sensitive data according to claim 2, characterized in that
when the variable type is regular expression, the matching-replacement method comprises:
scanning the data to be desensitized from its first data item until its last data item has been examined;
when data matching the regular expression is found, replacing that data with a preset desensitization symbol;
outputting the data to be desensitized after replacement.
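The regular-expression case of claim 6 maps closely onto a standard substitution call, which scans the input from start to end and replaces every match. In this sketch the desensitization symbol `***` and the example pattern are assumptions chosen for illustration.

```python
import re


def mask_regex(text: str, pattern: str, symbol: str = "***") -> str:
    """Replace every substring of `text` matching `pattern` with `symbol`."""
    # re.sub scans left to right and replaces each non-overlapping match.
    # The lambda keeps `symbol` literal (no backslash escapes are interpreted).
    return re.sub(pattern, lambda _match: symbol, text)


print(mask_regex("call 13812345678 now", r"\d{11}"))
```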
7. The desensitization method based on sensitive data according to any one of claims 1 to 6, characterized in that
dividing the data to be desensitized into groups using the improved k-means algorithm, and adding Laplace noise to desensitize the grouped data, specifically comprises:
setting cluster centres;
obtaining a data set containing statistical classification results, and calculating the mean vector of the data set;
calculating the distance between each data vector in the data set and the mean vector, and defining it as the current distance;
if the current distance is less than a preset minimum distance, updating the minimum distance to the current distance;
regrouping the data to be desensitized, saving the newly obtained mean vectors, and counting the number of mean vectors;
adding Laplace noise for each group and performing the calculation on that group;
outputting the data to be desensitized after the calculation.
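The grouping-plus-noise idea in claim 7 can be sketched with ordinary k-means on one-dimensional values. This is not the improved algorithm of the claim, whose details are not given here; it is plain k-means swapped in for illustration. The release rule (each value is replaced by its group mean plus Laplace noise), the noise scale `sensitivity / epsilon`, and all function and parameter names are assumptions.

```python
import random
from typing import List, Sequence


def laplace(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def nearest(value: float, centers: Sequence[float]) -> int:
    # Index of the closest centre; the "minimum distance" bookkeeping
    # of the claim is folded into min().
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def kmeans_1d(values: Sequence[float], k: int, iters: int = 20,
              seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    centers = rng.sample(list(values), k)          # set initial cluster centres
    for _ in range(iters):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in values:                           # regroup by nearest centre
            groups[nearest(v, centers)].append(v)
        # Save the newly obtained means; keep the old centre for an empty group.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers


def desensitize(values: Sequence[float], k: int = 2, epsilon: float = 1.0,
                sensitivity: float = 1.0, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    centers = kmeans_1d(values, k, seed=seed)
    scale = sensitivity / epsilon                  # assumed noise calibration
    return [centers[nearest(v, centers)] + laplace(scale, rng) for v in values]


print(desensitize([1, 2, 3, 101, 102, 103], k=2, epsilon=1.0))
```

A smaller `epsilon` gives a larger noise scale and therefore stronger perturbation of each released value.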
8. A desensitization system based on sensitive data, characterized by comprising:
an acquisition unit, configured to obtain data to be desensitized;
a k-means analysis unit, configured to, when the data volume of the data to be desensitized meets a k-means algorithm threshold, divide the data to be desensitized into groups using an improved k-means algorithm, and add Laplace noise to desensitize the grouped data;
a matching-replacement unit, configured to, when the data volume of the data to be desensitized meets a matching-replacement threshold, desensitize the data to be desensitized using a matching-replacement method, wherein the matching-replacement method comprises replacing the data to be desensitized on a keyword basis according to a preset variable type.
9. The desensitization system based on sensitive data according to claim 8, characterized in that
the variable types include numeric value, character string, time, and regular expression;
the matching-replacement unit is specifically configured such that:
when the variable type is numeric value, the matching-replacement method comprises:
extracting the numeric characters of the data to be desensitized;
calculating the numeric length of the numeric characters and performing out-of-range handling to obtain an initial value;
converting the initial value into a character string;
converting designated characters of the character string into a mask according to the numeric length and a preset numeric desensitization range;
converting the resulting character string back into numeric form and outputting it;
when the variable type is character string, the matching-replacement method comprises:
extracting the character string of the data to be desensitized;
when part of the character string falls within a preset string desensitization range, replacing the content that falls within the string desensitization range with a mask, and outputting the result;
when the variable type is time, the matching-replacement method comprises:
extracting the time information of the data to be desensitized;
if the time information is valid, converting the time information into a preset seconds-based time format to obtain initial time information;
when part of the initial time information falls within a preset time desensitization range, replacing the content that falls within the time desensitization range with 0;
converting the replaced time information into a preset standard time format and outputting it;
when the variable type is regular expression, the matching-replacement method comprises:
scanning the data to be desensitized from its first data item until its last data item has been examined;
when data matching the regular expression is found, replacing that data with a preset desensitization symbol;
outputting the data to be desensitized after replacement.
10. The desensitization system based on sensitive data according to claim 8 or claim 9, characterized in that
the k-means analysis unit is specifically configured for:
setting cluster centres;
obtaining a data set containing statistical classification results, and calculating the mean vector of the data set;
calculating the distance between each data vector in the data set and the mean vector, and defining it as the current distance;
if the current distance is less than a preset minimum distance, updating the minimum distance to the current distance;
regrouping the data to be desensitized, saving the newly obtained mean vectors, and counting the number of mean vectors;
adding Laplace noise for each group and performing the calculation on that group;
outputting the data to be desensitized after the calculation.
CN201910486536.1A 2019-06-05 2019-06-05 Desensitization method and system based on sensitive data Pending CN110188571A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910486536.1A CN110188571A (en) 2019-06-05 2019-06-05 Desensitization method and system based on sensitive data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910486536.1A CN110188571A (en) 2019-06-05 2019-06-05 Desensitization method and system based on sensitive data

Publications (1)

Publication Number Publication Date
CN110188571A true CN110188571A (en) 2019-08-30

Family

ID=67720500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910486536.1A Pending CN110188571A (en) 2019-06-05 2019-06-05 Desensitization method and system based on sensitive data

Country Status (1)

Country Link
CN (1) CN110188571A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795751A (en) * 2019-10-30 2020-02-14 浪潮云信息技术有限公司 Method for carrying out safety protection on sensitive data through natural language analysis
CN111563272A (en) * 2020-04-30 2020-08-21 支付宝实验室(新加坡)有限公司 Information statistical method and device
WO2021042918A1 (en) * 2019-09-02 2021-03-11 深圳壹账通智能科技有限公司 Safe desensitization method and apparatus based on time and date data and computer device
CN112632606A (en) * 2020-12-23 2021-04-09 天津理工大学 SNOMED-CT-based medical text document desensitization method and system
CN115422594A (en) * 2022-09-20 2022-12-02 成都比特信安科技有限公司 Method for realizing data desensitization by using matrix replacement
CN116205236A (en) * 2023-05-06 2023-06-02 四川三合力通科技发展集团有限公司 Data rapid desensitization system and method based on entity naming identification

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN107766577A (en) * 2017-11-15 2018-03-06 北京百度网讯科技有限公司 A kind of public sentiment monitoring method, device, equipment and storage medium
CN108512807A (en) * 2017-02-24 2018-09-07 中国移动通信集团公司 Data desensitization method and data in a kind of data transmission desensitize server
CN108984588A (en) * 2018-05-28 2018-12-11 国政通科技股份有限公司 A kind of data processing method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108512807A (en) * 2017-02-24 2018-09-07 中国移动通信集团公司 Data desensitization method and data in a kind of data transmission desensitize server
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN107766577A (en) * 2017-11-15 2018-03-06 北京百度网讯科技有限公司 A kind of public sentiment monitoring method, device, equipment and storage medium
CN108984588A (en) * 2018-05-28 2018-12-11 国政通科技股份有限公司 A kind of data processing method and device

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021042918A1 (en) * 2019-09-02 2021-03-11 深圳壹账通智能科技有限公司 Safe desensitization method and apparatus based on time and date data and computer device
CN110795751A (en) * 2019-10-30 2020-02-14 浪潮云信息技术有限公司 Method for carrying out safety protection on sensitive data through natural language analysis
CN111563272A (en) * 2020-04-30 2020-08-21 支付宝实验室(新加坡)有限公司 Information statistical method and device
WO2021218660A1 (en) * 2020-04-30 2021-11-04 支付宝实验室(新加坡)有限公司 Information statistics
CN111563272B (en) * 2020-04-30 2021-11-09 支付宝实验室(新加坡)有限公司 Information statistical method and device
CN112632606A (en) * 2020-12-23 2021-04-09 天津理工大学 SNOMED-CT-based medical text document desensitization method and system
CN112632606B (en) * 2020-12-23 2022-12-09 天津理工大学 SNOMED-CT-based medical text document desensitization method and system
CN115422594A (en) * 2022-09-20 2022-12-02 成都比特信安科技有限公司 Method for realizing data desensitization by using matrix replacement
CN116205236A (en) * 2023-05-06 2023-06-02 四川三合力通科技发展集团有限公司 Data rapid desensitization system and method based on entity naming identification
CN116205236B (en) * 2023-05-06 2023-08-18 四川三合力通科技发展集团有限公司 Data rapid desensitization system and method based on entity naming identification

Similar Documents

Publication Publication Date Title
CN110188571A (en) Desensitization method and system based on sensitive data
CN107145799A (en) A kind of data desensitization method and device
WO2017084586A1 (en) Method , system, and device for inferring malicious code rule based on deep learning method
CN105917327A (en) System and method for inputting text into electronic devices
CN110188565A (en) Data desensitization method, device, computer equipment and storage medium
CN104077420B (en) Method and device for importing data into HBase database
CN112765991B (en) Knowledge enhancement-based deep dialogue semantic role labeling method and system
CN109389212A (en) A kind of restructural activation quantization pond system towards low-bit width convolutional neural networks
CN114491525B (en) Android malicious software detection feature extraction method based on deep reinforcement learning
CN110059129A (en) Date storage method, device and electronic equipment
CN110489997A (en) A kind of sensitive information desensitization method based on pattern matching algorithm
CN108829740A (en) Date storage method and device
CN106209366A (en) A kind of data guard method of fail-safe computer
CN111191008A (en) Password guessing method based on numerical factor reverse order
CN105608197B (en) The acquisition methods and system of Memcache data under a kind of high concurrent
CN110362343A (en) The method of the detection bytecode similarity of N-Gram
CN102915344A (en) SQL (structured query language) statement processing method and device
CN109600520A (en) Harassing call number identification method, device and equipment
CN110135184A (en) A kind of method, apparatus, equipment and the storage medium of static data desensitization
CN109800337A (en) A kind of multi-mode canonical matching algorithm suitable for big alphabet
CN103336761B (en) Matching algorithm is filtered in the interference divided based on dynamic with semantic weighting
CN108875390A (en) A kind of shared economic data processing method in community
CN108985759B (en) Address generating method, system, equipment and storage medium for cryptocurrency
CN112765330A (en) Text data processing method and device, electronic equipment and storage medium
CN110457940B (en) Differential privacy measurement method based on graph theory and mutual information quantity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190830