CN115098887B - Anonymization model recommendation method and device for data value optimization - Google Patents

Anonymization model recommendation method and device for data value optimization

Publication number
CN115098887B
Authority
CN
China
Prior art keywords
data
model
risk
configuration
recommendation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210921066.9A
Other languages
Chinese (zh)
Other versions
CN115098887A (en
Inventor
张罗刚
张宏国
马超
于海宁
孙迎港
颜亭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongshu Shenzhen Times Technology Co ltd
Harbin University of Science and Technology
Original Assignee
Zhongshu Shenzhen Times Technology Co ltd
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongshu Shenzhen Times Technology Co ltd and Harbin University of Science and Technology
Priority to CN202210921066.9A
Publication of CN115098887A
Application granted
Publication of CN115098887B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6263 Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Bioethics (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • General Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an anonymization model recommendation method and device for data value optimization. The method comprises: importing original data and determining a risk threshold for the original data according to its type and level; judging, according to the user's requirements, whether the anonymization method is forward auxiliary recommendation or reverse active recommendation; matching a group of candidate configuration schemes according to the selected method and anonymizing the original data with each of them; performing risk analysis on the anonymized data and retaining the data that satisfies the risk threshold; performing utility analysis on the retained data and selecting the anonymized data with the maximum utility value as output; and adding the result to the historical configuration scheme resource pool. The invention maximizes the value of the data after anonymization while ensuring data security.

Description

Anonymization model recommendation method and device for data value optimization
Technical Field
The application relates to the technical field of data privacy protection, in particular to an anonymization model recommendation method and device for data value optimization.
Background
With the rapid development of technologies such as the internet and cloud computing, data has become the fifth major factor of production, after land, labor, capital, and technology. Data circulation has driven rapid growth in industries such as healthcare and finance, but the risk of privacy disclosure during circulation has risen just as quickly. Data is therefore usually desensitized before it circulates. Desensitization, however, greatly reduces the availability of the data and hence its value. A method is needed that maximizes data availability and preserves as much data value as possible while ensuring data security.
Anonymization is one of the main techniques for addressing privacy disclosure caused by linkage attacks. Existing anonymization methods mainly generalize and suppress the original data so that an attacker cannot identify individuals in the data source. Unlike distortion, perturbation, and randomization, anonymization preserves the truthfulness of the data. Anonymization models come in many types with many parameters, and the parameters are potentially correlated, so parameter configuration is time-consuming and error-prone for the data processor.
Disclosure of Invention
In view of the above, the application provides an anonymization model recommendation method and device for optimizing data value, which can realize the value maintenance of data elements on the premise of carrying out anonymization processing on data and ensuring the data security, so that the data value is maximized.
In one aspect, the present invention provides an anonymization model recommendation method for data value optimization, including:
Importing original data, and determining a risk threshold r_t of the original data according to the type and level of the original data;
Judging, according to the user's requirements, whether the anonymization method is forward auxiliary recommendation or reverse active recommendation;
If the method is forward auxiliary recommendation, acquiring configuration parameters p_0, wherein the configuration parameters p_0 comprise a privacy model, privacy model parameters, a suppression limit, and attribute weights; according to the KNN algorithm, automatically matching a group of candidate configuration parameters in a historical configuration scheme resource pool based on the data feature F, the configuration parameters p_0, and the risk threshold r_t, and recording the acquired configuration parameters p_0 together with the automatically matched candidates as P = [p_0, p_1, p_2, p_3, ..., p_n];
If the method is reverse active recommendation, acquiring a set expected utility value u_t; matching a group of candidate configuration schemes P_s = [p_s1, p_s2, p_s3, ..., p_sn] in the historical configuration scheme resource pool using the K-Means algorithm, based on the data feature F, the risk threshold r_t, and the expected utility value u_t;
Anonymizing the original data with each configuration scheme in the forward auxiliary recommendation candidates P or the reverse active recommendation candidates P_s; the data anonymized with P is recorded as D = [d_0, d_1, d_2, d_3, ..., d_n], and the data anonymized with P_s is recorded as D_s = [d_s1, d_s2, d_s3, ..., d_sn];
Performing risk analysis on each item of anonymized data in D or D_s using the corresponding risk model; the risk analysis result for D is recorded as R = [r_0, r_1, r_2, r_3, ..., r_n], and the result for D_s is recorded as R_s = [r_s0, r_s1, r_s2, r_s3, ..., r_sn];
Comparing each result in R or R_s with r_t and retaining the anonymized data whose risk is smaller than r_t; the retained data corresponding to R is recorded as D' = [d_0', d_1', d_2', d_3', ..., d_n'], and the retained data corresponding to R_s is recorded as D_s' = [d_s0', d_s1', d_s2', d_s3', ..., d_sn'];
Performing utility analysis on the data in D' or D_s' using an accuracy model, a non-uniform entropy model, and a resolution model, wherein the analysis result is the average of the results produced by the three models; the analysis result for D' is recorded as U = [u_1, u_2, u_3, ..., u_n], and the analysis result for D_s' is recorded as U_s = [u_s1, u_s2, u_s3, ..., u_sn];
Comparing the values in U or U_s and selecting the anonymized data corresponding to the maximum value as output; and adding the corresponding risk value r, utility value u, configuration parameters p, and the data feature F of the corresponding original data to the historical configuration scheme resource pool.
In the method, the risk threshold r_t may be an average risk threshold r_avg and/or a highest risk threshold r_h.
In the method, automatically matching a group of candidate configuration parameters in the historical configuration scheme resource pool according to the KNN algorithm, based on the data feature F, the configuration parameters p_0, and the risk threshold r_t, comprises the following steps:
calculating the distance d between the current input (the data feature F, the configuration parameters p_0, and the risk threshold r_t) and each configuration scheme in the historical configuration scheme resource pool;
Sorting the configuration schemes in the historical configuration scheme resource pool in order of increasing distance;
Selecting the K configuration schemes with the smallest distances;
Determining how often each data type occurs among these K configuration schemes, and taking the data type with the highest frequency as the predicted data classification;
using the configuration schemes in the predicted data classification as the group of candidate configuration parameters.
In the method, the risk analysis model includes: a prosecutor risk model, a journalist risk model, and a marketer risk model;
performing risk analysis on each item of anonymized data in D or D_s using the corresponding risk model includes:
selecting the corresponding risk model to perform risk analysis on D or D_s according to the privacy model in the configuration parameters p_0.
In the method, the risk thresholds corresponding to the prosecutor risk model and the journalist risk model comprise an average risk threshold r_avg and a highest risk threshold r_h;
The risk threshold corresponding to the marketer risk model comprises an average risk threshold r_avg.
In the method, matching a group of candidate configuration schemes P_s = [p_s1, p_s2, p_s3, ..., p_sn] in the historical configuration scheme resource pool using the K-Means algorithm, based on the data feature F, the risk threshold r_t, and the expected utility value u_t, includes:
clustering the schemes with similar attribute features in the historical configuration scheme resource pool using the K-Means algorithm;
Calculating, according to the KNN algorithm, the distance d between the current input (the data feature F, the risk threshold r_t, and the expected utility value u_t) and each configuration scheme in the clustered historical configuration scheme resource pool;
sorting the configuration schemes in the clustered historical configuration scheme resource pool in order of increasing distance;
Selecting the K configuration schemes with the smallest distances;
Determining how often each data type occurs among these K configuration schemes, and taking the data type with the highest frequency as the predicted data classification;
using the configuration schemes in the predicted data classification as the group of candidate configuration parameters.
In the method, the data feature F includes: data table field semantic features, data table field type features, attribute type features, and quantity features of corresponding attributes.
In another aspect, the application also provides an anonymization model recommendation device for data value optimization, comprising:
The data importing unit, used for importing the original data and determining a risk threshold r_t of the original data according to the type and level of the original data;
the model recommending unit, used for judging, according to the user's requirements, whether the anonymization method is forward auxiliary recommendation or reverse active recommendation;
The forward auxiliary recommendation unit, used for acquiring configuration parameters p_0 when forward auxiliary recommendation is determined, wherein the configuration parameters p_0 comprise a privacy model, privacy model parameters, a suppression limit, and attribute weights; and for automatically matching, according to the KNN algorithm, a group of candidate configuration parameters in a historical configuration scheme resource pool based on the data feature F, the configuration parameters p_0, and the risk threshold r_t, the acquired configuration parameters p_0 together with the automatically matched candidates being recorded as P = [p_0, p_1, p_2, p_3, ..., p_n];
The reverse active recommendation unit, used for acquiring the set expected utility value u_t when reverse active recommendation is determined; and for matching a group of candidate configuration schemes P_s = [p_s1, p_s2, p_s3, ..., p_sn] in the historical configuration scheme resource pool using the K-Means algorithm, based on the data feature F, the risk threshold r_t, and the expected utility value u_t;
The anonymization processing unit, used for anonymizing the original data with each configuration scheme in the forward auxiliary recommendation candidates P or the reverse active recommendation candidates P_s; the data anonymized with P is recorded as D, and the data anonymized with P_s is recorded as D_s;
The risk analysis unit, used for performing risk analysis on each item of anonymized data using the corresponding risk model; the risk analysis result for D is recorded as R = [r_0, r_1, r_2, r_3, ..., r_n], and the result for D_s is recorded as R_s = [r_s0, r_s1, r_s2, r_s3, ..., r_sn];
and for comparing each result in R or R_s with r_t and retaining the anonymized data whose risk is smaller than r_t, the retained data corresponding to R being recorded as D' and the retained data corresponding to R_s being recorded as D_s';
The utility analysis unit, used for performing utility analysis on the data in D' or D_s' using the accuracy model, the non-uniform entropy model, and the resolution model, wherein the analysis result is the average of the results produced by the three models; the analysis result for D' is recorded as U, and the analysis result for D_s' is recorded as U_s;
The result output unit, used for comparing the values in U or U_s and selecting the anonymized data corresponding to the maximum value as output; and for adding the corresponding risk value r, utility value u, configuration parameters p, and the data feature F of the corresponding original data to the historical configuration scheme resource pool.
In still another aspect, the present invention further proposes an anonymization model recommendation device for data value optimization, including: a processor and a memory;
The memory is used for storing a computer program;
The processor is configured to execute a program in a memory and implement the anonymization model recommendation method for data value optimization according to any of claims 1-7.
The invention also proposes a storage medium for storing at least one set of instructions;
the set of instructions is for being invoked and at least performing the anonymization model recommendation method for data value optimization according to any of claims 1-7.
The method provided by the invention has the advantages that the forward auxiliary recommendation and the reverse active recommendation are adopted, the anonymization model and the related parameter recommendation can be carried out according to different requirements, and the value of the data elements can be maintained on the premise of ensuring the data safety, so that the data value is maximized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it will be apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an embodiment of an anonymization model recommendation method for data value optimization according to the present invention;
FIG. 2 is a schematic diagram of an anonymization model recommendation device for data value optimization;
FIG. 3 is a schematic diagram of an anonymization model recommendation device for data value optimization according to the present invention.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
With the rapid development of technologies such as the internet and cloud computing, data has become the fifth major factor of production, after land, labor, capital, and technology. Data circulation has driven rapid growth in industries such as healthcare and finance, but the risk of privacy disclosure during circulation has risen just as quickly. Data is therefore usually desensitized before it circulates. Desensitization, however, greatly reduces the availability of the data and hence its value. A method is needed that maximizes data availability and preserves as much data value as possible while ensuring data security.
Anonymization is one of the main techniques for addressing privacy disclosure caused by linkage attacks. Existing anonymization methods mainly generalize and suppress the original data so that an attacker cannot identify individuals in the data source. Unlike distortion, perturbation, and randomization, anonymization preserves the truthfulness of the data.
For the data processor, a data anonymization method can preserve the value of data elements: selecting a suitable anonymization model and configuring its parameters reasonably optimizes, as far as possible, the value of the data elements that can be circulated. In anonymization, the value of data is closely tied to its utility, because the measure of data value derives from the value of using the data. Utility is typically measured by the amount of information lost when the original data is anonymized. Besides statistical indicators, utility analysis models can be used for this measurement. The key to optimizing data value is to maximize the utility of the data, that is, to minimize the information lost through anonymization.
Using anonymization technology necessarily involves using an anonymization model. Anonymization models come in many types with many parameters, and the parameters are potentially correlated, so parameter configuration is time-consuming and error-prone for the data processor. Moreover, a single round of parameter configuration rarely optimizes data value while meeting the basic risk protection requirements; repeated iterations of adjusting parameters and executing the anonymization model are usually needed. An intelligent method is therefore urgently needed to help data processors complete data anonymization tasks efficiently.
Therefore, the anonymization model recommending method and the anonymization model recommending equipment for data value optimization can realize the value maintenance of the data elements on the premise of carrying out anonymization processing on the data and guaranteeing the data safety, so that the data value is maximized.
In one aspect of the embodiment of the present application, a recommendation method for anonymizing models for optimizing data value is provided, as shown in fig. 1, including:
S101: importing original data, and determining a risk threshold r_t of the original data according to the type and level of the original data;
S102: judging, according to the user's requirements, whether the anonymization method is forward auxiliary recommendation or reverse active recommendation;
The data processor may enable either the system's forward auxiliary recommendation process or its reverse active recommendation process, depending on its own needs. For example, a data processor who has some domain knowledge of anonymization technology and considerable experience with anonymization models adopts the forward auxiliary recommendation process; otherwise the reverse active recommendation process is adopted;
S103: if the method is forward auxiliary recommendation, acquiring configuration parameters p_0, wherein the configuration parameters p_0 comprise a privacy model, privacy model parameters, a suppression limit, and attribute weights, for example p_0 = {"privacyModel": "kAnonymity", "kValue": 2, "suppressionLimit": 0.2, "weight": 0.5}; according to the KNN algorithm, automatically matching a group of candidate configuration parameters in a historical configuration scheme resource pool based on the data feature F, the configuration parameters p_0, and the risk threshold r_t, and recording the acquired configuration parameters p_0 together with the automatically matched candidates as P = [p_0, p_1, p_2, p_3, ..., p_n]; the historical configuration scheme resource pool mainly stores the configuration parameters and other information of previously used anonymization models; the data feature F represents the characteristics of the data, and the data features of one original data set are of several kinds, including data table field semantic features, data table field type features, attribute type features, quantity features of the corresponding attributes, and the like. For example, F = {"age": {"type": "Integer", "attributeType": "quasi-identifying"}}, where "age" is a data table field semantic feature, "Integer" is a data table field type feature, and "quasi-identifying" is an attribute type. There are four attribute types: "identifying", "quasi-identifying", "sensitive", and "insensitive". The configuration parameters comprise a privacy model, the parameters corresponding to the privacy model, the suppression limit of the data, the attribute weights, and the like.
For example, p = {"privacyModel": "kAnonymity", "kValue": 2, "suppressionLimit": 0.2, "weight": 0.5}, where "kAnonymity" is the privacy model, "kValue" is the privacy model parameter, "suppressionLimit" is the suppression limit, and "weight" is the attribute weight.
S104: if the method is reverse active recommendation, acquiring a set expected utility value u_t; matching a group of candidate configuration schemes P_s = [p_s1, p_s2, p_s3, ..., p_sn] in the historical configuration scheme resource pool using the K-Means algorithm, based on the data feature F, the risk threshold r_t, and the expected utility value u_t;
S105: anonymizing the original data with each configuration scheme in the forward auxiliary recommendation candidates P or the reverse active recommendation candidates P_s; the data anonymized with P is recorded as D = [d_0, d_1, d_2, d_3, ..., d_n], and the data anonymized with P_s is recorded as D_s = [d_s1, d_s2, d_s3, ..., d_sn];
S106: performing risk analysis on each item of anonymized data in D or D_s using the corresponding risk model; the risk analysis result for D is recorded as R = [r_0, r_1, r_2, r_3, ..., r_n], and the result for D_s is recorded as R_s = [r_s0, r_s1, r_s2, r_s3, ..., r_sn];
S107: comparing each result in R or R_s with r_t and retaining the anonymized data whose risk is smaller than r_t; the retained data corresponding to R is recorded as D' = [d_0', d_1', d_2', d_3', ..., d_n'], and the retained data corresponding to R_s is recorded as D_s' = [d_s0', d_s1', d_s2', d_s3', ..., d_sn'];
S108: performing utility analysis on the data in D' or D_s' using an accuracy model, a non-uniform entropy model, and a resolution model, wherein the analysis result is the average of the results produced by the three models; the analysis result for D' is recorded as U = [u_1, u_2, u_3, ..., u_n], and the analysis result for D_s' is recorded as U_s = [u_s1, u_s2, u_s3, ..., u_sn];
S109: comparing the values in U or U_s and selecting the anonymized data corresponding to the maximum value as output; and adding the corresponding risk value r, utility value u, configuration parameters p, and the data feature F of the corresponding original data to the historical configuration scheme resource pool.
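Steps S105 to S109 can be read as a filter-then-select loop over the candidate configurations. The following is a minimal Python sketch of that reading, not the patented implementation. The function name `recommend`, the three callables `anonymize`, `risk_of`, and `utility_of`, and the dictionary layout of a pool entry are all assumptions introduced for illustration; the source does not define any of them.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Config = Dict[str, Any]


def recommend(
    original: Any,
    candidates: List[Config],
    r_t: float,
    anonymize: Callable[[Any, Config], Any],
    risk_of: Callable[[Any, Config], float],
    utility_of: Callable[[Any], float],
    F: Optional[Dict[str, Any]] = None,
    pool: Optional[List[Dict[str, Any]]] = None,
) -> Optional[Tuple[Config, Any, float, float]]:
    """Return (config, anonymized data, risk, utility) for the candidate with
    the highest utility among those whose risk is below r_t, or None if no
    candidate passes the risk check. All callables are hypothetical hooks."""
    best: Optional[Tuple[Config, Any, float, float]] = None
    for p in candidates:
        d = anonymize(original, p)           # S105: anonymize with this config
        r = risk_of(d, p)                    # S106: risk analysis
        if r >= r_t:                         # S107: keep only risk below r_t
            continue
        u = utility_of(d)                    # S108: utility analysis
        if best is None or u > best[3]:      # S109: track the maximum utility
            best = (p, d, r, u)
    if best is not None and pool is not None:
        # S109: record the winning scheme in the historical resource pool.
        pool.append({"r": best[2], "u": best[3], "p": best[0], "F": F})
    return best
```

Passing the anonymization, risk, and utility steps in as callables keeps the sketch independent of any particular anonymization library.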
The utility of the data is generally calculated with the following three utility analysis models: the accuracy model (Precision), the non-uniform entropy model (NUEntropy), and the resolution model (Discernibility).
The Precision criterion measures the generalization intensity of attribute values in a data set according to the generalization hierarchy of each attribute: the lower the generalization intensity, the higher the precision, that is, the higher the availability of the data. The NUEntropy method quantifies differences in the distribution of attribute values: the smaller the differences, the higher the data availability. Discernibility measures the size of groups of indistinguishable records: the higher the resolution, the higher the data availability. The results calculated with the three utility analysis models are recorded as Pre, NUEntropy, and Disc respectively; the higher these three values, the higher the availability of the data. Finally, the average of the three is taken as the criterion for data availability, as shown in the following formula:
u = (Pre + NUEntropy + Disc) / 3
Finding the anonymized data that corresponds to the maximum utility value in U therefore identifies the data with the highest value.
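The averaging and selection described above amount to a few lines of arithmetic. The sketch below assumes that each of the three scores has already been normalized so that a higher value means higher availability, as the text states; the function names `combined_utility` and `best_index` are hypothetical.

```python
from typing import Sequence


def combined_utility(pre: float, nu_entropy: float, disc: float) -> float:
    """Average of the Precision, non-uniform entropy, and Discernibility scores."""
    return (pre + nu_entropy + disc) / 3.0


def best_index(utilities: Sequence[float]) -> int:
    """Index of the largest utility value; the first one wins on ties."""
    if not utilities:
        raise ValueError("no candidate passed the risk check")
    return max(range(len(utilities)), key=lambda i: utilities[i])
```

For instance, scores of 0.9, 0.6, and 0.3 average to roughly 0.6, and `best_index` returns the position of the candidate whose anonymized data would be output.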
In the method, the risk threshold r_t may be an average risk threshold r_avg and/or a highest risk threshold r_h.
In the method, automatically matching a group of candidate configuration parameters in the historical configuration scheme resource pool according to the KNN algorithm, based on the data feature F, the configuration parameters p_0, and the risk threshold r_t, comprises the following steps:
calculating the distance d between the current input (the data feature F, the configuration parameters p_0, and the risk threshold r_t) and each configuration scheme in the historical configuration scheme resource pool;
Sorting the configuration schemes in the historical configuration scheme resource pool in order of increasing distance;
Selecting the K configuration schemes with the smallest distances;
Determining how often each data type occurs among these K configuration schemes, and taking the data type with the highest frequency as the predicted data classification;
using the configuration schemes in the predicted data classification as the group of candidate configuration parameters.
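The KNN matching steps above can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: the source does not say how F, p_0, and r_t are encoded or which distance is used, so the sketch assumes each pool entry already carries a numeric `"vector"` and a `"data_type"` label (both hypothetical field names) and uses Euclidean distance. It also assumes that "the configuration schemes in the predicted data classification" means every pool entry with the winning data type, returned nearest first.

```python
import math
from collections import Counter
from typing import Any, Dict, List, Sequence


def knn_candidates(
    query: Sequence[float], pool: List[Dict[str, Any]], k: int
) -> List[Dict[str, Any]]:
    """Return the pool entries whose data type wins the vote among the k
    entries nearest to the query vector, ordered by increasing distance."""
    if k < 1 or not pool:
        raise ValueError("k must be positive and the pool must not be empty")
    # Compute distances and sort in increasing order.
    ranked = sorted(pool, key=lambda e: math.dist(query, e["vector"]))
    # Keep the K nearest schemes and count how often each data type occurs.
    votes = Counter(e["data_type"] for e in ranked[:k])
    predicted, _ = votes.most_common(1)[0]
    # Use the schemes in the predicted classification as candidates.
    return [e for e in ranked if e["data_type"] == predicted]
```

A real system would need a feature encoding that puts F, p_0, and r_t on comparable scales before any distance is meaningful; that encoding is outside what the source describes.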
In the method, the risk analysis model includes: a inspector risk model, a reporter risk model, and a marketer risk model;
the anonymizing the processed data, sequentially using a corresponding risk model to perform risk analysis on the D or the D s, including:
and selecting a corresponding risk model to perform risk analysis on D or D s according to the privacy model in the configuration parameters p 0.
In the inspector risk model, the attacker targets a particular individual, and it is assumed that the attacker already knows that data about this individual is contained in the data set. In the reporter risk model, the attacker also targets a particular individual, but it is assumed that the attacker lacks such background knowledge about the individual. In the marketer risk model, the attacker does not target a particular individual; instead, the goal is to re-identify a large amount of personal information, and the attack is only successful if a large number of records are re-identified.
Common privacy models suit different risk scenarios. For example, k-Anonymity, l-Diversity, t-Closeness, δ-Disclosure privacy and β-Likeness are suitable for the inspector risk scenario. δ-Presence and k-Map are suitable for the reporter risk scenario. Average risk and population uniqueness are suitable for the marketer risk scenario.
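The assignment of privacy models to risk scenarios described above amounts to a lookup table. A minimal sketch follows; only the identifier "kAnonymity" and the key "privacyModel" appear in the worked example of this description, and all other identifier spellings are hypothetical.

```python
# Risk scenario for each privacy model, following the grouping in the text.
# Identifier spellings other than "kAnonymity" are assumptions for illustration.
RISK_MODEL_BY_PRIVACY_MODEL = {
    "kAnonymity": "inspector",
    "lDiversity": "inspector",
    "tCloseness": "inspector",
    "deltaDisclosurePrivacy": "inspector",
    "betaLikeness": "inspector",
    "deltaPresence": "reporter",
    "kMap": "reporter",
    "averageRisk": "marketer",
    "populationUniqueness": "marketer",
}

def select_risk_model(config: dict) -> str:
    """Pick the risk model that matches the privacy model named in the config."""
    return RISK_MODEL_BY_PRIVACY_MODEL[config["privacyModel"]]

p0 = {"privacyModel": "kAnonymity", "kValue": 7, "suppressionLimit": 0, "weight": 0.5}
chosen = select_risk_model(p0)
```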
In the method, risk thresholds corresponding to the inspector risk model and the reporter risk model comprise an average risk threshold r avg and a highest risk threshold r h;
The risk threshold corresponding to the marketer risk model includes an average risk threshold r avg, and in each case the value is equal to the inspector average risk threshold or the reporter average risk threshold.
In the method, matching a set of candidate configuration schemes P s = [p s1, p s2, p s3, ……, p sn] in the historical configuration scheme resource pool using the K-Means algorithm, based on the data feature F, the risk threshold r t and the expected utility value u t, comprises:
clustering schemes with similar attribute characteristics in the historical configuration scheme resource pool using the K-Means algorithm; the K-Means algorithm takes k as a parameter and divides n objects into k clusters so that similarity within a cluster is high and similarity between clusters is low. The specific implementation process is as follows: (1) randomly select k schemes with different data characteristics from the historical configuration scheme resource pool as the initial cluster centers; (2) assign each of the remaining n-k schemes to the cluster whose center is closest to it; (3) recalculate the center of each cluster; (4) repeat steps (2) and (3) until the center of each cluster no longer changes;
calculating, according to the KNN algorithm, the distance d between the input, formed from the data feature F, the risk threshold r t and the expected utility value u t, and each configuration scheme in the clustered historical configuration scheme resource pool;
sorting the configuration schemes in the clustered historical configuration scheme resource pool in order of increasing distance;
selecting the K configuration schemes with the smallest distances;
determining the occurrence frequency of each data type among these K configuration schemes, and taking the data type with the highest occurrence frequency as the predicted data class;
using the configuration schemes in the predicted data class as the set of candidate configuration parameters.
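The clustering step, items (1) to (4) above, can be sketched in plain Python. This is a minimal illustration that assumes each scheme has already been encoded as a tuple of floats and that Euclidean distance is used; the function name `k_means` and the `seed` and `max_iter` parameters are hypothetical. Random sampling picks k distinct pool entries but does not by itself guarantee that their data characteristics differ, as the text requires.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster tuples of floats into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # (1) random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                         # (2) assign to the nearest center
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [                          # (3) recompute each center as the mean
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:               # (4) stop once the centers are stable
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(points, k=2)
```

The KNN search of the subsequent steps can then be restricted to the clustered pool, which is the reason the text clusters first.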
In the method, the data feature F includes: data table field semantic features, data table field type features, attribute type features, and quantity features of corresponding attributes.
In summary, the core problem addressed by the application is to preserve as much of the value of the data elements as possible while ensuring data security. To solve this problem, the following mathematical model can be established:
Where u max denotes the maximum utility of the data, F (x) is the utility of the data obtained after anonymization, F denotes the data feature, p denotes the configured parameter, r denotes the re-identification risk of the data calculated using the risk model, and r t denotes the risk threshold.
The specific relationship in formula (1) between the data utility, the anonymization process, the data features and the configuration parameters is shown in formula (2):
Equation (2) is also a formal representation of the historical configuration scheme resource pool: each row of the pool corresponds to one instance of Equation (2). The input values of the formula are the data feature F and the configuration parameter p, and the output value is the utility u of the anonymized data.
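The two formulas themselves are not reproduced in this text. One reading that is consistent with the symbol definitions above, offered only as an assumption, writes the utility function as f(F, p); the symbol f is hypothetical and stands in for the function that the text calls F (x):

```latex
\begin{aligned}
u_{\max} &= \max_{p} \; f(F, p) \quad \text{subject to} \quad r \le r_t && (1) \\
u &= f(F, p) && (2)
\end{aligned}
```

Under this reading, formula (1) selects, among all configurations whose re-identification risk stays within the threshold, the one that yields the highest utility, and formula (2) is the mapping from data features and configuration to utility that the resource pool records row by row. The claims word the risk condition as "smaller than" the threshold, whereas the worked example retains a data set whose highest risk equals the threshold, so the non-strict inequality here is itself an assumption.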
For a better understanding of the method of the present invention, the scheme is illustrated below with a practical example:
A data set contains nine attributes: sex, age, race, marital-status, education, native-country, workclass, occupation and salary-class. The data table is denoted T; T contains 30162 records in total, and all nine attributes are quasi-identifier attributes. There are 5000 existing configurations in the historical configuration scheme resource pool.
If the data processor has sufficient background knowledge of anonymization technology and wants to maximize the value of the anonymized data while ensuring that the anonymized data meets the requirements, the data processor can choose the forward auxiliary recommendation process to anonymize the data set.
The data processor imports the original data and marks its category as 'financial data'. According to the relevant standard for financial data, the highest risk threshold r tmax is 20% and the average risk threshold r tave is 5%.
First, the configured parameters are p 0 = {"privacyModel": "kAnonymity", "kValue": 7, "suppressionLimit": 0, "weight": 0.5}.
Analysis of the data yields its basic features F: the type of "age" is "Integer", and the types of all remaining fields are "String". All fields have the attribute type "quasi-identifying", i.e., the quasi-identifier type. Based on the KNN algorithm, the system automatically recommends a group of parameters P = [p 1, p 2, p 3]:
p 1 = {"privacyModel": "kAnonymity", "kValue": 3, "suppressionLimit": 0, "weight": 0.5},
p 2 = {"privacyModel": "kAnonymity", "kValue": 5, "suppressionLimit": 0, "weight": 0.5},
p 3 = {"privacyModel": "kAnonymity", "kValue": 10, "suppressionLimit": 0, "weight": 0.5}.
Anonymizing the data with these four groups of parameters yields the anonymized data D = [d 0, d 1, d 2, d 3]. Risk analysis is then performed on the data in D using the inspector risk model. Under the inspector risk model, d 0 has an average inspector risk of 1.88825% and a highest inspector risk of 14.28571%, which meets the risk threshold requirement. d 1 has an average inspector risk of 4.74446% and a highest inspector risk of 33.3333%. d 2 has an average inspector risk of 3.51586% and a highest inspector risk of 20%. d 3 has an average inspector risk of 2.22793% and a highest inspector risk of 10%. Except for d 1, the re-identification risk of all the anonymized data does not exceed the risk threshold requirement.
Utility analysis is then performed on the data in D (with d 1 removed), and the measured utility is shown in Table 1.
Table 1 Data utility table
| Data set | Prec | NUE | Disc | U |
| --- | --- | --- | --- | --- |
| d0 | 54.0329% | 33.2189% | 87.6857% | 58.3125% |
| d2 | 56.3543% | 69.0878% | 85.5898% | 70.3440% |
| d3 | 50.8765% | 35.1182% | 80.5739% | 55.5032% |
As can be seen from the data utility table, the data utility of d 2 is the largest, so d 2, the data anonymized with the automatically recommended configuration p 2, is selected as the optimal solution and output.
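The filtering and selection in this example can be replayed with a few lines of Python. The sketch below uses the risk and utility figures quoted in the example; for d3 the smaller figure is taken as the average risk, because an average cannot exceed the maximum. The comparison uses "less than or equal", which is an assumption chosen so that d2, whose highest risk equals the 20% threshold, is retained as in the example. Variable names are hypothetical.

```python
R_AVG_MAX = 5.0    # average risk threshold, percent
R_HIGH_MAX = 20.0  # highest risk threshold, percent

# (average risk, highest risk) per anonymized data set, in percent.
risks = {
    "d0": (1.88825, 14.28571),
    "d1": (4.74446, 33.3333),
    "d2": (3.51586, 20.0),
    "d3": (2.22793, 10.0),
}
# Utility U from Table 1, in percent (d1 is not analysed because it fails the risk check).
utilities = {"d0": 58.3125, "d2": 70.3440, "d3": 55.5032}

# Keep only the data sets that satisfy both thresholds.
retained = [name for name, (avg, high) in risks.items()
            if avg <= R_AVG_MAX and high <= R_HIGH_MAX]
# Select the retained data set with the largest utility.
best = max(retained, key=lambda name: utilities[name])
```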
On the other hand, if the data processor lacks sufficient knowledge of anonymization technology and rarely uses anonymization models, the same data can still be anonymized: in addition to importing the original data and specifying the data type and level, the data processor sets an expected value for the final data utility. Assuming that the data processor requires a utility of at least 65%, the system, based on this expectation and on the features of the analyzed data, first clusters the configuration schemes in the historical configuration scheme resource pool using K-Means and then uses the KNN algorithm to recommend a set of configuration parameters, as shown in Table 2.
Table 2 Configurations after KNN algorithm screening
The data sets calculated with the corresponding configurations are D' = [d' 1, d' 2, d' 3, d' 4]. After utility analysis and risk analysis of all the data sets, the corresponding utility values and risk values are shown in Table 3:
Table 3 Data utility table
| Data set | Prec | NUE | Disc | U | Average risk | Highest risk |
| --- | --- | --- | --- | --- | --- | --- |
| d'1 | 56.3543% | 69.0878% | 85.5898% | 70.3440% | 3.4969% | 20% |
| d'2 | 55.2491% | 35.6888% | 88.1271% | 59.6883% | 2.2174% | 16.6667% |
| d'3 | 54.0329% | 33.2189% | 87.6857% | 58.3125% | 1.8825% | 10% |
| d'4 | 53.7245% | 34.2785% | 85.6344% | 57.8791% | 1.8638% | 12.5% |
As can be seen from Table 3, the risk of all the anonymized data meets the requirements. The data set with the largest utility value U, namely d' 1, is selected as output. It is then stored in the historical configuration scheme resource pool together with the corresponding data features, configuration parameter information, risk information and utility information.
In another aspect, the application also provides an anonymization model recommendation device for data value optimization, as shown in fig. 2, which comprises:
a data importing unit 201, configured to import original data and determine a risk threshold r t of the original data according to the type and level of the original data;
a model recommending unit 202, configured to determine, according to the user requirement, whether the anonymization method is forward auxiliary recommendation or reverse active recommendation;
a forward auxiliary recommendation unit 203, configured to obtain a configuration parameter p 0 when forward auxiliary recommendation is determined, where the configuration parameter p 0 includes a privacy model, privacy model parameters, a suppression limit and attribute weights; and, according to the KNN algorithm, to automatically match a set of candidate configuration parameters in the historical configuration scheme resource pool based on the data feature F, the configuration parameter p 0 and the risk threshold r t, the obtained configuration parameter p 0 and the automatically matched candidate configuration parameters being recorded together as P = [p 0, p 1, p 2, p 3, ……, p n];
a reverse active recommendation unit 204, configured to obtain the set expected utility value u t when reverse active recommendation is determined; and to match a set of candidate configuration schemes P s = [p s1, p s2, p s3, ……, p sn] in the historical configuration scheme resource pool using the K-Means algorithm, based on the data feature F, the risk threshold r t and the expected utility value u t;
an anonymizing processing unit 205, configured to anonymize the original data using each configuration scheme in the candidate configuration parameters P of forward auxiliary recommendation or in the candidate configuration parameters P s of reverse active recommendation; the data obtained by anonymizing the original data with the candidate configuration parameters P is denoted D, and the data obtained by anonymizing the original data with the candidate configuration parameters P s is denoted D s;
a risk analysis unit 206, configured to perform risk analysis on the anonymized data D or D s using the corresponding risk model in turn; the result of the risk analysis on D is denoted R, R = [r 0, r 1, r 2, r 3, ……, r n], and the result of the risk analysis on D s is denoted R s, R s = [r s0, r s1, r s2, r s3, ……, r sn];
and to compare each result in R or R s with r t and retain the anonymized data whose risk is smaller than r t, the retained data corresponding to R being denoted D' and the retained data corresponding to R s being denoted D s';
a utility analysis unit 207, configured to perform utility analysis on the data in D' or D s' using the accuracy model, the non-uniform entropy model and the resolution model, where the analysis result is the average of the results generated by the accuracy model, the non-uniform entropy model and the resolution model; the analysis result of D' is denoted U, and the analysis result of D s' is denoted U s;
a result output unit 208, configured to compare the values in U or U s and select the anonymized data corresponding to the maximum value as output; and to add the corresponding risk value r, utility value u, configuration parameter p and the data feature F of the corresponding original data to the historical configuration scheme resource pool.
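The cooperation of units 205 to 208 can be sketched as a single function that receives the candidate configurations and three callables. This is an illustrative outline only: the names `recommend`, `anonymize`, `risk_of` and `utility_of` are hypothetical, the callables are stand-ins for a real anonymization engine, and a single scalar risk value per data set is assumed for brevity.

```python
def recommend(raw, features, candidates, anonymize, risk_of, utility_of, r_t, pool):
    """Anonymize with every candidate, drop risky results, return the best one."""
    best = None  # (utility, risk, config, data)
    for config in candidates:
        data = anonymize(raw, config)      # anonymizing processing unit
        risk = risk_of(data, config)       # risk analysis unit
        if risk > r_t:                     # discard results above the threshold
            continue
        u = utility_of(data)               # utility analysis unit
        if best is None or u > best[0]:
            best = (u, risk, config, data)
    if best is not None:                   # result output unit: update the pool
        pool.append({"features": features, "config": best[2],
                     "risk": best[1], "utility": best[0]})
    return best

# Toy stand-ins: the risk is 100/k percent and utility falls as k grows.
history_pool = []
configs = [{"kValue": k} for k in (3, 5, 7, 10)]
result = recommend(
    raw="T", features={"n_attributes": 9}, candidates=configs,
    anonymize=lambda raw, cfg: (raw, cfg["kValue"]),
    risk_of=lambda data, cfg: 100.0 / cfg["kValue"],
    utility_of=lambda data: 100.0 - data[1],
    r_t=20.0, pool=history_pool,
)
```

Appending the winning record to the pool is what lets later KNN or K-Means searches benefit from earlier runs.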
In still another aspect, the present invention further proposes an anonymization model recommendation device for data value optimization, as shown in fig. 3, including: a processor 301 and a memory 302;
The memory is used for storing a computer program;
The processor is configured to execute a program in a memory and implement the anonymization model recommendation method for data value optimization according to any of claims 1-7.
The invention also proposes a storage medium for storing at least one set of instructions;
the set of instructions is for being invoked and at least performing the anonymization model recommendation method for data value optimization according to any of claims 1-7.
The method provided by the invention supports both forward auxiliary recommendation and reverse active recommendation, so that an anonymization model and its related parameters can be recommended according to different requirements. The value of the data elements is preserved while data security is ensured, which maximizes the data value.
The foregoing examples are provided for further details of the purpose, technical scheme and beneficial effects of the present application, and it should be understood that the foregoing is only illustrative of the present application and is not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements, etc. based on the technical scheme of the present application should be included in the scope of the present application.

Claims (10)

1. An anonymization model recommendation method for data value optimization, comprising:
Importing original data, and determining a risk threshold r t of the original data according to the type and the level of the original data;
Judging whether the anonymization method is forward auxiliary recommendation or reverse active recommendation according to the user demand;
If forward auxiliary recommendation is determined, acquiring configuration parameters p 0, wherein the configuration parameters p 0 comprise a privacy model, privacy model parameters, a suppression limit and attribute weights; according to the KNN algorithm, in a historical configuration scheme resource pool, a group of candidate configuration parameters is automatically matched based on the data characteristic F, the configuration parameter p 0 and the risk threshold r t, and the obtained configuration parameter p 0 and the automatically matched group of candidate configuration parameters are recorded as P = [p 0, p 1, p 2, p 3, ……, p n];
If reverse active recommendation is determined, acquiring a set expected value u t of the utility; matching a set of candidate configuration schemes P s = [p s1, p s2, p s3, ……, p sn] in a historical configuration scheme resource pool by using a K-Means algorithm based on the data characteristic F, the risk threshold r t and the expected value u t of the utility;
Anonymizing the original data by using the configuration scheme in the candidate configuration parameter P of the forward auxiliary recommendation or the candidate configuration parameter P s of the reverse active recommendation respectively; the data after anonymizing the original data by using the candidate configuration parameter P of forward auxiliary recommendation is marked as D, and the data after anonymizing the original data by using the candidate configuration parameter P s of reverse active recommendation is marked as D s;
Carrying out risk analysis on the anonymized data D or D s by using a corresponding risk model in turn, and marking the result of the risk analysis on D as R, wherein R = [r 0, r 1, r 2, r 3, ……, r n]; the result of the risk analysis on D s is marked as R s, R s = [r s0, r s1, r s2, r s3, ……, r sn];
Comparing each result in R or R s with r t, retaining the anonymized data whose risk is smaller than r t, marking the retained anonymized data corresponding to R as D', and marking the retained anonymized data corresponding to R s as D s';
using an accuracy model, a non-uniform entropy model and a resolution model to perform utility analysis on the data in D ' or D S ', wherein the analysis result is an average value of the results generated by the accuracy model, the non-uniform entropy model and the resolution model, the analysis result of D ' is marked as U, and the analysis result of D S ' is marked as U s;
comparing the values in U or U s, and selecting anonymous data corresponding to the maximum value as output; and adding the corresponding risk value r, utility value u, configuration parameter p and data characteristic F of the corresponding original data into a historical configuration scheme resource pool.
2. The method of claim 1, wherein,
The risk threshold r t may be an average risk threshold r avg and/or a highest risk threshold r h.
3. The method according to claim 1, wherein automatically matching a set of candidate configuration parameters in the historical configuration scheme resource pool based on the data feature F, the configuration parameter p 0, and the risk threshold r t according to the KNN algorithm comprises:
calculating the distance d between each group of configuration schemes in the historical configuration scheme resource pool according to the data characteristic F, the configuration parameter p 0 and the risk threshold r t;
Ordering the configuration schemes in the historical configuration scheme resource pool according to the increasing sequence of the distances;
Selecting K configuration schemes with minimum distances;
Determining the occurrence frequency of the data types of the first K configuration schemes, and classifying the data type with the highest occurrence frequency in the first K configuration schemes as the predicted data of the configuration schemes;
the configuration scheme in the prediction data classification is used as a set of candidate configuration parameters.
4. The method of claim 2, wherein the risk analysis model comprises: an inspector risk model, a reporter risk model, and a marketer risk model;
the anonymizing the processed data, sequentially using a corresponding risk model to perform risk analysis on the D or the D s, including:
and selecting a corresponding risk model to perform risk analysis on D or D s according to the privacy model in the configuration parameters p 0.
5. The method of claim 4, wherein the risk thresholds for the inspector risk model and the reporter risk model include an average risk threshold r avg and a highest risk threshold r h;
The risk threshold corresponding to the marketer risk model comprises an average risk threshold r avg.
6. The method of claim 1, wherein matching a set of candidate configuration schemes P s = [p s1, p s2, p s3, ……, p sn] in the historical configuration scheme resource pool using the K-Means algorithm based on the data characteristic F, the risk threshold r t, and the expected value of utility u t comprises:
clustering schemes with similar attribute characteristics in a historical configuration scheme resource pool by adopting a K-Means algorithm;
calculating, according to a KNN algorithm, the distance d between the input, formed from the data feature F, the risk threshold r t and the expected value u t of the utility, and each group of configuration schemes in the clustered historical configuration scheme resource pool;
ordering the configuration schemes in the history configuration scheme resource pool after the clustering is completed according to the increasing sequence of the distances;
Selecting K configuration schemes with minimum distances;
Determining the occurrence frequency of the data types of the first K configuration schemes, and classifying the data type with the highest occurrence frequency in the first K configuration schemes as the predicted data of the configuration schemes;
the configuration scheme in the prediction data classification is used as a set of candidate configuration parameters.
7. The method of claim 1, wherein the data feature F comprises: data table field semantic features, data table field type features, attribute type features, and quantity features of corresponding attributes.
8. An anonymizing model recommendation device for data value optimization, comprising:
The data importing unit is used for importing the original data and determining a risk threshold r t of the original data according to the type and the level of the original data;
the model recommending unit is used for judging whether the anonymizing method is forward auxiliary recommending or reverse active recommending according to the user demand;
The forward auxiliary recommendation unit is used for acquiring configuration parameters p 0 when forward auxiliary recommendation is determined, wherein the configuration parameters p 0 comprise a privacy model, privacy model parameters, a suppression limit and attribute weights; according to the KNN algorithm, in a historical configuration scheme resource pool, a group of candidate configuration parameters is automatically matched based on the data characteristic F, the configuration parameter p 0 and the risk threshold r t, and the obtained configuration parameter p 0 and the automatically matched group of candidate configuration parameters are recorded as P = [p 0, p 1, p 2, p 3, ……, p n];
The reverse active recommendation unit is used for acquiring the set expected value u t of the utility when reverse active recommendation is determined; matching a set of candidate configuration schemes P s = [p s1, p s2, p s3, ……, p sn] in a historical configuration scheme resource pool by using a K-Means algorithm based on the data characteristic F, the risk threshold r t and the expected value u t of the utility;
The anonymizing processing unit is used for anonymizing the original data by using the configuration scheme in the candidate configuration parameter P of forward auxiliary recommendation or the candidate configuration parameter P s of reverse active recommendation respectively; the data after anonymizing the original data by using the candidate configuration parameter P of forward auxiliary recommendation is marked as D, and the data after anonymizing the original data by using the candidate configuration parameter P s of reverse active recommendation is marked as D s;
The risk analysis unit is used for carrying out risk analysis on the anonymized data D or D s by using a corresponding risk model in turn, and marking the result of the risk analysis on D as R, R = [r 0, r 1, r 2, r 3, ……, r n]; the result of the risk analysis on D s is marked as R s, R s = [r s0, r s1, r s2, r s3, ……, r sn];
Comparing each result in R or R s with r t, retaining the anonymized data whose risk is smaller than r t, marking the retained anonymized data corresponding to R as D', and marking the retained anonymized data corresponding to R s as D s';
The utility analysis unit is used for carrying out utility analysis on the data in D ' or D S ' by using the accuracy model, the non-uniform entropy model and the resolution model, wherein the analysis result is an average value of the results generated by the accuracy model, the non-uniform entropy model and the resolution model, the analysis result of D ' is marked as U, and the analysis result of D S ' is marked as U s;
The result output unit is used for comparing the values in the U or the U s and selecting anonymous data corresponding to the maximum value as output; and adding the corresponding risk value r, utility value u, configuration parameter p and data characteristic F of the corresponding original data into a historical configuration scheme resource pool.
9. An anonymized model recommendation device for data value optimization, comprising: a processor and a memory;
The memory is used for storing a computer program;
The processor is configured to execute a program in a memory and implement the anonymization model recommendation method for data value optimization according to any of claims 1-7.
10. A storage medium for storing at least one set of instructions;
the set of instructions is for being invoked and at least performing the anonymization model recommendation method for data value optimization according to any of claims 1-7.
CN202210921066.9A 2022-08-02 2022-08-02 Anonymization model recommendation method and device for data value optimization Active CN115098887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210921066.9A CN115098887B (en) 2022-08-02 2022-08-02 Anonymization model recommendation method and device for data value optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210921066.9A CN115098887B (en) 2022-08-02 2022-08-02 Anonymization model recommendation method and device for data value optimization

Publications (2)

Publication Number Publication Date
CN115098887A (en) 2022-09-23
CN115098887B (en) 2024-08-06

Family

ID=83299929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210921066.9A Active CN115098887B (en) 2022-08-02 2022-08-02 Anonymization model recommendation method and device for data value optimization

Country Status (1)

Country Link
CN (1) CN115098887B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170087423A (en) * 2016-01-20 2017-07-28 (주)라이앤캐처스 Method for personalized book recommendation, and system of the same
CN113902303A (en) * 2021-10-12 2022-01-07 哈尔滨工业大学 Privacy model automatic recommendation system, algorithm, equipment and storage medium based on user satisfaction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110458663B (en) * 2019-08-06 2020-06-02 上海新共赢信息科技有限公司 Vehicle recommendation method, device, equipment and storage medium
KR102394229B1 (en) * 2020-06-09 2022-05-04 (주)뤼이드 Learning contents recommendation system based on artificial intelligence learning and operation method thereof

Also Published As

Publication number Publication date
CN115098887A (en) 2022-09-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant