CN116992488A - Differential privacy protection method and system - Google Patents

Differential privacy protection method and system


Publication number
CN116992488A
Authority
CN
China
Prior art keywords
data
initial
clustering
data set
exchange
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311245211.7A
Other languages
Chinese (zh)
Other versions
CN116992488B (en)
Inventor
张荣泽
孙绍宁
张升太
展昭生
袁梦晓
高志修
徐明训
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Sanze Information Security Evaluation Co ltd
Original Assignee
Jinan Sanze Information Security Evaluation Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Sanze Information Security Evaluation Co ltd
Priority to CN202311245211.7A
Publication of CN116992488A
Application granted
Publication of CN116992488B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a differential privacy protection method and system, relating to the technical field of privacy protection. The method comprises the following steps: establishing a multidimensional feature set according to an initial data set; performing feature assignment on the multidimensional feature set, establishing initial feature coefficients, and extracting feature values of the data; evaluating the data encryption level of the initial data set to generate a first disturbance association; setting a cluster center constraint, randomly grabbing cluster centers, and executing data clustering with the multidimensional feature set as the distance reference to generate a data clustering result; matching a second disturbance association according to the clustering result; executing local differential disturbance of the initial data set to generate an encrypted data set; and transmitting the encrypted data set to a server. The method and system address the technical problem in the prior art that low privacy protection accuracy and low clustering efficiency lead to low privacy protection efficiency, and achieve the technical effect of improving privacy protection accuracy, clustering efficiency, and thereby privacy protection efficiency.

Description

Differential privacy protection method and system
Technical Field
The invention relates to the technical field of privacy protection, in particular to a differential privacy protection method and a differential privacy protection system.
Background
In the era of widely applied big data, data collection has become ubiquitous. The collected data can provide useful personalized services for users through machine learning, but it can also cause privacy violations. The collected data therefore needs to be anonymized and concealed on the basis of differential privacy before it is used to provide services for users.
In summary, in the prior art, the accuracy and the clustering efficiency of privacy protection are low, which results in the technical problem of low efficiency of privacy protection.
Disclosure of Invention
The invention provides a differential privacy protection method and a differential privacy protection system, which are used for solving the technical problem of lower privacy protection efficiency caused by lower privacy protection accuracy and clustering efficiency in the prior art.
According to a first aspect of the present invention, there is provided a differential privacy preserving method comprising: collecting an initial data set, extracting multidimensional features from the initial data set, and establishing a multidimensional feature set, wherein the initial data set is a real data set; performing feature assignment on the multi-dimensional feature set, establishing an initial feature coefficient, and extracting a feature value of data according to the established multi-dimensional feature set; performing data encryption grade evaluation of the initial data set according to the characteristic value and the initial characteristic coefficient to generate a first disturbance association; setting cluster center constraint, randomly grabbing a cluster center through the cluster center constraint, and executing data clustering by taking the multi-dimensional feature set as a distance reference to generate a data clustering result; matching a second disturbance relation according to a data clustering result, and executing local differential disturbance of the initial data set through the first disturbance relation and the second disturbance relation to generate an encrypted data set; and transmitting the encrypted data set to a server to finish data privacy protection.
According to a second aspect of the present invention, there is provided a differential privacy protection system comprising: an initial data set obtaining module, used for acquiring an initial data set, extracting multidimensional features from the initial data set, and establishing a multidimensional feature set, wherein the initial data set is a real data set; an initial feature coefficient obtaining module, used for performing feature assignment on the multidimensional feature set, establishing initial feature coefficients, and extracting feature values of the data according to the established multidimensional feature set; a first disturbance association obtaining module, used for evaluating the data encryption level of the initial data set according to the feature values and the initial feature coefficients to generate a first disturbance association; a data clustering result obtaining module, used for setting a cluster center constraint, randomly grabbing cluster centers through the cluster center constraint, and executing data clustering with the multidimensional feature set as the distance reference to generate a data clustering result; an encrypted data set obtaining module, used for matching a second disturbance association according to the data clustering result, and executing local differential disturbance of the initial data set through the first disturbance association and the second disturbance association to generate an encrypted data set; and a data privacy protection module, used for transmitting the encrypted data set to a server to complete data privacy protection.
According to a third aspect of the present invention, there is provided a computer device comprising a memory storing a computer program and a processor which, when executing the computer program, implements the method of any one of the first aspect.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the first aspect.
One or more technical schemes provided by the invention have at least the following technical effects or advantages: an initial data set is acquired, multidimensional features are extracted from it, and a multidimensional feature set is established, wherein the initial data set is a real data set; feature assignment is performed on the multidimensional feature set, initial feature coefficients are established, and feature values of the data are extracted according to the established multidimensional feature set; the data encryption level of the initial data set is evaluated according to the feature values and the initial feature coefficients to generate a first disturbance association; a cluster center constraint is set, cluster centers are randomly grabbed through the cluster center constraint, and data clustering is executed with the multidimensional feature set as the distance reference to generate a data clustering result; a second disturbance association is matched according to the data clustering result, and local differential disturbance of the initial data set is executed through the first disturbance association and the second disturbance association to generate an encrypted data set; and the encrypted data set is transmitted to the server to complete data privacy protection. This solves the technical problem in the prior art that low privacy protection accuracy and low clustering efficiency lead to low privacy protection efficiency, improves privacy protection accuracy and clustering efficiency, and achieves the technical effect of high privacy protection efficiency.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following brief description will be given of the drawings used in the description of the embodiments or the prior art, it being obvious that the drawings in the description below are only exemplary and that other drawings can be obtained from the drawings provided without the inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a differential privacy protection method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a differential privacy protection system according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Reference numerals illustrate: the system comprises an initial data set obtaining module 11, an initial characteristic coefficient obtaining module 12, a first disturbance association obtaining module 13, a data clustering result obtaining module 14, an encrypted data set obtaining module 15, a data privacy protecting module 16, a computer device 100, a processor 101, a memory 102 and a bus 103.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Example 1
The differential privacy protection method provided by the embodiment of the invention is described with reference to fig. 1, and the method comprises the following steps:
Acquiring an initial data set, extracting multidimensional features from the initial data set, and establishing a multidimensional feature set, wherein the initial data set is a real data set.
Specifically, the initial data set is a real data set. For example, the fact that Zhang San developed cancer because of smoking is real data. Further, the initial data set is acquired by a statistical data method. The initial data set is either a set of real data of a plurality of individual users or a set of real data of a user group composed of a plurality of individual users. Further, multidimensional feature extraction is performed on the initial data set, and a multidimensional feature set is established. For example, Zhang San suffers from cancer, but the cancer may have various causes, so the cause must be analyzed specifically in combination with Zhang San's diet, congenital diseases, genetics, working environment, lifestyle and the like; these possible causes of the cancer are the multidimensional features.
And carrying out feature assignment on the multi-dimensional feature set, establishing an initial feature coefficient, and extracting a feature value of data according to the established multi-dimensional feature set.
Specifically, feature assignment is performed on the multidimensional feature set, and feature values of the data are extracted according to the multidimensional feature set. For example, the causes of Zhang San's cancer are analyzed through a multidimensional feature set covering diet, congenital diseases, genetics, working environment, lifestyle and the like. The diet feature is assigned by collecting Zhang San's diet data: his daily diet data are acquired and then taken as the feature value. Further, the initial feature coefficients are established according to how strongly each feature of the multidimensional feature set influences the acquisition of the initial data set. For example, if diet and lifestyle have a large influence within the multidimensional feature set of Zhang San's cancer, higher coefficients are assigned to the diet and lifestyle features.
And evaluating the data encryption level of the initial data set according to the characteristic value and the initial characteristic coefficient, and generating a first disturbance association.
Specifically, adaptive coefficient allocation is performed according to the feature values and the initial feature coefficients to obtain an allocation result, the data encryption level of the initial data set is evaluated according to the allocation result, and the first disturbance association is generated according to that evaluation. A disturbance here accounts for the influence of certain physical factors on system operation data during operation or design, and the corresponding disturbance parameters are substituted into the simulation.
Setting cluster center constraint, randomly grabbing a cluster center through the cluster center constraint, and executing data clustering by taking the multi-dimensional feature set as a distance reference to generate a data clustering result.
Specifically, cluster center constraints are set. For example, the cluster center is constrained to be within the data range of the cluster. Further, one data is randomly grabbed in each cluster as a cluster center through cluster center constraint. And taking the feature similarity degree of the multidimensional feature set as a distance reference, wherein the larger the feature similarity degree is, the closer the distance reference is. Further, data clustering is carried out according to the distance reference, and data closer to the distance reference are classified into the same cluster, so that a data clustering result is generated.
And matching a second disturbance relation according to the data clustering result, and executing local differential disturbance of the initial data set through the first disturbance relation and the second disturbance relation to generate an encrypted data set.
Specifically, a second disturbance relation is matched according to a data clustering result, local differential disturbance of an initial data set is executed through the first disturbance relation and the second disturbance relation, disturbance difference of differential privacy is calculated, data encryption is conducted on the disturbance difference, and an encrypted data set is generated.
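The text does not state which noise distribution the local differential disturbance uses or how the two disturbance associations are combined. As a minimal sketch only, the following assumes that each association can be expressed as a positive privacy budget (the hypothetical parameters `level_eps` and `cluster_eps`), that the stricter of the two applies, and that Laplace noise is added to numeric values on the client before transmission. The `sensitivity` parameter is likewise an assumption and is treated as a given bound.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_record(values, level_eps, cluster_eps, sensitivity=1.0, rng=None):
    """Perturb one numeric record locally before it leaves the client.

    ASSUMPTION: level_eps stands in for the first disturbance association
    (from the encryption level evaluation) and cluster_eps for the second
    (matched from the clustering result). The source does not say how the
    two combine; here the smaller (stricter) budget is used.
    """
    rng = rng or random.Random()
    eps = min(level_eps, cluster_eps)
    if eps <= 0:
        raise ValueError("privacy budgets must be positive")
    scale = sensitivity / eps  # a smaller budget gives larger noise
    return [v + laplace_noise(scale, rng) for v in values]


if __name__ == "__main__":
    print(perturb_record([1.0, 4.0], level_eps=0.5, cluster_eps=1.0,
                         rng=random.Random(0)))
```

Taking the minimum of the two budgets is only one possible design choice; a product or a per-cluster lookup table would fit the description equally well.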
And transmitting the encrypted data set to a server to finish data privacy protection.
Specifically, data in clusters are randomly ordered to obtain random ordering results, unique identification is carried out on the random ordering results, head-to-tail data connection is carried out according to the random ordering of the unique identification, and a closed chain structure of each cluster is established. And carrying out data exchange between adjacent data for each closed chain structure to obtain a data exchange result, and transmitting an encrypted data set corresponding to the data exchange result to a server to finish data privacy protection.
This solves the technical problem in the prior art that low privacy protection accuracy and low clustering efficiency lead to low privacy protection efficiency, improves privacy protection accuracy and clustering efficiency, and achieves the technical effect of high privacy protection efficiency.
The method provided by the embodiment of the invention further comprises the following steps:
Determining the number N of clusters, wherein the number N of clusters is determined according to the data volume of the initial data set.
Constructing a data space, wherein the data space characterizes the data distribution space of the initial data set, and the data space is uniformly divided by the number N of clusters to generate N uniform areas.
And randomly distributing cluster centers in the N uniform areas, judging the cluster centers through cluster center constraint, and completing data clustering according to the judging result.
Specifically, a cluster is a set of samples generated by clustering. Samples within the same cluster are similar to each other and differ from samples in other clusters. Further, the initial data set is segmented and clustered according to its data volume, and the number N of clusters is determined. The larger the data volume of the initial data set, the larger the number N of clusters. For example, an initial data set of cancer patients is obtained from the case records of a certain hospital, and the number N of clusters is determined according to the data volume of the cancer types in the initial data set; the cancer types can include lung cancer, head and neck cancer, esophageal cancer, liver cancer and other types.
Further, the data space is the underlying framework for distributed storage of multi-tag data, a technical system that connects data securely and efficiently. A data space is constructed that characterizes the data distribution space of the initial data set, and it is uniformly divided by the number N of clusters to generate N uniform regions, each corresponding to one cluster. For example, if the initial data set is cancer data, the clusters may include a liver cancer cluster, and one uniform region then contains the liver cancer cluster.
Further, a cluster center is a special sample in cluster analysis that represents a class; whether another sample belongs to the class is determined by calculating its distance to the cluster center. Cluster centers are randomly distributed in the N uniform regions and judged against the cluster center constraint, that is, the centers of the N clusters are calculated and the N cluster centers are updated, and data clustering is completed according to the judgment result.
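The region division and random center placement described above can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the data are taken to be 2-D points, the data space is split into N equal-width strips along the x axis, and the cluster center constraint is read as "the center must lie within the data range", following the example given earlier in the text. The function names are hypothetical.

```python
import random


def uniform_regions(lo: float, hi: float, n: int):
    """Split the interval [lo, hi] into n equal-width regions."""
    width = (hi - lo) / n
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]


def random_centers(points, n, rng=None):
    """Place one random 2-D center in each of n uniform strips along x.

    ASSUMPTION: strips along the x axis are an illustrative choice, and the
    constraint "center inside the bounding range of the data" is one reading
    of the cluster center constraint.
    """
    rng = rng or random.Random()
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    centers = []
    for x_lo, x_hi in uniform_regions(min(xs), max(xs), n):
        centers.append((rng.uniform(x_lo, x_hi), rng.uniform(min(ys), max(ys))))
    return centers


if __name__ == "__main__":
    data = [(0, 0), (10, 5), (3, 2)]
    print(random_centers(data, n=2, rng=random.Random(0)))
```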
Wherein, the efficiency of differential privacy protection can be improved by carrying out data clustering.
The method provided by the embodiment of the invention further comprises the following steps:
Taking the cluster center as a cluster concentration point, executing data cluster grabbing within a predetermined distance, and generating a data grabbing result.
And carrying out data centralized evaluation on the data grabbing results, and updating the central position of the clustering center according to the centralized evaluation results.
Iterating the process of cluster grabbing and cluster center updating, stopping the iteration when the iteration result meets the preset requirement, completing data clustering, and taking the final cluster center as a cluster label.
Specifically, the cluster center is used as a cluster concentration point, meaning that data closer to this point are more similar. Further, a predetermined distance is set, which is custom set by those skilled in the art. Data cluster grabbing is executed within the predetermined distance to generate a data grabbing result. For example, if the cluster center is liver cancer data, a predetermined distance is set with that data as the cluster concentration point, and data of clusters such as lung cancer, head and neck cancer and esophageal cancer that fall within the predetermined distance are grabbed. As another example, if the cluster center is (0, 0) and the predetermined distance is 5, a circle of radius 5 is drawn around the center and the coordinate points of the data inside that circle are grabbed.
Further, data concentration evaluation is performed on the data grabbing result: the Euclidean distance between each pair of grabbed data points is calculated, a center point is calculated from these distances, and the center position of the cluster center is updated. For example, if the grabbing result includes (1, 4) and (2, 2), the Euclidean distance between them follows from the distance formula between two points:
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
The distance between each pair of points is calculated in this way, and the center position of the points that lie close together is then computed. For example, the cluster center of (1, 4) and (2, 2) is
$$\left(\frac{1 + 2}{2}, \frac{4 + 2}{2}\right) = (1.5, 3).$$
Further, the center position of the cluster center is updated to (1.5, 3).
Further, the process of cluster grabbing and cluster center is iterated, and the center position in the iteration result is changed continuously. And stopping iteration when the iteration result meets the preset requirement. For example, the predetermined requirement is that the center position of the cluster center is no longer moving or that no data objects are reassigned to different clusters. Further, data clustering is further completed, and a final clustering center is used as a clustering label, so that a plurality of clustering labels are obtained.
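The grab-and-recenter loop above can be sketched as below. It assumes 2-D points, a fixed predetermined radius, and the stopping rule "the center no longer moves" named in the text; the function names are hypothetical, and the update uses the coordinate-wise mean of the grabbed points as the new center.

```python
import math


def grab(points, center, radius):
    """Return the points whose Euclidean distance to center is <= radius."""
    return [p for p in points if math.dist(p, center) <= radius]


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def refine_center(points, center, radius, max_iter=100, tol=1e-9):
    """Repeat grab-then-recenter until the center stops moving."""
    for _ in range(max_iter):
        grabbed = grab(points, center, radius)
        if not grabbed:  # nothing in range: keep the current center
            return center
        new_center = centroid(grabbed)
        if math.dist(new_center, center) <= tol:
            return new_center
        center = new_center
    return center


if __name__ == "__main__":
    data = [(1, 4), (2, 2), (50, 50)]
    # Start at (0, 0) with radius 5, as in the example in the text.
    print(refine_center(data, center=(0, 0), radius=5))
```

Starting from (0, 0) with radius 5, the points (1, 4) and (2, 2) are grabbed, the center moves to (1.5, 3), and the next pass leaves it unchanged, so the loop stops there.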
The data is classified through clustering, so that the efficiency of differential privacy protection can be improved.
The method provided by the embodiment of the invention further comprises the following steps:
Configuring a coefficient adaptive allocation function.
And inputting the multi-dimensional feature set into the self-adaptive distribution function, and performing self-adaptive coefficient distribution based on the feature value and the initial feature coefficient to generate a distribution result.
And carrying out weighted calculation on the characteristic value and the initial characteristic coefficient based on the distribution result, and generating data encryption grade evaluation according to the weighted calculation result.
Specifically, a coefficient adaptive allocation function is configured. It is configured according to how strongly each feature value influences the acquisition of the initial data set, that is, the degree of realism of the initial data set, when the initial data set is obtained from the multidimensional feature set and its initial feature coefficients. The initial feature coefficients are adjusted through the coefficient adaptive allocation function. The initial data set obtaining formula, written in the weighted-sum form implied by the symbol definitions that follow, is:
$$F = \sum_{i=1}^{n} k_i m_i$$
wherein F is the initial data set obtained according to the initial feature coefficients, $k_i$ is the initial feature coefficient of the i-th feature, $m_i$ is the feature value of the i-th feature, i indexes the features of the multidimensional feature set, and n is the number of features in the multidimensional feature set. The initial data set is thus obtained by multiplying each initial feature coefficient by its feature value. The coefficient adaptive allocation function is:
wherein, the distribution result after the initial characteristic coefficient distribution is used as the characteristic coefficient, K is the characteristic coefficient, and v is the distribution coefficient. And carrying out coefficient self-adaptive distribution according to the coefficient self-adaptive distribution function to obtain the characteristic coefficient.
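Using the feature coefficients obtained after allocation in place of the initial feature coefficients, the initial data set obtaining formula becomes:
$$G = \sum_{i=1}^{n} K_i m_i$$
where G is the initial data set obtained from the feature coefficients, $K_i$ is the feature coefficient of the i-th feature after allocation, and $m_i$ is the feature value of the i-th feature.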
Further, the coefficient self-adaptive distribution function is configured in a self-defined mode according to the influence degree of the two initial data sets, namely the actual degree.
Further, the multidimensional feature set is input into the adaptive allocation function, adaptive coefficient allocation is performed based on the feature values and the initial feature coefficients, and an allocation result is generated. For example, suppose the diet, lifestyle and other features of Zhang San's multidimensional feature set are weighted according to the initial feature coefficients, but investigation shows that, among the cancer factors in the resulting initial data set, the lifestyle feature has a higher influence on the cancer. The coefficient adaptive allocation function is then configured to obtain adjusted feature coefficients, and the initial data set is obtained again from the feature values, so that the cause of the cancer is predicted more accurately.
Further, the feature value and the initial feature coefficient are weighted based on the distribution result, and the data encryption level evaluation is generated according to the weighted calculation result. For example, the data encryption level evaluation is obtained based on the degree of realism of the weighted calculation result. The closer the weighted calculation result is to the initial data set, the higher the data encryption level evaluation.
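The weighted calculation can be illustrated with a short sketch. It reads the calculation as a sum of coefficient-times-value products, as the symbol definitions for k, m, i and n suggest. The form of the adaptive allocation, here an element-wise product of a hypothetical allocation coefficient v_i with k_i, is an assumption made purely for illustration, and all numbers are made up.

```python
def weighted_sum(coeffs, values):
    """Sum of coefficient * feature value over all features."""
    if len(coeffs) != len(values):
        raise ValueError("need one coefficient per feature value")
    return sum(k * m for k, m in zip(coeffs, values))


def adapt_coeffs(coeffs, alloc):
    """ASSUMPTION: adjusted coefficient K_i = v_i * k_i (hypothetical form)."""
    return [v * k for v, k in zip(alloc, coeffs)]


k = [0.5, 0.25, 0.25]  # initial feature coefficients (illustrative)
m = [8.0, 4.0, 2.0]    # feature values (illustrative)
v = [1.0, 2.0, 1.0]    # allocation coefficients (illustrative)

F = weighted_sum(k, m)                   # 0.5*8 + 0.25*4 + 0.25*2
G = weighted_sum(adapt_coeffs(k, v), m)  # 0.5*8 + 0.5*4 + 0.25*2
print(F, G)
```

In this illustration doubling the second allocation coefficient raises the weighted result from 5.5 to 6.5; how such a result maps to an encryption level is not specified in the text and is left out here.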
The differential privacy protection efficiency can be improved by acquiring the data encryption grade evaluation.
The method provided by the embodiment of the invention further comprises the following steps:
Randomly ordering the data within each cluster of the data clustering result, and recording the unique identifiers of the random ordering.
And establishing a closed chain structure of each cluster according to the unique identifier.
Data exchange between adjacent data is performed for each closed chain structure.
And processing the encrypted data set according to the data exchange result and transmitting the processed encrypted data set to a server.
Specifically, the random ordering of the data in the same cluster is carried out on the data in the data clustering result, the random ordering is obtained, and the unique identification of the random ordering is recorded. For example, (1, 4), (2, 2), (3, 2) within the same cluster are randomly ordered to obtain random ordering (2, 2), (3, 2), (1, 4).
Further, head-to-tail data connection is carried out according to the random ordering of the unique identification, and a closed chain structure of each cluster is established. Further, data exchange between adjacent data is performed for each closed chain structure. For example, the data exchange may be performed by a cluster merging method. Further, the encrypted data set is processed according to the data exchange result and then transmitted to the server. Wherein processing the encrypted data set is preprocessing the data exchange result, for example preprocessing into unified data units.
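A minimal sketch of the closed chain construction follows. It assumes that the unique identifiers are the positions 1..n in the shuffled order and reads "data exchange between adjacent data" as each position handing its item to its successor around the ring; both readings, and the function names, are assumptions rather than details given in the text.

```python
import random


def build_closed_chain(cluster, rng=None):
    """Shuffle a cluster and label the shuffled items 1..n.

    Returns a list of (uid, item). The successor of the last entry is the
    first entry, which is what makes the chain closed.
    """
    rng = rng or random.Random()
    items = list(cluster)
    rng.shuffle(items)
    return [(uid, item) for uid, item in enumerate(items, start=1)]


def exchange_adjacent(chain):
    """ASSUMPTION: every position receives the item of its predecessor on
    the ring, i.e. a cyclic shift by one position."""
    n = len(chain)
    return [(chain[i][0], chain[(i - 1) % n][1]) for i in range(n)]


if __name__ == "__main__":
    chain = build_closed_chain([(1, 4), (2, 2), (3, 2)], rng=random.Random(0))
    print(chain)
    print(exchange_adjacent(chain))
```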
Wherein, obtaining the data exchange result and carrying out the encryption data set processing can improve the differential privacy protection efficiency.
The method provided by the embodiment of the invention further comprises the following steps:
Taking the data with the unique identifier of 1 in the closed chain structure as initial data, and randomly selecting data in the initial data as initial exchange data.
And carrying out hash calculation on the initial exchange data to generate a calculation result, taking the calculation result as an exchange matching value, and matching the exchange data with the unique identifier of 2.
And adding the initial exchange data to second data, removing the exchange data from the second data, and taking the exchange data subjected to hash calculation as an exchange matching value of third exchange data.
And performing exchange iteration until all data of the closed chain structure are subjected to data exchange termination.
Specifically, a hash algorithm processes arbitrary input data through a hash function and produces an output (the hash value) that represents the input data. For example, a hash algorithm can map binary plaintext of arbitrary length to a shorter binary string, and it is difficult to find two different plaintexts that map to the same hash value. Hash calculation is applied to the data of the closed chain structure, and the data with unique identifier 1 in the closed chain structure is taken as the initial data; the identifier 1 is given here as an exemplary result. Data is then randomly selected from the initial data as the initial exchange data.
Further, hash calculation is performed on the initial exchange data to generate a calculation result, and the calculation result is used as an exchange matching value to match the exchange data with unique identifier 2. That is, starting from the initial exchange data, the data of the closed chain structure are matched for exchange in sequence. The identifier 2 is likewise given as an exemplary result.
Further, the initial exchange data is added to the second data, the exchange data with unique identifier 2 is removed from the second data, and the removed exchange data, after hash calculation, is used as the exchange matching value for the third exchange data. In this way, hash calculation is performed on the data of the closed chain structure in sequence, and each result serves as the exchange matching value for the next exchange data.
Further, exchange iteration is executed in sequence according to the exchange matching value of the third exchange data until all data of the closed chain structure have been exchanged, at which point the exchange terminates.
Performing the data exchange in this way can improve the efficiency of differential privacy protection.
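The following is a minimal sketch of one possible reading of the hash-based exchange iteration, not the patented implementation. The patent text does not name a hash function or say how the exchange matching value selects data, so SHA-256 and the modulo-based selection are assumptions, as are the names `match_index` and `exchange_round` and the representation of each chain position as a list of items.

```python
import hashlib
import random


def match_index(value, size):
    """Assumed matching rule: hash the carried item and reduce it
    modulo the number of items held at the next chain position."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % size


def exchange_round(chain, rng):
    """Pass one item around a closed chain of non-empty item lists.

    chain[0] plays the role of the data with unique identifier 1.
    """
    nodes = [list(node) for node in chain]  # leave the input untouched
    # Randomly select the initial exchange data from the initial data.
    carried = nodes[0].pop(rng.randrange(len(nodes[0])))
    for i in range(1, len(nodes)):
        # The hash of the carried item is the exchange matching value.
        idx = match_index(carried, len(nodes[i]))
        outgoing = nodes[i].pop(idx)  # remove the matched exchange data
        nodes[i].append(carried)      # add the incoming exchange data
        carried = outgoing            # it drives the next match
    nodes[0].append(carried)          # the chain is closed: return to the head
    return nodes


chain = [[1, 4], [2, 2], [3, 2]]
result = exchange_round(chain, random.Random(0))
```

Under these assumptions every position keeps the same number of items and no value is lost or duplicated; only the placement of values along the chain changes.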
The method provided by the embodiment of the invention further comprises the following steps:
When the iteration stops, whether isolated data exists is determined.
If isolated data exists, the data distance between the isolated data and each final clustering center is calculated.
The isolated data is classified according to the comparison result of the data distances, so as to finish data clustering.
Specifically, when the iteration result meets the preset requirement, the iteration is stopped and whether isolated data exists is determined. Isolated data is data that has not been clustered, that is, data with low similarity to the clustered data and a large Euclidean distance from it.
Further, if isolated data exists, the data distance between the isolated data and each final clustering center is calculated using the Euclidean distance, and a distance calculation result is obtained.
Further, the data distances between the isolated data and the final clustering centers in the distance calculation result are sorted in ascending order, so that the first distance is the smallest. The isolated data and the final clustering center corresponding to the first distance are extracted, and the isolated data is assigned to the cluster of that final clustering center, which completes the data clustering.
Clustering the isolated data in this way can improve the reliability of the clustering and the integrity of the data.
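A short sketch of the isolated data assignment described above is given below. The function name `assign_isolated` and the tuple representation of points are illustrative assumptions; the Euclidean distance and the ascending sort follow the description.

```python
import math


def assign_isolated(isolated_points, final_centers):
    """Return, for each isolated point, the index of the nearest final
    clustering center, using the Euclidean distance."""
    assignments = []
    for point in isolated_points:
        distances = [math.dist(point, center) for center in final_centers]
        # Sort center indices by distance, smallest first.
        order = sorted(range(len(final_centers)), key=lambda i: distances[i])
        assignments.append(order[0])  # the first distance is the minimum
    return assignments


final_centers = [(0.0, 0.0), (10.0, 10.0)]
isolated_points = [(1.0, 1.0), (9.0, 8.0)]
labels = assign_isolated(isolated_points, final_centers)
```

Sorting all distances mirrors the wording of the description; taking the minimum directly would give the same assignment.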
Example two
Based on the same inventive concept as the differential privacy protection method in the foregoing embodiment, and as shown in Fig. 2, the present invention further provides a differential privacy protection system, the system including:
The initial data set obtaining module 11 is configured to collect an initial data set, and perform multidimensional feature extraction on the initial data set to establish a multidimensional feature set, where the initial data set is a real data set.
The initial feature coefficient obtaining module 12 is configured to perform feature assignment on the multi-dimensional feature set, establish an initial feature coefficient, and extract a feature value of data according to the established multi-dimensional feature set.
The first disturbance association obtaining module 13 is configured to perform data encryption level evaluation on the initial data set according to the feature value and the initial feature coefficient, so as to generate a first disturbance association.
The data clustering result obtaining module 14 is configured to set a cluster center constraint, randomly grab a cluster center through the cluster center constraint, perform data clustering by using the multi-dimensional feature set as a distance reference, and generate a data clustering result.
The encrypted data set obtaining module 15 is configured to match a second perturbation association according to a data clustering result, and execute local differential perturbation of the initial data set through the first perturbation association and the second perturbation association, so as to generate an encrypted data set.
The data privacy protection module 16 is configured to transmit the encrypted data set to a server, so as to complete data privacy protection.
Further, the system further comprises:
The cluster number obtaining module is configured to determine the number N of clusters, where N is determined according to the data volume of the initial data set.
The data space obtaining module is configured to construct a data space, where the data space represents the data distribution space of the initial data set and is uniformly divided according to the number N of clusters to generate N uniform areas.
The cluster center obtaining module is configured to randomly distribute cluster centers in the N uniform areas, judge the cluster centers through the cluster center constraint, and complete data clustering according to the judgment result.
Further, the system further comprises:
The data capture result obtaining module is configured to take a clustering center as a clustering concentration point and perform data clustering grabbing within a preset distance to generate a data grabbing result.
The central position obtaining module is configured to perform data centralized evaluation on the data grabbing result and update the central position of the clustering center according to the centralized evaluation result.
The data clustering module is configured to iterate the cluster grabbing and cluster center updating process, stop the iteration when the iteration result meets the preset requirement, complete data clustering, and take the final clustering center as a clustering label.
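The grab, evaluate, and update loop handled by these three modules could look roughly like the sketch below. It is an interpretation under stated assumptions: the data centralized evaluation is replaced by the arithmetic mean of the grabbed points, the preset requirement is taken to be a small center shift, and the names `grab_and_update`, `radius`, `tol`, and `max_iter` are hypothetical.

```python
import math


def grab_and_update(points, centers, radius, tol=1e-6, max_iter=100):
    """Grab points within `radius` of their nearest center, move each
    center to the mean of its grabbed points, and repeat until stable."""
    centers = [tuple(c) for c in centers]
    grabbed, isolated = [[] for _ in centers], []
    for _ in range(max_iter):
        grabbed = [[] for _ in centers]
        isolated = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            nearest = min(range(len(centers)), key=dists.__getitem__)
            if dists[nearest] <= radius:
                grabbed[nearest].append(p)  # within the preset distance
            else:
                isolated.append(p)          # not grabbed by any center
        new_centers = []
        for center, members in zip(centers, grabbed):
            if members:
                # Assumed centralized evaluation: coordinate-wise mean.
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(center)  # keep a center with no members
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:                    # assumed preset requirement
            break
    return centers, grabbed, isolated


points = [(0, 0), (0, 2), (10, 10), (10, 12), (50, 50)]
centers, grabbed, isolated = grab_and_update(points, [(1, 1), (9, 9)], radius=5)
```

Points left in `isolated` when the loop stops correspond to the isolated data that the method later assigns to the nearest final clustering center.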
Further, the system further comprises:
The coefficient adaptive distribution function obtaining module is configured to configure the coefficient adaptive distribution function.
The distribution result obtaining module is configured to input the multi-dimensional feature set into the adaptive distribution function, and perform adaptive coefficient distribution based on the feature value and the initial feature coefficient to generate a distribution result.
The data encryption grade evaluation obtaining module is configured to perform weighted calculation on the feature value and the initial feature coefficient based on the distribution result, and generate the data encryption grade evaluation according to the weighted calculation result.
Further, the system further comprises:
The data random ordering module is configured to randomly order the data within each cluster of the data clustering result, and record the unique identifiers of the random ordering.
The closed chain structure obtaining module is configured to establish a closed chain structure for each cluster according to the unique identifiers.
The data exchange module is configured to perform data exchange between adjacent data for each closed chain structure.
The encrypted data set processing module is configured to process the encrypted data set according to the data exchange result and transmit the processed encrypted data set to the server.
Further, the system further comprises:
The initial exchange data acquisition module is configured to take the data with unique identifier 1 in the closed chain structure as the initial data, and randomly select data from the initial data as the initial exchange data.
The exchange matching value obtaining module is configured to perform hash calculation on the initial exchange data to generate a calculation result, and use the calculation result as an exchange matching value to match the exchange data with unique identifier 2.
The second data acquisition module is configured to add the initial exchange data to the second data, remove the matched exchange data from the second data, and use the removed exchange data, after hash calculation, as the exchange matching value for the third exchange data.
The exchange iteration module is configured to perform exchange iteration until all data in the closed chain structure have been exchanged, at which point the data exchange terminates.
Further, the system further comprises:
The isolated data obtaining module is configured to determine whether isolated data exists when the iteration stops.
The data distance obtaining module is configured to calculate the data distance between the isolated data and each final clustering center if isolated data exists.
The data classifying module is configured to classify the isolated data according to the comparison result of the data distances, so as to finish data clustering.
The specific examples of the differential privacy protection method in the first embodiment also apply to the differential privacy protection system of this embodiment. From the foregoing detailed description of the method, a person skilled in the art can clearly understand the system of this embodiment, so the details are not repeated here for brevity. Because the system corresponds to the method disclosed in the embodiment, its description is relatively brief; for relevant points, refer to the description of the method.
Example three
Fig. 3 is a schematic diagram of the third embodiment of the present invention. As shown in Fig. 3, a computer device 100 of the present invention may include a processor 101 and a memory 102.
The memory 102 is used for storing a program. The memory 102 may include a volatile memory, for example a random-access memory (RAM) such as a static random-access memory (SRAM) or a double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include a non-volatile memory, such as a flash memory. The memory 102 is used to store computer programs (e.g., application programs and functional modules that implement the methods described above), computer instructions, and the like, which may be stored in partitions in one or more memories 102. The computer programs, computer instructions, data, and the like may be invoked by the processor 101.
The processor 101 is used for executing the computer program stored in the memory 102 to implement the steps of the method in the above embodiment. For details, reference may be made to the description of the foregoing method embodiment.
The processor 101 and the memory 102 may be separate structures or may be integrated into a single structure. When the processor 101 and the memory 102 are separate structures, the memory 102 and the processor 101 may be coupled by a bus 103.
The computer device in this embodiment may execute the technical solution in the above method, and the specific implementation process and the technical principle are the same, which are not described herein again.
According to an embodiment of the present invention, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed, implements the steps provided by any of the embodiments described above.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; the present invention is not limited in this respect.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of differential privacy protection, the method comprising:
collecting an initial data set, extracting multidimensional features from the initial data set, and establishing a multidimensional feature set, wherein the initial data set is a real data set;
performing feature assignment on the multi-dimensional feature set, establishing an initial feature coefficient, and extracting a feature value of data according to the established multi-dimensional feature set;
performing data encryption grade evaluation of the initial data set according to the characteristic value and the initial characteristic coefficient to generate a first disturbance association;
setting cluster center constraint, randomly grabbing a cluster center through the cluster center constraint, and executing data clustering by taking the multi-dimensional feature set as a distance reference to generate a data clustering result;
matching a second disturbance association according to the data clustering result, and executing local differential disturbance of the initial data set through the first disturbance association and the second disturbance association to generate an encrypted data set;
and transmitting the encrypted data set to a server to finish data privacy protection.
2. The method of claim 1, wherein the method further comprises:
determining the number N of clusters, wherein the number N of clusters is determined according to the data volume of the initial data set;
constructing a data space, wherein the data space represents a data distribution space of the initial data set, and the data space is uniformly divided by the number N of clusters to generate N uniform areas;
and randomly distributing cluster centers in the N uniform areas, judging the cluster centers through cluster center constraint, and completing data clustering according to the judging result.
3. The method of claim 2, wherein the method further comprises:
taking the clustering center as a clustering concentration point, executing data clustering grabbing within a preset distance, and generating a data grabbing result;
carrying out data centralized evaluation on the data grabbing results, and updating the central position of the clustering center according to the centralized evaluation results;
and iterating the cluster grabbing and cluster center updating process, stopping the iteration when the iteration result meets the preset requirement, completing data clustering, and taking the final cluster center as a cluster label.
4. The method of claim 1, wherein the method further comprises:
configuring a coefficient self-adaptive allocation function;
inputting the multi-dimensional feature set into the self-adaptive distribution function, and performing self-adaptive coefficient distribution based on the feature value and the initial feature coefficient to generate a distribution result;
and carrying out weighted calculation on the characteristic value and the initial characteristic coefficient based on the distribution result, and generating data encryption grade evaluation according to the weighted calculation result.
5. The method of claim 1, wherein the method further comprises:
respectively carrying out random ordering on the data in the same cluster to the data clustering result, and recording a unique identifier of the random ordering;
establishing a closed chain structure of each cluster according to the unique identifier;
carrying out data exchange between adjacent data for each closed chain structure;
and processing the encrypted data set according to the data exchange result and transmitting the processed encrypted data set to a server.
6. The method of claim 5, wherein the method further comprises:
taking the data with the unique identifier 1 in the closed chain structure as initial data, and randomly selecting the data in the initial data as initial exchange data;
carrying out hash calculation on the initial exchange data to generate a calculation result, taking the calculation result as an exchange matching value, and matching the exchange data with the unique identifier of 2;
adding the initial exchange data to second data, removing the exchange data from the second data, and taking the exchange data subjected to hash calculation as an exchange matching value of third exchange data;
and performing exchange iteration until all data of the closed chain structure are subjected to data exchange termination.
7. The method of claim 3, wherein the method further comprises:
when the iteration is stopped, judging whether isolated data exist or not;
if the isolated data exist, calculating the data distance between the isolated data and each final clustering center;
and carrying out data classification of the isolated data according to the comparison result of the data distance so as to finish data clustering.
8. A differential privacy preserving system for implementing a differential privacy preserving method as claimed in any one of claims 1 to 7, the system comprising:
the initial data set obtaining module is used for collecting an initial data set, extracting multidimensional features from the initial data set, and establishing a multidimensional feature set, wherein the initial data set is a real data set;
the initial characteristic coefficient obtaining module is used for carrying out characteristic assignment on the multidimensional characteristic set, establishing an initial characteristic coefficient and extracting a characteristic value of data according to the established multidimensional characteristic set;
the first disturbance association obtaining module is used for evaluating the data encryption level of the initial data set according to the characteristic value and the initial characteristic coefficient to generate a first disturbance association;
the data clustering result obtaining module is used for setting cluster center constraint, randomly grabbing a cluster center through the cluster center constraint, and executing data clustering by taking the multi-dimensional feature set as a distance reference to generate a data clustering result;
the encrypted data set obtaining module is used for matching a second disturbance association according to a data clustering result, and executing local differential disturbance of the initial data set through the first disturbance association and the second disturbance association to generate an encrypted data set;
and the data privacy protection module is used for transmitting the encrypted data set to a server to finish data privacy protection.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1-7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1-7.
CN202311245211.7A 2023-09-26 2023-09-26 Differential privacy protection method and system Active CN116992488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311245211.7A CN116992488B (en) 2023-09-26 2023-09-26 Differential privacy protection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311245211.7A CN116992488B (en) 2023-09-26 2023-09-26 Differential privacy protection method and system

Publications (2)

Publication Number Publication Date
CN116992488A true CN116992488A (en) 2023-11-03
CN116992488B CN116992488B (en) 2024-01-05

Family

ID=88521644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311245211.7A Active CN116992488B (en) 2023-09-26 2023-09-26 Differential privacy protection method and system

Country Status (1)

Country Link
CN (1) CN116992488B (en)

Also Published As

Publication number Publication date
CN116992488B (en) 2024-01-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant