CN109495476B - Data stream differential privacy protection method and system based on edge calculation - Google Patents


Info

Publication number: CN109495476B
Authority: CN (China)
Application number: CN201811379012.4A
Other languages: Chinese (zh)
Other versions: CN109495476A (en)
Inventors: 张尧学, 刘峻丞, 任炬, 胥楚贵
Original Assignee: Central South University
Current Assignee: Central South University (the listed assignee may be inaccurate; no legal analysis has been performed)
Legal status: Active (the status is an assumption, not a legal conclusion)
Prior art keywords: data, time window, feature, disturbance noise, current

Application events:
    • Application filed by Central South University
    • Priority to CN201811379012.4A
    • Publication of CN109495476A
    • Application granted
    • Publication of CN109495476B

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/04: Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428: Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218: Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245: Protecting personal data, e.g. for financial or medical purposes
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/0643: Hash functions, e.g. MD5, SHA, HMAC or f9 MAC

Abstract

The invention discloses a data stream differential privacy protection method and system based on edge computing. The method comprises the following steps: S1, an edge device receives feature data that has been collected by a terminal device and extracted by a preset encoder; S2, the feature data is aggregated and disturbance noise is added; S3, feature reconstruction is performed on the noise-perturbed feature data by a preset decoder to obtain reconstructed data. The encoder and the decoder are obtained by training the same self-encoder (autoencoder). The method offers low service response delay, high quality of service, high system throughput, a small computing load on each edge device, a small volume of data transmitted between the user and the edge device, and a high degree of privacy protection.

Description

Data stream differential privacy protection method and system based on edge calculation
Technical Field
The invention relates to the field of edge computing, in particular to a data stream differential privacy protection method and system based on edge computing.
Background
With the advent of the information age, the information technology industry has developed rapidly. The Internet is one of its fastest-growing sectors and has become indispensable in many fields because it provides diversified services to users. As Internet terminal devices grow in variety and explode in number, and as users' Quality of Service (QoS) and diversification demands rise markedly, the Internet faces many challenges. Three of the main ones are how to process the massive data in the Internet, how to guarantee real-time service, and how to ensure user security.
Cloud computing, as an Internet-based computing model, provides on-demand, scalable services through centralized computing and storage. However, as the number of terminal devices and the volume of data grow and ubiquitous network technology develops rapidly, moving computation to the cloud not only occupies a large amount of network bandwidth but also increases the delay of service requests and responses. For delay-sensitive applications in particular, cloud computing, with its centralized computation, can hardly meet the requirements of these technologies and applications. This has driven the rise of computing models such as edge computing and fog computing. In principle, edge computing and fog computing share a similar idea: bring computing closer to users, that is, extend cloud computing from centralized large data centers to the network edge near the users, thereby overcoming the network bottleneck and high latency of traditional cloud computing and improving the service response speed and user experience of end users. Technically, fog computing and edge computing relieve the computing pressure on the cloud by deploying dedicated servers or small and medium-sized computing centers at the network edge close to users, and improve the QoS of user services. With edge computing, user requirements can be better met in scenarios with large data volumes and strict real-time requirements.
In traditional cloud computing, data must be stored in the cloud before it is processed, which increases the response time of service requests. If edge computing also stores data before processing it, the response time is shortened because computation is closer to the user, but this is still not a good solution for scenarios with strict real-time requirements. If data can instead be processed while it is being transmitted, the service response time is greatly shortened; this is the real-time data stream approach based on edge computing. Kafka, a distributed streaming platform, can provide real-time data stream processing for edge devices. It has three key capabilities: (1) publishing and subscribing to streams of data; (2) storing streams of data safely in a distributed, replicated, fault-tolerant cluster; (3) processing streams of data as they arrive. These three capabilities are what a streaming platform requires. In Kafka, a topic is an abstraction of a group of messages, or a category of messages. In a typical producer-consumer model, a producer sends messages to a topic, the messages are stored on Kafka servers called brokers, and a consumer subscribes to the topic and consumes the messages from the brokers.
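The producer, topic, broker, and consumer roles described above can be illustrated with a small in-memory model. This is only a sketch of the concepts: `ToyBroker`, `publish`, and `poll` are hypothetical names and are not the Kafka client API, and the messages are made-up values.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)     # topic name -> list of messages
        self._offsets = defaultdict(int)   # (consumer, topic) -> next offset

    def publish(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self._logs[topic].append(message)

    def poll(self, consumer, topic):
        """Consumer side: return the messages this consumer has not seen yet."""
        start = self._offsets[(consumer, topic)]
        batch = self._logs[topic][start:]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
broker.publish("features", {"device": "phone-1", "value": 0.4})
broker.publish("features", {"device": "phone-2", "value": 0.9})

first = broker.poll("edge-node", "features")    # both published messages
second = broker.poll("edge-node", "features")   # nothing new since last poll
```

Keeping a per-consumer offset instead of deleting messages is what lets several consumers read the same topic independently.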
Although edge computing and real-time data stream processing benefit data analysis, edge computing, like other traditional computing models, also faces serious security issues. In mobile applications, for example, many online services rely on personal data collected from users. This data enhances the utility of mobile applications by enabling personalized services such as targeted advertising and purchase recommendations, but malicious attackers can also use it to infer sensitive information about users, for example through gender inference, location tracking, or speaker identification. From the user's perspective, as little private information as possible should be exposed, that is, as little personal data as possible should be collected. From the service provider's perspective, collecting more personal data allows better service. There is an inherent contradiction between the two. How to balance the usability of the collected information against the security of user privacy therefore needs careful consideration.
Existing schemes for protecting user privacy mainly rely on anonymization, data transformation, data encryption, and differential privacy. Even with these techniques, current schemes still have the following defects:
1. Even though edge computing moves part of the computation from the cloud down to edge devices near the user, services with strict real-time requirements still cannot be satisfied.
2. Edge computing faces a security problem: the data handled by the edge device involves a conflict between data availability and privacy security.
3. Most current privacy protection approaches rely on centralized data sanitization (privacy removal), which limits system throughput and cannot meet the requirements of low-latency services.
4. There is a conflict between the computing power of edge devices and security policies.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the technical problems in the prior art, the invention provides a data stream differential privacy protection method and system based on edge computing, with low service response delay, high quality of service, high system throughput, a small computing load on each edge device, a small volume of data transmitted between the user and the edge device, and a high degree of privacy protection.
To solve the above technical problems, the technical scheme provided by the invention is a data stream differential privacy protection method based on edge computing, comprising the following steps:
S1. an edge device receives feature data that has been collected by a terminal device and extracted by a preset encoder;
S2. the feature data is aggregated and disturbance noise is added;
S3. feature reconstruction is performed on the noise-perturbed feature data by a preset decoder to obtain reconstructed data;
the encoder and the decoder are obtained by training the same self-encoder (autoencoder).
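Steps S1 to S3 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `encode` and `decode` are toy stand-ins for the trained encoder and decoder, a single `epsilon` is used for every value instead of the per-feature budgets, and all function names and sample values are hypothetical.

```python
import random


def encode(record):
    """Toy encoder (assumption): average adjacent pairs, halving the width."""
    return [(record[i] + record[i + 1]) / 2 for i in range(0, len(record), 2)]


def decode(features):
    """Toy decoder (assumption): repeat each feature to restore the width."""
    return [f for f in features for _ in range(2)]


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def edge_process(feature_vectors, sensitivity, epsilon, rng):
    """S2 and S3 on the edge device for one window of received features."""
    batch = list(feature_vectors)                       # S2: aggregate
    noisy = [[f + laplace(sensitivity / epsilon, rng) for f in vec]
             for vec in batch]                          # S2: perturb
    return [decode(vec) for vec in noisy]               # S3: reconstruct


rng = random.Random(0)
raw = [[1.0, 3.0, 2.0, 4.0], [0.0, 2.0, 6.0, 2.0]]     # two terminal devices
features = [encode(r) for r in raw]                    # S1 happens on-device
reconstructed = edge_process(features, sensitivity=1.0, epsilon=1.0, rng=rng)
```

Only the compressed `features` leave the terminal devices, and only the perturbed, reconstructed records leave the edge device.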
Further, the feature data in step S1 is obtained as follows: the terminal device collects data in units of a preset acquisition time window, and a preset encoder performs feature extraction on the data collected within one acquisition time window.
Further, step S2 specifically includes: according to a preset first time window, an input layer node in the edge device aggregates the feature data received from the terminal devices within the first time window, calculates the disturbance noise budget of each feature, and adds disturbance noise to the feature data according to that budget.
Further, the disturbance noise budget is calculated and determined according to equation (1):
[Equation (1): image GDA0002621498980000031 in the source; not reproduced in this text]
In equation (1), $\varepsilon$ is the preset total privacy budget, $n$ is the total number of terminal devices, $n_k$ is the number of terminal devices connected to the current input layer node, $\varepsilon_k$ is the privacy budget of the current input layer node, $\beta_i$ is the proportion of the current input layer node's privacy budget assigned to each feature within the current first time window, and $d$ is the dimension of the feature. The symbol shown in the source as image GDA0002621498980000032 represents the average correlation degree of the $i$-th input feature within the current first time window of the current input layer node, that is, the average Euclidean distance between the current feature, taken as the central point, and its adjacent features. $f_j$ is the $j$-th feature value within the current first time window of the current input layer node, and $\varepsilon_i$ is the privacy budget of the $i$-th input feature within the current first time window of the current input layer node.
Further, adding disturbance noise to the feature data according to equation (2):
$$f_i' = f_i + \mathrm{Lap}\left(\Delta h_0 / \varepsilon_i\right) \tag{2}$$
In equation (2), $f_i'$ is the feature value after the disturbance noise is added, $f_i$ is the feature value before the disturbance noise is added, $\Delta h_0$ denotes the global sensitivity, $\mathrm{Lap}(\cdot)$ denotes the Laplace distribution, and $\varepsilon_i$ is the privacy budget of the $i$-th input feature within the current first time window of the current input layer node.
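As a reading aid for equation (2), and assuming the standard zero-mean Laplace distribution with scale $b = \Delta h_0 / \varepsilon_i$ (the source does not spell out the parameterisation), the noise density and variance are:

```latex
p(x) = \frac{1}{2b}\exp\left(-\frac{|x|}{b}\right), \qquad
\operatorname{Var}[x] = 2b^2 = \frac{2\,\Delta h_0^2}{\varepsilon_i^2}
```

For the illustrative values $\Delta h_0 = 1$ and $\varepsilon_i = 0.5$ (not taken from the source), $b = 2$ and the noise standard deviation is $\sqrt{8} \approx 2.83$. Halving $\varepsilon_i$ doubles the scale, so a feature that receives a smaller budget is perturbed more strongly.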
Further, step S3 specifically includes: according to a preset second time window, an output layer node in the edge device receives and aggregates the noise-perturbed feature data provided by the input layer nodes, and performs feature reconstruction on the feature data received within the second time window by a preset decoder to obtain reconstructed data.
A data stream differential privacy protection system based on edge computing, comprising an edge device for: receiving characteristic data acquired by terminal equipment and subjected to characteristic extraction through a preset encoder; aggregating the characteristic data and adding disturbance noise; performing feature reconstruction on the feature data added with the disturbance noise through a preset decoder to obtain reconstructed data; the encoder and the decoder are obtained by training the same self-encoder.
Further, the edge device includes an input layer node, where the input layer node is configured to aggregate the received feature data acquired by each terminal device in a first preset time window, calculate a disturbance noise budget for each feature data, and add disturbance noise to the feature data according to the disturbance noise budget.
Further, the disturbance noise budget is calculated and determined according to equation (1):
[Equation (1): image GDA0002621498980000041 in the source; not reproduced in this text]
In equation (1), $\varepsilon$ is the preset total privacy budget, $n$ is the total number of terminal devices, $n_k$ is the number of terminal devices connected to the current input layer node, $\varepsilon_k$ is the privacy budget of the current input layer node, $\beta_i$ is the proportion of the current input layer node's privacy budget assigned to each feature within the current first time window, and $d$ is the dimension of the feature. The symbol shown in the source as image GDA0002621498980000042 represents the average correlation degree of the $i$-th input feature within the current first time window of the current input layer node, that is, the average Euclidean distance between the current feature, taken as the central point, and its adjacent features. $f_j$ is the $j$-th feature value within the current first time window of the current input layer node, and $\varepsilon_i$ is the privacy budget of the $i$-th input feature within the current first time window of the current input layer node.
Further, adding disturbance noise to the feature data according to equation (2):
$$f_i' = f_i + \mathrm{Lap}\left(\Delta h_0 / \varepsilon_i\right) \tag{2}$$
In equation (2), $f_i'$ is the feature value after the disturbance noise is added, $f_i$ is the feature value before the disturbance noise is added, $\Delta h_0$ denotes the global sensitivity, $\mathrm{Lap}(\cdot)$ denotes the Laplace distribution, and $\varepsilon_i$ is the privacy budget of the $i$-th input feature within the current first time window of the current input layer node.
Further, the edge device includes an output layer node, where the output layer node is configured to receive and aggregate the feature data, provided by the input layer node, after the disturbance noise is added according to a preset second time window, and perform feature reconstruction on the feature data received in the second time window through a preset decoder, so as to obtain reconstructed data.
Further, the system also comprises a terminal device, which collects data in units of a preset acquisition time window, performs feature extraction on the data within the acquisition time window with a preset encoder to obtain feature data, and provides the feature data to the edge device.
Compared with the prior art, the invention has the advantages that:
1. An acquisition time window is set on the terminal device; data is collected per acquisition time window, features are extracted, and the feature data is sent to the edge device for subsequent processing. The input layer node of the edge device receives, per first time window, the feature data sent by the terminal devices connected to that node and adds disturbance noise to each feature through an adaptive algorithm. The output layer node of the edge device receives, per second time window, the noise-perturbed feature data from the input layer nodes and reconstructs it with the decoder. The reconstructed data is provided to other systems, which cannot derive the user's sensitive information from it.
2. The edge device has multiple input layer nodes, each connected to multiple terminal devices and processing the feature data of the terminal devices connected to it.
3. The terminal device aligns the collected data by hashing, extracts features with its encoder, and sends the feature data to the input layer node of the edge device, which reduces the volume of data transmitted between the user's terminal device and the input layer node and avoids wasting network bandwidth. In addition, the encoder and the decoder are two parts of the same self-encoder and are trained in advance before the encoder is loaded onto the terminal device, so the terminal device does not need to train the encoder, which lowers the processing capability required of the terminal device.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
FIG. 2 is a system architecture diagram according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a self-encoder architecture according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
As shown in fig. 1, the data stream differential privacy protection method based on edge computing of this embodiment includes: S1, an edge device receives feature data that has been collected by a terminal device and extracted by a preset encoder; S2, the feature data is aggregated and disturbance noise is added; S3, feature reconstruction is performed on the noise-perturbed feature data by a preset decoder to obtain reconstructed data. The encoder and the decoder are obtained by training the same self-encoder.
In this embodiment, the feature data in step S1 is obtained as follows: the terminal device collects data in units of a preset acquisition time window, and a preset encoder performs feature extraction on the data collected within one acquisition time window. Step S2 specifically includes: according to a preset first time window, an input layer node in the edge device aggregates the feature data received from the terminal devices within the first time window, calculates the disturbance noise budget of each feature, and adds disturbance noise to the feature data according to that budget.
In this embodiment, the disturbance noise budget is calculated and determined according to equation (1):
[Equation (1): image GDA0002621498980000061 in the source; not reproduced in this text]
In equation (1), $\varepsilon$ is the preset total privacy budget, $n$ is the total number of terminal devices, $n_k$ is the number of terminal devices connected to the current input layer node, $\varepsilon_k$ is the privacy budget of the current input layer node, $\beta_i$ is the proportion of the current input layer node's privacy budget assigned to each feature within the current first time window, and $d$ is the dimension of the feature. The symbol shown in the source as image GDA0002621498980000062 represents the average correlation degree of the $i$-th input feature within the current first time window of the current input layer node, that is, the average Euclidean distance between the current feature, taken as the central point, and its adjacent features. $f_j$ is the $j$-th feature value within the current first time window of the current input layer node, and $\varepsilon_i$ is the privacy budget of the $i$-th input feature within the current first time window of the current input layer node.
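Because equation (1) survives only as an image reference in this text, the sketch below does not reproduce it. It shows one plausible way to turn the quantities defined above into per-feature budgets, under three explicit assumptions that are not confirmed by the source: the node budget is the device-proportional share of the total budget, each feature's share is proportional to its average distance to the other features, and "adjacent features" means all other features in the window. All function names are hypothetical.

```python
def average_distance(features, i):
    """Mean absolute difference between feature i and every other feature."""
    others = [abs(features[i] - f) for j, f in enumerate(features) if j != i]
    return sum(others) / len(others)


def allocate_budgets(features, total_epsilon, n_total, n_node):
    """Assumed allocation, NOT the patent's equation (1).

    eps_k  = total_epsilon * n_node / n_total   (node share, assumption)
    beta_i = dbar_i / sum_j dbar_j              (feature share, assumption)
    eps_i  = beta_i * eps_k
    """
    if len(features) < 2:
        raise ValueError("need at least two features to compare distances")
    eps_k = total_epsilon * n_node / n_total
    dbars = [average_distance(features, i) for i in range(len(features))]
    total = sum(dbars)
    if total == 0:
        # All features identical: fall back to an even split.
        betas = [1 / len(features)] * len(features)
    else:
        betas = [d / total for d in dbars]
    return eps_k, [beta * eps_k for beta in betas]


# Illustrative values only: 5 of 10 devices attach to this node.
eps_k, budgets = allocate_budgets([1.0, 2.0, 4.0], total_epsilon=1.0,
                                  n_total=10, n_node=5)
```

Under these assumptions the per-feature budgets always sum to the node budget, which keeps the node within its share under sequential composition. A feature whose share comes out as zero would need special handling before being used as a divisor in equation (2).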
In the present embodiment, disturbance noise is added to the feature data according to equation (2):
$$f_i' = f_i + \mathrm{Lap}\left(\Delta h_0 / \varepsilon_i\right) \tag{2}$$
In equation (2), $f_i'$ is the feature value after the disturbance noise is added, $f_i$ is the feature value before the disturbance noise is added, $\Delta h_0$ denotes the global sensitivity, $\mathrm{Lap}(\cdot)$ denotes the Laplace distribution, and $\varepsilon_i$ is the privacy budget of the $i$-th input feature within the current first time window of the current input layer node.
In this embodiment, step S3 specifically includes: according to a preset second time window, an output layer node in the edge device receives and aggregates the noise-perturbed feature data provided by the input layer nodes, and performs feature reconstruction on the feature data received within the second time window by a preset decoder to obtain reconstructed data.
A data stream differential privacy protection system based on edge computing, comprising an edge device for: receiving characteristic data acquired by terminal equipment and subjected to characteristic extraction through a preset encoder; aggregating the characteristic data and adding disturbance noise; performing feature reconstruction on the feature data added with the disturbance noise through a preset decoder to obtain reconstructed data; the encoder and the decoder are obtained by training the same self-encoder.
In this embodiment, the edge device includes an input layer node, where the input layer node is configured to aggregate, according to a preset first time window, the feature data acquired by each terminal device and received in the first time window, calculate a disturbance noise budget for each feature data, and add disturbance noise to the feature data according to the disturbance noise budget.
Further, the disturbance noise budget is calculated and determined according to equation (1):
[Equation (1): image GDA0002621498980000071 in the source; not reproduced in this text]
In equation (1), $\varepsilon$ is the preset total privacy budget, $n$ is the total number of terminal devices, $n_k$ is the number of terminal devices connected to the current input layer node, $\varepsilon_k$ is the privacy budget of the current input layer node, $\beta_i$ is the proportion of the current input layer node's privacy budget assigned to each feature within the current first time window, and $d$ is the dimension of the feature. The symbol shown in the source as image GDA0002621498980000072 represents the average correlation degree of the $i$-th input feature within the current first time window of the current input layer node, that is, the average Euclidean distance between the current feature, taken as the central point, and its adjacent features. $f_j$ is the $j$-th feature value within the current first time window of the current input layer node, and $\varepsilon_i$ is the privacy budget of the $i$-th input feature within the current first time window of the current input layer node.
In the present embodiment, disturbance noise is added to the feature data according to equation (2):
$$f_i' = f_i + \mathrm{Lap}\left(\Delta h_0 / \varepsilon_i\right) \tag{2}$$
In equation (2), $f_i'$ is the feature value after the disturbance noise is added, $f_i$ is the feature value before the disturbance noise is added, $\Delta h_0$ denotes the global sensitivity, $\mathrm{Lap}(\cdot)$ denotes the Laplace distribution, and $\varepsilon_i$ is the privacy budget of the $i$-th input feature within the current first time window of the current input layer node.
In this embodiment, the edge device includes an output layer node, where the output layer node is configured to receive and aggregate feature data, which is provided by the input layer node and to which the disturbance noise is added, according to a preset second time window, and perform feature reconstruction on the feature data received in the second time window through a preset decoder, so as to obtain reconstructed data.
In this embodiment, the system further includes a terminal device, where the terminal device is configured to collect data according to a preset collection time window as a unit, perform feature extraction on the data in the collection time window according to a preset encoder, obtain feature data, and provide the feature data to the edge device.
In this embodiment, 10000 records of real data from an urban taxi application scenario are used as an example. The experimental data contains 17 fields, including medallion (MD5 value of the vehicle's binding identifier), hack_license (MD5 value of the taxi driving license's binding identifier), pickup_datatime (passenger pick-up time), dropoff_datatime (passenger drop-off time), trip_time_in_secs (trip duration), trip_distance (driving distance), fare_amount, surcharge (extra fee), mta_tax, tip_amount (tip), tolls_amount (tolls), and total_amount (sum of all fees). The cloud's query is to compute the sum of taxi fares in each time window. Because the cloud only needs the total fare, only the fields related to time and fees are retained: pickup_datatime, dropoff_datatime, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, and total_amount.
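The cloud-side query described above, the sum of fares per time window, can be sketched as a tumbling-window aggregation. The timestamp format, the 300-second window, the choice of the pick-up time as the window key, and the three sample trips are assumptions made for illustration; they are not taken from the data set.

```python
from collections import defaultdict
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def fare_sum_per_window(trips, window_seconds):
    """Sum total_amount per tumbling window, keyed by window start (seconds)."""
    totals = defaultdict(float)
    for trip in trips:
        # Assumed timestamp format; the field name follows the text above.
        pickup = datetime.strptime(trip["pickup_datatime"], "%Y-%m-%d %H:%M:%S")
        seconds = int((pickup - EPOCH).total_seconds())
        window_start = seconds - seconds % window_seconds
        totals[window_start] += trip["total_amount"]
    return dict(totals)


# Made-up sample trips for illustration only.
trips = [
    {"pickup_datatime": "2013-01-01 00:01:00", "total_amount": 10.5},
    {"pickup_datatime": "2013-01-01 00:04:30", "total_amount": 7.0},
    {"pickup_datatime": "2013-01-01 00:06:00", "total_amount": 12.0},
]
sums = fare_sum_per_window(trips, window_seconds=300)
```

The first two trips fall into the same five-minute window and the third into the next one, so the query returns two sums.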
In the application scenario of this embodiment, the system architecture is as shown in fig. 2. It includes a plurality of terminal devices (smart phones) and an edge device composed of a plurality of PCs, which communicate with each other through a switch and a high-speed network. The edge device comprises a plurality of input layer nodes and an output layer node. Each input layer node is connected over the network to a plurality of terminal devices, receives the feature data they send, aggregates it, and adds disturbance noise (differential privacy perturbation). The output layer node is connected to the input layer nodes; it receives the noise-perturbed data from the input layer nodes, performs aggregation and feature reconstruction, and outputs the reconstructed data for use by other devices and systems (such as the cloud). Terminal devices and input layer nodes are in a many-to-one relationship: each terminal device corresponds to one input layer node, and each input layer node corresponds to a plurality of terminal devices. Data is transmitted as streams both between the terminal devices and the input layer nodes and between the input layer nodes and the output layer node.
In the application scenario of this embodiment, the terminal device implements data acquisition and feature extraction in software and transmits feature data to the input layer node of the edge device by calling the API of the edge device platform. In software, the edge device uses Kafka to form a distributed computing framework: data is stored in Kafka brokers, each logical node of the edge device corresponds to a Kafka topic, and the corresponding task is executed as the data stream flows through the topic. The input layer nodes perform data stream aggregation and adaptively add differential privacy perturbation, and the output layer node performs data stream aggregation and feature reconstruction. Through these processes, the data stream output by the edge device satisfies the definition of differential privacy, so that sensitive information is not exposed to cloud-side analysis.
In the application scenario of this embodiment, the autoencoder (self-encoder) spans the terminal device and the edge device. Because the scheme must both reduce the data volume and add differential privacy perturbation, an undercomplete autoencoder is preferred. The encoder of an undercomplete autoencoder achieves an effect similar to Principal Component Analysis (PCA) and extracts the main features of the data. This embodiment preferably uses the undercomplete autoencoder architecture shown in fig. 3: the encoder has 4 layers of neurons (excluding the input layer) with (6, 5, 3, 3) neurons per layer, and the decoder has 4 layers of neurons (excluding the input layer) with (3, 4, 5, 8) neurons per layer. The autoencoder is trained offline, that is, it is trained in advance on a data set to obtain a trained undercomplete autoencoder.
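The layer widths given above can be checked with a small forward pass. This is a shape-only sketch: the weights are random and untrained, the tanh activation and the input width of 8 (chosen to match the decoder's output width) are assumptions, and `init_layers` and `forward` are hypothetical helper names.

```python
import math
import random

ENCODER_LAYERS = (6, 5, 3, 3)   # from the embodiment
DECODER_LAYERS = (3, 4, 5, 8)   # from the embodiment
INPUT_DIM = 8                   # assumption: matches the decoder's output width


def init_layers(input_dim, widths, rng):
    """Create one (weights, biases) pair per layer with small random values."""
    layers, fan_in = [], input_dim
    for width in widths:
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(fan_in)]
                   for _ in range(width)]
        layers.append((weights, [0.0] * width))
        fan_in = width
    return layers


def forward(layers, x):
    """Apply each dense layer followed by tanh (activation is an assumption)."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


rng = random.Random(0)
encoder = init_layers(INPUT_DIM, ENCODER_LAYERS, rng)
decoder = init_layers(ENCODER_LAYERS[-1], DECODER_LAYERS, rng)

x = [rng.random() for _ in range(INPUT_DIM)]
code = forward(encoder, x)               # 3 values leave the terminal device
reconstruction = forward(decoder, code)  # 8 values rebuilt on the edge device
```

With these widths, each record shrinks from 8 values to 3 before transmission, which is where the reduction in transmitted data comes from.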
In the application scenario of this embodiment, the trained encoder neurons of the undercomplete autoencoder (i.e., the encoder) run on the terminal device, and the decoder neurons (i.e., the decoder) run on the last logical node of the edge device (i.e., the output layer of the edge device) as shown in fig. 2, to reconstruct the features. Deploying the encoder on the terminal device and the decoder on the edge device reduces the amount of data transmitted. To secure user data, the disturbance noise that satisfies differential privacy is preferably added to the feature data on the edge device.
In the application scenario of this embodiment, once the reserved fields are determined, the undercomplete autoencoder must be trained. So that the selected field data can be fed into the autoencoder, each field is converted into a fixed-length string of k bits; in this embodiment a hash algorithm is used to align each field to k bits. The hash-aligned fields of each message form a new row record, the aligned records are combined into a message matrix by matrix operations, and the message matrices are in turn combined into the final training-set matrix. Finally, the training set is input into the undercomplete autoencoder for training with the loss function $L(x, g(f(z(x))))$, where $L(\cdot)$ is usually the mean squared error, $g(\cdot)$ is the decoder, $f(\cdot)$ is the encoder, and $z(\cdot)$ is the hash alignment operation. The encoder of the trained autoencoder runs on every terminal device, that is, each terminal device holds a copy of the encoder's neural network model, while the decoder runs on the edge device, where only one copy of the decoder runs.
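The hash alignment step z(·) can be sketched as follows. The source does not name the hash function or the value of k, so MD5 and k = 16 are assumptions here, and `hash_align` and `align_record` are hypothetical helper names; the sample record is made up.

```python
import hashlib


def hash_align(field, k):
    """Map any field value to a fixed-length list of k bits (MD5 is assumed)."""
    if not 0 < k <= 128:
        raise ValueError("MD5 yields 128 bits, so k must be in 1..128")
    digest = hashlib.md5(str(field).encode("utf-8")).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    return [int(b) for b in bits[:k]]


def align_record(record, fields, k):
    """Concatenate the k-bit codes of the reserved fields into one row."""
    row = []
    for name in fields:
        row.extend(hash_align(record[name], k))
    return row


# Made-up record for illustration only.
record = {"pickup_datatime": "2013-01-01 00:01:00", "total_amount": 10.5}
row = align_record(record, ["pickup_datatime", "total_amount"], k=16)
```

Every record maps to a row of the same width regardless of the original field types, so rows can be stacked directly into the message matrix and the training-set matrix.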
In the application scenario of this embodiment, as shown in fig. 2, the terminal device is a smartphone, and data acquisition, feature extraction and the related steps are all implemented in software. The whole data acquisition and feature extraction process on the smartphone works in units of a preset acquisition time window, and the acquisition time windows of different phones run asynchronously, that is, no communication or coordination between phones is needed. The specific process is as follows. Within an acquisition time window, a phone continuously acquires and caches the data to be collected at a shorter time interval, caching only the relevant fields that need to be retained. Considering the limited processing performance of the phone, this embodiment preferably processes the cached data in batches: as soon as the amount of cached data reaches one batch size, the data of that batch is hash-aligned and the aligned data is input into the encoder neural network to extract features. When the acquisition time is equal to or exceeds the acquisition time window threshold, hash alignment and feature extraction are carried out on the remaining data regardless of whether it fills a whole batch. Finally, all feature data extracted in the current acquisition time window is sent to the edge device.
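The batching rule above (process a batch as soon as it is full, and flush whatever remains when the window closes) can be sketched as below. The function name, the record format and the sample values are hypothetical; in the embodiment each batch would additionally be hash-aligned and passed through the encoder, which this sketch leaves out.

```python
def batch_within_window(records, batch_size, window_seconds):
    """Group records of one acquisition window into batches.

    records: iterable of (timestamp, payload), sorted by timestamp, with
    timestamps measured from the start of the window (assumed format).
    Full batches are emitted as soon as they fill; the remainder is flushed
    when the window threshold is reached, even if it is smaller than a batch.
    """
    batches, buffer = [], []
    for timestamp, payload in records:
        if timestamp >= window_seconds:
            break  # window threshold reached: stop collecting
        buffer.append(payload)
        if len(buffer) == batch_size:
            batches.append(buffer)
            buffer = []
    if buffer:
        batches.append(buffer)  # partial batch flushed at the end of the window
    return batches


# Seven samples taken every 0.5 s; the last one falls on the 3.0 s threshold.
samples = [(t * 0.5, f"rec{t}") for t in range(7)]
batches = batch_within_window(samples, batch_size=4, window_seconds=3.0)
```

With these sample values the first batch holds four records and the flushed remainder holds two; the record at the threshold belongs to the next window.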
In the application scenario of this embodiment, the edge device is composed of a plurality of PCs. It receives the feature data transmitted by the different terminal devices, adds disturbance noise (differential disturbance and differential privacy disturbance) to the feature data so that differential privacy is satisfied, and finally reconstructs the feature data for subsequent analysis in the cloud. Because the edge device is not a high-performance cloud computer, its performance and storage capacity are limited. Therefore, in this embodiment the edge device adopts a distributed computing framework: the Kafka data stream processing framework is deployed on the PCs. Kafka relies on ZooKeeper, a centralized service for maintaining configuration information, naming, providing distributed synchronization and providing group services. Using ZooKeeper, Kafka realizes distributed storage and redundant backup of data, and the number of redundant copies can be set through the ZooKeeper configuration file, which overcomes the storage capacity limit of a single device; distributed stream processing with Kafka likewise overcomes the performance limit of a single device.

In this embodiment, the edge device implements distributed computation and stream data processing with the Kafka data stream framework. As shown in fig. 2, Kafka topics correspond one to one to the logical nodes of the edge device. A node that receives the data streams of terminal devices is an input layer node, which aggregates the data stream and adds differential privacy disturbance to it. A node that outputs data is an output layer node, which aggregates the data streams output by the input layer nodes and reconstructs the feature data. As on the terminal devices, the input layer nodes and the output layer node each have their own time window: the first time windows of the input layer nodes are of identical length but are not executed synchronously; there is only one output layer node, and its second time window is independent of the first time windows of the input layer nodes. It should be noted that the edge device shown in fig. 2 is a logical architecture rather than a physical one: physically, the PCs together form a distributed data stream processing platform and do not have the hierarchical structure shown in fig. 2.
In the application scenario of this embodiment, a plurality of terminal devices (smartphones) wirelessly transmit the extracted feature data to the topics corresponding to the input layer nodes of the edge device through the Kafka Producer API. Each logical node of the input layer continuously receives and caches the feature data streams sent by the smartphones within its first time window; the feature data streams of the different smartphones are extracted, aggregated and cached using the Kafka Streams API. When the elapsed time is equal to or greater than the first time window threshold, then, to improve the usability of the subsequently reconstructed data, the adaptive disturbance noise budget $\varepsilon_i$ of each feature value received by the current logical node is calculated according to formula (1), disturbance noise is added to the cached data, and the perturbed data are combined into a new data stream and transmitted to the topic corresponding to the output layer node. The output layer node likewise aggregates, within its own second time window, the perturbed feature data streams from the different logical nodes of the input layer, obtaining and caching the data in the streams using the Kafka Consumer API. When the elapsed time is equal to or greater than the second time window threshold, the data cached in the second time window are converted into matrix form and input into the decoder neural network of the previously trained model for feature reconstruction, and the result is finally output to a remote cloud server for data analysis.
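The window behaviour of an input layer node can be pictured with a small in-memory stand-in. This sketch deliberately makes no Kafka calls: the class name, method names and device identifiers are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import defaultdict


class WindowAggregator:
    """In-memory stand-in for one input layer node's first time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.window_start = None
        self.buffer = defaultdict(list)  # device id -> cached feature vectors

    def add(self, now, device_id, features):
        """Cache one feature vector.

        Returns the data of the window that just closed when the elapsed
        time is equal to or greater than the threshold, otherwise None.
        """
        flushed = None
        if self.window_start is None:
            self.window_start = now
        elif now - self.window_start >= self.window_seconds:
            flushed = dict(self.buffer)  # hand the closed window downstream
            self.buffer = defaultdict(list)
            self.window_start = now
        self.buffer[device_id].append(features)
        return flushed


node = WindowAggregator(window_seconds=10.0)
first = node.add(0.0, "phone-a", [0.1, 0.2, 0.3])
second = node.add(4.0, "phone-b", [0.4, 0.5, 0.6])
closed = node.add(10.0, "phone-a", [0.7, 0.8, 0.9])  # threshold reached here
```

In the embodiment the flushed window would next receive disturbance noise and be published to the output layer node's topic; the output layer node would run the same kind of loop with its own, independent second time window.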
In the application scenario of this embodiment, the way the differential privacy disturbance noise is added affects both the security and the usability of the reconstructed data. In the prior art, the same disturbance is added to every feature value; in practice, however, the feature values do not contribute equally to the decoder output. This embodiment therefore adds disturbance noise with an adaptive algorithm: with security guaranteed (the total privacy budget is fixed), as much disturbance as possible is added to features that have little influence on the reconstructed data, and as little as possible to features that have a large influence, which improves the usability of the reconstructed data. Adding disturbance noise to the feature data according to formula (1) and formula (2) ensures the security of the data.
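The adaptive idea can be sketched in a few lines. Formula (1) appears only as an image in this text, so the allocation below is an assumption and not the patent's formula: the node budget is split in proportion to the number of connected devices, and the per-feature proportions are simply supplied by the caller. Only the noise step follows formula (2) of the claims, $f_i' = f_i + \mathrm{Lap}(\Delta h_0/\varepsilon_i)$. All function names and numeric values are hypothetical.

```python
import random


def node_budget(total_budget, n_total, n_node):
    """ASSUMPTION: share the total budget in proportion to connected devices."""
    return total_budget * n_node / n_total


def feature_budgets(node_eps, proportions):
    """Split a node budget across features; the proportions are caller-supplied
    (how formula (1) derives them is not recoverable from the text)."""
    total = sum(proportions)
    return [node_eps * p / total for p in proportions]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(features, budgets, sensitivity, rng):
    """Formula (2): add Laplace noise with scale sensitivity / eps_i per feature."""
    return [f + laplace_noise(sensitivity / eps, rng) for f, eps in zip(features, budgets)]


rng = random.Random(42)
eps_k = node_budget(total_budget=1.0, n_total=10, n_node=4)
eps_i = feature_budgets(eps_k, proportions=[0.5, 0.3, 0.2])
noisy = perturb([0.8, 0.1, 0.4], eps_i, sensitivity=1.0, rng=rng)
```

A larger budget for a feature means a smaller noise scale, so in this sketch the first feature, with the largest share, is disturbed least while the budgets still sum to the node budget.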
The foregoing describes only preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the invention has been disclosed with reference to the preferred embodiments, it is not limited to them. Therefore, any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the invention, without departing from the content of the technical solution of the invention, falls within the protection scope of the technical solution of the invention.

Claims (8)

1. A data stream differential privacy protection method based on edge calculation, characterized by comprising:
S1. an edge device receives feature data that has been collected by a terminal device and feature-extracted by a preset encoder;
S2. the feature data is aggregated and disturbance noise is added, specifically: according to a preset first time window, an input layer node in the edge device aggregates the feature data received from the terminal devices within the first time window, calculates the disturbance noise budget of the feature data, and adds disturbance noise to the feature data according to the disturbance noise budget;
S3. feature reconstruction is performed on the feature data with added disturbance noise through a preset decoder to obtain reconstructed data;
the encoder and the decoder are obtained by training the same self-encoder;
the disturbance noise budget is calculated according to formula (1):
[Formula (1): image FDA0002621498970000011, not reproduced in the text]
in formula (1), $\varepsilon$ is the preset total privacy budget, $n$ is the total number of terminal devices, $n_k$ is the number of terminal devices connected to the current input layer node, $\varepsilon_k$ is the privacy budget of the current input layer node, $\beta_i$ represents the proportion of each feature in the privacy budget of the current input layer node within the current first time window, $d$ represents the dimension of the feature, the symbol shown in image FDA0002621498970000012 (not reproduced in the text) represents the average correlation degree of the i-th input feature within the current first time window of the current input layer node, that is, the average Euclidean distance to the adjacent features calculated with the current feature as the central point, $f_j$ represents the j-th feature value within the current first time window of the current input layer node, and $\varepsilon_i$ is the privacy budget of the i-th input feature within the current first time window of the current input layer node.
2. The edge-computation-based data flow differential privacy protection method of claim 1, wherein:
in step S1, the feature data is obtained by the terminal device collecting data in units of a preset acquisition time window and performing feature extraction, through the preset encoder, on the data collected within one acquisition time window.
3. The edge-computation-based data flow differential privacy protection method of claim 2, wherein: disturbance noise is added to the feature data according to formula (2):
$f_i' = f_i + \mathrm{Lap}(\Delta h_0 / \varepsilon_i)$ (2)
in formula (2), $f_i'$ is the feature value after the disturbance noise is added, $f_i$ is the feature value before the disturbance noise is added, $\Delta h_0$ denotes the global sensitivity, $\mathrm{Lap}(\cdot)$ is the Laplace distribution, and $\varepsilon_i$ is the privacy budget of the i-th input feature within the current first time window of the current input layer node.
4. The edge-computation-based data flow differential privacy protection method of claim 3, wherein: step S3 specifically comprises: according to a preset second time window, the output layer node in the edge device receives and aggregates the feature data with added disturbance noise provided by the input layer nodes, and performs feature reconstruction on the feature data received within the second time window through the preset decoder to obtain the reconstructed data.
5. A data stream differential privacy protection system based on edge calculation, characterized by comprising an edge device configured to: receive feature data that has been collected by a terminal device and feature-extracted by a preset encoder; aggregate the feature data and add disturbance noise, specifically: the edge device comprises an input layer node, and the input layer node is configured to aggregate, according to a preset first time window, the feature data received from the terminal devices within the first time window, calculate the disturbance noise budget of each feature data, and add disturbance noise to the feature data according to the disturbance noise budget; and perform feature reconstruction on the feature data with added disturbance noise through a preset decoder to obtain reconstructed data; the encoder and the decoder are obtained by training the same self-encoder;
the disturbance noise budget is calculated according to formula (1):
[Formula (1): image FDA0002621498970000021, not reproduced in the text]
in formula (1), $\varepsilon$ is the preset total privacy budget, $n$ is the total number of terminal devices, $n_k$ is the number of terminal devices connected to the current input layer node, $\varepsilon_k$ is the privacy budget of the current input layer node, $\beta_i$ represents the proportion of each feature in the privacy budget of the current input layer node within the current first time window, $d$ represents the dimension of the feature, the symbol shown in image FDA0002621498970000022 (not reproduced in the text) represents the average correlation degree of the i-th input feature within the current first time window of the current input layer node, that is, the average Euclidean distance to the adjacent features calculated with the current feature as the central point, $f_j$ represents the j-th feature value within the current first time window of the current input layer node, and $\varepsilon_i$ is the privacy budget of the i-th input feature within the current first time window of the current input layer node.
6. The edge-computation-based data-flow differential privacy protection system of claim 5, wherein: disturbance noise is added to the feature data according to formula (2):
$f_i' = f_i + \mathrm{Lap}(\Delta h_0 / \varepsilon_i)$ (2)
in formula (2), $f_i'$ is the feature value after the disturbance noise is added, $f_i$ is the feature value before the disturbance noise is added, $\Delta h_0$ denotes the global sensitivity, $\mathrm{Lap}(\cdot)$ is the Laplace distribution, and $\varepsilon_i$ is the privacy budget of the i-th input feature within the current first time window of the current input layer node.
7. The edge-computation-based data-flow differential privacy protection system of claim 6, wherein: the edge device comprises an output layer node, and the output layer node is configured to receive and aggregate, according to a preset second time window, the feature data with added disturbance noise provided by the input layer nodes, and to perform feature reconstruction on the feature data received within the second time window through the preset decoder to obtain the reconstructed data.
8. The edge-computation-based data stream differential privacy protection system of any one of claims 5 to 7, wherein:
the terminal device is configured to collect data in units of a preset acquisition time window, perform feature extraction on the data within the acquisition time window through the preset encoder to obtain feature data, and provide the feature data to the edge device.
CN201811379012.4A 2018-11-19 2018-11-19 Data stream differential privacy protection method and system based on edge calculation Active CN109495476B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811379012.4A CN109495476B (en) 2018-11-19 2018-11-19 Data stream differential privacy protection method and system based on edge calculation


Publications (2)

Publication Number Publication Date
CN109495476A CN109495476A (en) 2019-03-19
CN109495476B true CN109495476B (en) 2020-11-20

Family

ID=65696894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811379012.4A Active CN109495476B (en) 2018-11-19 2018-11-19 Data stream differential privacy protection method and system based on edge calculation

Country Status (1)

Country Link
CN (1) CN109495476B (en)




Also Published As

Publication number Publication date
CN109495476A (en) 2019-03-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant