US20230060864A1 - Data collection and analysis system and device - Google Patents

Data collection and analysis system and device

Info

Publication number
US20230060864A1
Authority
US
United States
Prior art keywords
data stream
character
original
random number
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/969,447
Inventor
Yao-Tung TSOU
Hao Zhen
Ching-Ray Chang
Sy-Yen Kuo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Etron Technology Inc
Original Assignee
Etron Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Etron Technology Inc filed Critical Etron Technology Inc
Priority to US17/969,447 priority Critical patent/US20230060864A1/en
Publication of US20230060864A1 publication Critical patent/US20230060864A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/58 Random or pseudo-random number generators
    • G06F7/588 Random number generators, i.e. based on natural stochastic processes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/58 Random or pseudo-random number generators
    • G06F7/582 Pseudo-random number generators
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/0643 Hash functions, e.g. MD5, SHA, HMAC or f9 MAC
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0861 Generation of secret information including derivation or calculation of cryptographic keys or passwords
    • H04L9/0866 Generation of secret information including derivation or calculation of cryptographic keys or passwords involving user or device identifiers, e.g. serial number, physical or biometrical information, DNA, hand-signature or measurable physical characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00 Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/42 Anonymization, e.g. involving pseudonyms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00 Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/80 Wireless
    • H04L2209/805 Lightweight hardware, e.g. radio-frequency identification [RFID] or sensor

Definitions

  • the present invention relates to a data collection and analysis method and a related device thereof, and particularly to a method and a related device that can utilize a first noise step and a second noise step to de-identify identification information in an original data stream.
  • the randomized response mechanism [17], which can address the aforementioned dilemma, has drawn considerable interest from the theory community.
  • the concept of randomized response is to induce noise locally before sharing data with any curators.
  • randomized response mechanisms can provide a rigorous privacy guarantee by the definition of local differential privacy while having broader application scenarios.
  • randomized response mechanisms can provide a rigorous privacy guarantee while satisfying differential privacy.
  • individuals have “plausible deniability” that attackers cannot infer any sensitive information with high confidence, regardless of the attackers' background knowledge.
  • Randomized response was proposed by Warner [17] in 1965 as a survey technique for sensitive questions. After more than 40 years, Dwork et al. proposed a robust and mathematically rigorous definition of privacy in [7] and formally named it; they also proposed the definition of differential privacy in [6].
  • the local model of private learning was first described by Kasiviswanathan et al. [16], who pioneered the practice of connecting the randomized response to differential privacy. Later, Chan et al. [18] proved that randomized response has optimal lower bound in the locally differentially private model, referred to as local differential privacy.
  • LDPMiner is a two-phase mechanism that first uses part of the privacy budget ε [6] to generate a candidate set of heavy hitters and then uses the remaining part to refine the results.
  • LDPMiner expanded the application scenario of RAPPOR, which focused on heavy hitter discovery in set-valued data instead of categorical data.
  • PRNGs pseudorandom number generators
  • CSPRNGs cryptographically secure pseudo-random number generators
  • An embodiment of the present invention provides a data collection and analysis method.
  • the method includes applying a first noise step to an original data stream with an original character to generate a first data stream with a first character; and applying a second noise step to the first data stream to generate a second data stream with a second character, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.
  • Another embodiment of the present invention provides a data collection and analysis method.
  • the method includes applying a first noise step to an original data stream with a featured distribution to generate a first data stream with a first distribution; and applying a second noise step to the first data stream to generate a second data stream with a second distribution, wherein a first variation between the featured distribution and the first distribution is greater than a second variation between the featured distribution and the second distribution.
  • the device includes a first processor and a second processor.
  • the first processor applies a first noise step to an original data stream with an original character to generate a first data stream with a first character; and the second processor applies a second noise step to the first data stream to generate a second data stream with a second character, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.
  • the device includes a true random number generator and a processor unit.
  • the true random number generator generates a plurality of random numbers without the need for a seed.
  • the processor unit, based on the plurality of random numbers, de-identifies identification information in an original data stream with an original character and generates a second data stream with a second character.
  • the second character is substantially similar to the original character.
  • TRNGs true random number generators
  • TRNGs should be considered as a fundamental building block of privacy-preserving mechanisms.
  • TRNGs are implemented in hardware and generate a sequence of random numbers using a nondeterministic physical event such as magnetization variation of a ferromagnetic layer and flux transmission in a single flux quantum device.
  • the initial state in TRNGs is truly unknown while the initial state in PRNGs/CSPRNGs must be manually kept secret.
  • the major drawback of TRNGs is scalability, which is especially important for the Internet of Things (IoT), which is expected to handle a growing amount of data securely.
  • a magnetic tunnel junction (MTJ) can be referred to as a spintronics-based TRNG.
  • Binary random bits are generated by using the stochastic nature of spin-transfer-torque (STT) switching in MTJs. Owing to the scalability of spin-torque switching [12], the MTJ can operate as a scalable TRNG and can be integrated on a chip with high density and low power consumption.
  • STT spin-transfer-torque
  • the intuitive way is to shuffle the primitive data while ensuring the randomness inside the algorithms through a series of elaborate encoding techniques and analysis mechanisms.
  • the present invention establishes the spintronics-based private aggregatable randomized response (SPARR), an advanced data collection and analysis mechanism that conforms to the differential privacy requirements by using a multilayer randomized response based on a set of MTJs.
  • SPARR spintronics-based private aggregatable randomized response
  • the main contributions of the present invention include: 1) the present invention proposes the multilayer randomized response mechanism, which can significantly improve the accuracy of data analysis while satisfying the definition of local differential privacy; 2) the present invention leverages a set of MTJs as a TRNG to generate unpredictable random bits and design an approach to convert random bits into random numbers between 0 and 1, in which the TRNG can be seamlessly integrated with multilayer randomized response mechanism and used to strengthen the randomness of our algorithm's outputs; 3) the present invention evaluates the method by a sequence of experiments in both simulation and real-world environment to verify that the method outperforms prior works.
  • FIG. 1 is a diagram illustrating a model of crowd sensing and collection for SPARR.
  • FIG. 2 is a diagram illustrating Hash encoding, the permanent randomized response (PRR), the instantaneous randomized response (IRR), and the synthetic randomized response (SRR).
  • FIG. 3 is a diagram illustrating the original data stream having the featured distribution, the first data stream having the first distribution, and the second data stream having the second distribution.
  • FIG. 4 is a diagram illustrating the multilayer randomized response from the perspective of conditional probability.
  • FIGS. 5 A- 5 C are diagrams illustrating comparison of the false negative rate, total variation distance, and allocated mass in varying k, m, and N.
  • FIGS. 6 A- 6 C are diagrams illustrating comparison of the false negative rate, total variation distance, and allocated mass in varying ε.
  • FIG. 8 is a diagram illustrating comparison of SPARR and RAPPOR when using Kosarak under different ε.
  • SPARR mainly focuses on two aspects that are distinct from the aforementioned schemes: (1) the present invention employs a set of MTJs as the spintronics-based TRNG, which can provide rigorous privacy protection; (2) the present invention proposes a multilayer randomized response mechanism to protect data privacy and improve data utility, and uses the false negative rate, the total variation distance, and the allocated mass as metrics to prove that SPARR can achieve significantly more favorable performance than prior works.
  • the present invention considers a model composed of unconditionally trusted clients 102 (data generation), semi-trusted storage servers 104 (data collection), and analysts 106 (data analysis), as shown in FIG. 1 .
  • the authorization between clients and analysts is appropriately conducted off-line or on-line.
  • the authorization is out of the scope of the present invention. More details regarding the authorization of clients can be found in [19].
  • storage servers can collect sanitized values and strings transmitted from large numbers of clients. Moreover, analysts are permitted to do statistics on these client-side sanitized values and strings, such as histograms, frequencies, and other statistical indicators for finding their app preferences, historical activities, or other information. For any given sanitized value or string reported, SPARR can guarantee a strong plausible deniability for the reporting client through a sequence of encoding steps, as measured by an ε-differential privacy bound. In doing so, SPARR strictly limits the private information leaked.
  • client-side private data can be disclosed in many ways. Assuming that storage servers and analysts are honest-but-curious, they may leak private information unintentionally by publishing data analyses or may violate privacy intentionally by gathering sensitive data. There are several attack types; for example, an attacker may poach data stored on servers or attempt to eavesdrop on communication between clients and servers. To remedy these attacks, the present invention adopts a local privacy-preserving mechanism that is implemented on each client and sanitizes any information before it is outsourced by the client.
  • the local privacy-preserving mechanism satisfying the definition of ε-differential privacy (called local differential privacy) can provide a rigorous privacy guarantee, regardless of the attackers' background knowledge.
  • FIG. 2 is a diagram illustrating Hash encoding, permanent randomized response (PRR), instantaneous randomized response (IRR), and synthetic randomized response (SRR), wherein detailed descriptions of FIG. 2 are as follows.
  • N represents the number of reports.
  • m represents the number of cohorts.
  • h represents the number of hash functions (that is, Hash encoding), wherein a pre-processor receives an input data stream (e.g. the client-side private data stream) and applies Hash encoding to the input data stream to generate an original data stream with an original character (e.g. positions of “1” of the original data stream shown in FIG. 2 ), wherein the input data stream corresponds to a plurality of users.
  • the pre-processor can be a field programmable gate array (FPGA) with the above mentioned functions of the pre-processor, or an application-specific integrated circuit (ASIC) with the above mentioned functions of the pre-processor, or a software module with the above mentioned functions of the pre-processor.
  • ε represents the privacy budget of differential privacy.
  • q′ represents the probability of generating 1 in the report s_i′ if the Bloom filter bit b_i is set to 1.
  • p′ represents the probability of generating 1 in the report s_i′ if the Bloom filter bit b_i is set to 0.
  • A represents the number of unique client-side strings.
  • a randomized algorithm M is ε-differentially private if for all S_M ⊆ Range(M) and all neighboring datasets D1 and D2 (shown in equation (1)).
  • the probability is over the coin flips of the mechanism M, and ε is called the privacy budget and determines the extent of privacy leakage. A smaller ε will provide better privacy at the price of lower accuracy.
  • the local model of differential privacy [16], namely local differential privacy, considers a situation in which there is no trusted curator. Individuals hold their own private data and release it to curators in a differentially private manner.
  • the dataset D will evolve into a sequence of client strings d, and the neighboring datasets D1 and D2 will also evolve into two distinct strings d1 and d2. Therefore, a local randomized algorithm M is ε-differentially private if for all S_M ⊆ Range(M) and every pair of distinct strings d1 and d2 (shown in equation (2)).
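Equations (1) and (2) are cited above but not reproduced in this text. Under the standard definition, the local variant (equation (2)) would read as follows; this is a reconstruction from the surrounding definitions, not a quotation of the patent's formula:

```latex
\Pr\left[ M(d_1) \in S_M \right] \;\le\; e^{\varepsilon} \cdot \Pr\left[ M(d_2) \in S_M \right]
```

for every pair of distinct client-side strings d1 and d2 and all S_M ⊆ Range(M); equation (1) is the same inequality with neighboring datasets D1 and D2 in place of d1 and d2.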
  • Randomized response is a technique developed long before differential privacy. It uses secret coin flips as random events to generate the answers to sensitive questions, such as “Are you a homosexual?” An individual would answer this truthfully only if the coin comes up tails. Otherwise, the individual will flip a second coin to determine the fake answer, and respond “Yes” if heads and “No” if tails. Randomized response is a proven efficient mechanism that satisfies local differential privacy [18].
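The two-coin procedure just described can be sketched in a few lines; the helper names are illustrative, and Python's software PRNG merely stands in for a true randomness source:

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Answer a sensitive yes/no question with plausible deniability."""
    if rng.random() < 0.5:         # first coin: tails -> answer truthfully
        return truth
    return rng.random() < 0.5      # second coin: heads -> "Yes", tails -> "No"

def estimate_true_proportion(reports) -> float:
    """Unbias the aggregate: P(report "Yes") = 0.5 * p_true + 0.25."""
    return 2.0 * (sum(reports) / len(reports)) - 0.5
```

Any single "Yes" is deniable (it may have come from the second coin), yet the aggregate proportion of truthful "Yes" answers remains estimable from many reports.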
  • SPARR includes two key elements, namely multilayer randomized response and spintronics-based encoding, to provide a high-utility truly randomized response with a rigorous data privacy guarantee.
  • the first layer is called the permanent randomized response (PRR), which is similar to the initial randomized response in Section III-A.
  • PRR permanent randomized response
  • the result of PRR, b_i′, is generated by coins 1 and 2, where the first coin is an unfair coin that comes up heads with probability f. If the result of the coin flip is heads, b_i′ will be determined by the second, fair coin.
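As a minimal sketch of this first-layer coin flip (with Python's PRNG standing in for the spintronics-based TRNG, and omitting RAPPOR-style per-client memoization of b_i′):

```python
import random

def prr_bit(b: int, f: float, rng: random.Random) -> int:
    """One bit of the permanent randomized response (PRR).

    First coin: unfair, heads with probability f. On heads the bit is
    replaced by a fair second coin; on tails the true bit is kept.
    """
    if rng.random() < f:                  # first coin: heads
        return int(rng.random() < 0.5)    # fair second coin decides b_i'
    return b                              # tails: keep the true bit
```

With f = 0 the Bloom filter bit passes through unchanged; with f = 1 the output is a fair coin regardless of the input.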
  • the second layer, the instantaneous randomized response (IRR), is created to protect longitudinal security [4], wherein a first processor can apply the permanent randomized response at least one time to the original data stream based on a first random number set generated by a true random number generator to generate a temporal data stream (shown in FIG. 2 ), and apply the instantaneous randomized response at least one time to the temporal data stream based on a second random number set generated by the true random number generator to generate a first data stream with a first character (shown in FIG. 2 ).
  • IRR instantaneous randomized response
  • the permanent randomized response and the instantaneous randomized response are included in a first noise step, and identification information in the original data stream is de-identified after the first processor applies the first noise step to the original data stream.
  • the first processor can be a field programmable gate array with the above mentioned functions of the first processor, or an application-specific integrated circuit with the above mentioned functions of the first processor, or a software module with the above mentioned functions of the first processor.
  • the last layer, the synthetic randomized response (SRR), is constructed in SPARR on the basis of PRR and IRR to strengthen the features of b_i that are kept in s_i′ while preserving the randomness of the results, wherein a second processor applies the synthetic randomized response at least one time to the first data stream based on a third random number set generated by the true random number generator to generate a second data stream with a second character (shown in FIG. 2 ), and the synthetic randomized response is included in a second noise step.
  • a first variation between the original character and the first character is greater than a second variation between the original character and the second character (shown in FIG. 2 ).
  • the synthetic randomized response can recover and intensify the original character to make the second data stream approach the original data stream.
  • positions of “1” in the original data stream are similar to positions of “1” in the second data stream.
  • the present invention therefore employs s i ′ to efficiently reconstruct the client-side strings, even though these strings have a low frequency.
  • SRR operates the last coin, as shown in Table I, wherein a function of SRR is to reduce the shift caused by PRR and IRR.
  • the present invention designs the weight of this coin through the synthetic consideration of the values of b_i, b_i′, and s_i.
  • an output circuit can output the original data stream, the first data stream, and the second data stream to server(s) on the Internet.
  • FIG. 3 is a diagram illustrating the original data stream having the featured distribution, the first data stream having the first distribution, and the second data stream having the second distribution.
  • the pre-processor applies Hash encoding to input data streams (e.g. client-side private data streams) to generate the original data stream
  • the original data stream has the featured distribution
  • the first processor applies the permanent randomized response to the original data stream and applies the instantaneous randomized response to the temporal data stream to generate the first data stream
  • the first data stream has the first distribution
  • the second processor applies the synthetic randomized response to the first data stream to generate the second data stream
  • the second data stream has the second distribution.
  • the server(s) on the Internet receive original data streams, first data streams, and second data streams
  • the server(s) can plot FIG. 3 according to featured distribution of the original data streams, first distribution of the first data streams, and second distribution of the second data streams, wherein as shown in FIG. 3 , a third variation between the featured distribution and the first distribution is greater than a fourth variation between the featured distribution and the second distribution.
  • a data collection and analysis device includes a true random number generator, a processor unit, a pre-processor, and an output circuit, wherein the processor unit includes a first processor and a second processor.
  • the true random number generator can generate a plurality of random numbers without the need for a seed (e.g. a first random number set, a second random number set, and a third random number set).
  • the pre-processor receives an input data stream (e.g. the client-side private data stream) and applies Hash encoding to the input data stream to generate an original data stream with an original character (e.g. positions of “1” of the original data stream shown in FIG. 2 ).
  • the first processor can apply the first noise step (PRR and IRR) to the original data stream based on the first random number set and the second random number set to de-identify the original data stream to generate a first data stream with a first character; and after the first data stream is generated, the second processor can apply the second noise step (SRR) to the first data stream based on the third random number set to generate a second data stream with a second character, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.
  • the output circuit can output the second data stream to a remote server (on the Internet).
  • the present invention can quantitatively interpret SPARR from the perspective of conditional probability, as shown in FIG. 4 .
  • P{s_i | b_i} denotes the conditional probability of the output s_i given the Bloom filter bit b_i.
  • the present invention has lemmas 1 and 2 as follows:
  • Lemmas 1 and 2 can be evidenced in FIG. 4 .
  • the probability of outputs is based on the coin flips in algorithm M.
  • the results of coin flips can be considered a random bit string in M.
  • a traditional PRNG/CSPRNG can be superseded by a TRNG.
  • the present invention adopts a set of MTJs as a TRNG, which is viewed as a spintronics-based TRNG.
  • the operation of controlling an MTJ to generate random bits is as follows. There are two states for an MTJ [12][3]: Anti-parallel (AP) and Parallel (P), which are assigned to the binary values “0” and “1”, respectively.
  • the initial state (that is, a seed) of the MTJ is truly unknown, so the MTJ does not need an initial state. The MTJ can therefore generate random numbers without a seed, preventing the privacy leaks caused by the periodicity problem of a seeded generator.
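The seed problem avoided here can be illustrated with standard library generators; `secrets` (the OS entropy pool) merely stands in for the seedless MTJ source:

```python
import random
import secrets

# A seeded PRNG is fully determined by its initial state: anyone who learns
# (or guesses) the seed can replay every "random" draw, and the output
# stream is ultimately periodic.
a = random.Random(42)
b = random.Random(42)
assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]

# A seedless source has no initial state to keep secret, leak, or repeat;
# two 16-byte draws collide only with negligible probability (about 2**-128).
assert secrets.token_bytes(16) != secrets.token_bytes(16)
```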
  • the free-layer magnetization of the MTJ is excited to a bifurcation point by the excite pulse.
  • thermal agitation can cause a small random deviation of magnetization.
  • the magnetization will relax to the AP or P state with the same probability of 50%.
  • the present invention determines whether the final state is AP or P by measuring the resistance, and thus the present invention can obtain a random bit.
  • the present invention leverages eight subsystems independently to generate a random bit string R_i using the stochastic spin-torque switching in MTJs. Subsequently, three rounds of exclusive OR operations are executed to generate the final result (shown in equation (6)).
  • XOR_3 denotes the final result of the random bit string, and ⊕ denotes an exclusive OR operation.
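Equation (6) is not reproduced in this text; assuming it describes a pairwise XOR tree over the eight subsystems' bit strings (8 → 4 → 2 → 1 in three rounds), the combining step can be sketched as:

```python
def xor_bits(a, b):
    """Bitwise XOR of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]

def whiten(streams):
    """Pairwise-XOR rounds over a power-of-two number of bit streams.

    XOR-combining independent, slightly biased sources reduces bias: if
    each input bit is 1 with probability 0.5 + e, the XOR of two
    independent bits is 1 with probability 0.5 - 2*e*e, a strictly
    smaller bias whenever |e| < 0.5.
    """
    while len(streams) > 1:
        streams = [xor_bits(streams[i], streams[i + 1])
                   for i in range(0, len(streams), 2)]
    return streams[0]
```

For eight input streams the loop performs exactly three rounds, matching the three XOR rounds described above.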
  • the MTJ is an emerging magnetic material with properties of high endurance, low power consumption, and rapid access. Moreover, it is easily integrated into many devices such as those in the IoT.
  • the MTJ is a material used in Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM).
  • STT-MRAM Spin-Transfer Torque Magnetic Random Access Memory
  • STT-MRAM, a non-volatile memory, has the potential to become a leading storage technology, as it is a high-performance memory that can challenge DRAM, SRAM, and low-cost flash memory.
  • the potential advantages of MTJs are the reasons for the present invention to adopt them as a component.
  • IRR instantaneous randomized response
  • TRNG( ) a true random number generator implemented by a set of MTJs
  • MTJs can only generate random bits. Therefore, the present invention must design an approach to convert random bits into random numbers between 0 and 1.
  • Algorithm 1 shows the process that uses a set of MTJs to generate random numbers.
  • the length of the random bit sequence l should be carefully selected because it will decide the granularity of the random numbers.
  • Input: length of random bit sequence l ∈ ℕ
  • Output: random number x* ∈ [0,1]
  • Step 3: Convert x into a random number x*.
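A sketch of Algorithm 1 follows. The text above does not spell out the conversion formula, so the mapping x* = x / (2^l − 1) is an assumption, and `secrets.randbits` stands in for the MTJ bit source:

```python
import secrets

def trng_uniform(l: int) -> float:
    """Algorithm 1 sketch: map l random bits to a number in [0, 1].

    The divisor 2**l - 1 is one natural choice: it sends the all-zero
    bit string to 0.0 and the all-one string to 1.0. l fixes the
    granularity of the output, since adjacent representable values
    differ by 1 / (2**l - 1).
    """
    x = secrets.randbits(l)        # draw l bits, read them as an integer
    return x / (2 ** l - 1)        # convert x into x* in [0, 1]
```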
  • Algorithm 2 demonstrates randomized data encoding via TRNG( ) in SPARR. For each bit b_i′, TRNG( ) is employed to generate a random number x* (step 1 of Algorithm 2), and then x* is compared with the probability q^(b_i′)·p^(1−b_i′) (step 2 of Algorithm 2). If x* is less than q^(b_i′)·p^(1−b_i′), s_i is set to 1; otherwise, s_i is set to 0 (steps 2-5 of Algorithm 2).
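Algorithm 2's comparison step can be sketched per bit as follows; `rng.random()` stands in for the TRNG( ) call of Algorithm 1, and the function name is illustrative:

```python
import random

def irr_bit(b_prime: int, p: float, q: float, rng: random.Random) -> int:
    """Algorithm 2 sketch: instantaneous randomized response for one bit.

    Report s_i = 1 with probability q when b_i' = 1 and with probability
    p when b_i' = 0; the comparison threshold q**b' * p**(1 - b')
    selects between the two cases.
    """
    threshold = (q ** b_prime) * (p ** (1 - b_prime))  # q if b'=1, p if b'=0
    x_star = rng.random()                              # step 1: draw x*
    return 1 if x_star < threshold else 0              # steps 2-5: compare
```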
  • Input: a resultant bit of PRR b_i′, and probability parameters p and q
  • SPARR is an ε-differentially private algorithm, wherein the definition of ε is given by equation (7):
  • v_a and v_b are two distinct client-side strings, and their Bloom filter bits are set by equation (8):
  • the ratio R_P must be bounded by e^ε. Therefore, the present invention can calculate the privacy budget ε by equation (7).
  • NIST-SP800 [2] a statistical test suite
  • the present invention computes the proportion of sequences as shown in Table II to indicate whether the random bits passed the test or not.
  • Section VI-A the present invention will introduce three metrics that are used to evaluate the effects of RAPPOR and SPARR.
  • Sections VI-B and VI-C the present invention will evaluate the present invention using three simulated examples and one real-world collected example, respectively.
  • the three simulated examples use normal, zipf1, and exponential distributions to demonstrate the impacts of varying ε, k, m, and N on RAPPOR and SPARR.
  • the real-world collected example, which is based on the Kosarak dataset [1], is used to demonstrate the impact of varying ε on RAPPOR and SPARR.
  • A is the actual number of unique client-side strings
  • a_i, i ∈ {1, 2, . . . , A}
  • Rr and Rs be the number of unique client-side strings reconstructed by RAPPOR and SPARR, respectively
  • rr_i and rs_i be the proportion of each reconstructed string.
  • FNr and FNs denote the false negative rates for RAPPOR and SPARR, respectively.
  • FNr and FNs are defined by equation (14):
  • the total variation distance is a distance measure for two probability distributions. Informally, it is the largest possible difference between the probabilities that the two probability distributions can assign to the same event. In a finite probability space, the total variation distance is related to the ℓ1 norm.
  • TVr and TVs denote the total variation distances for RAPPOR and SPARR, respectively. Formally, TVr and TVs are defined by equation (15):
  • the factor 1/2 is a normalization term that limits the total variation distance to between 0 and 1.
  • the allocated mass is the total proportion of reconstructed strings.
  • the present invention uses AMr and AMs to denote the allocated masses for RAPPOR and SPARR, respectively.
  • AMr and AMs are defined by equation (16):
  • AM_r = Σ_i rr_i , AM_s = Σ_i rs_i (16)
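Since equations (14)-(16) are referenced but not reproduced in this text, the three metrics can be sketched from their prose descriptions; the exact form of the false negative rate (missed strings over A) is an assumed reading:

```python
def false_negative_rate(actual, reconstructed):
    """Fraction of the A actual unique strings that were not reconstructed
    (an assumed reading of equation (14))."""
    actual, reconstructed = set(actual), set(reconstructed)
    return len(actual - reconstructed) / len(actual)

def total_variation(p, q):
    """Half the l1 distance between two discrete distributions (eq. (15));
    the 1/2 factor normalizes the result to [0, 1]."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def allocated_mass(proportions):
    """Total proportion assigned to reconstructed strings (eq. (16))."""
    return sum(proportions)
```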
  • the present invention first compares SPARR with RAPPOR over a sequence of simulations and separates the experiments into two parts.
  • the present invention sets k varying from 4 to 32.
  • SPARR can reduce the false negative rate and the total variation distance by around 37% and 13% on average, respectively.
  • SPARR can increase the allocated mass by around 14% on average.
  • the advantages of SPARR become more apparent as k gradually decreases. This means that SPARR can still achieve good accuracy of data prediction in a harsh network with less bandwidth.
  • SPARR can improve 51%, 20%, and 18% for the false negative rate, the total variation distance, and the allocated mass on average, respectively, if the underlying distribution of the strings' frequencies is a normal distribution.
  • SPARR can improve 67%, 16%, and 17% for the false negative rate, the total variation distance, and the allocated mass on average, respectively.
  • SPARR can improve 55%, 17%, and 15% for the false negative rate, the total variation distance, and the allocated mass on average, respectively. Hence, SPARR can outperform RAPPOR on these metrics, regardless of the distributions.
  • SPARR significantly improves the detection of client-side strings for the low frequencies compared with RAPPOR while maintaining high reconstruction of collected strings.
  • Due to the limitations of randomized response and statistical inference, the present invention still needs a large number of reports to find the unique pages and their clicks. This is also the trade-off between privacy and utility mentioned in the related literature [7][13]. However, as demonstrated later, the present invention can achieve better privacy while recovering more pages that have a lower click-through rate (CTR).
  • SPARR is a practical data protection mechanism based on physical events from MTJs for crowdsourced data collection, with a high-utility and mathematically rigorous privacy guarantee. It employs a set of MTJs as a spintronics-based TRNG to derive true random numbers. With the spintronics-based TRNG and the design of four coin flips, SPARR can preserve privacy, crowdsource population statistics on data collected from individuals, and accurately decode this data. Also, the present invention will apply deep learning techniques for in-memory computing to improve the efficiency and accuracy of data analysis, and will be designed to adapt to most data analysis applications.


Abstract

A data collection and analysis method includes applying a first noise step to an original data stream with an original character to generate a first data stream with a first character; and applying a second noise step to the first data stream to generate a second data stream with a second character, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation application of U.S. Non-Provisional application Ser. No. 16/286,627, filed on Feb. 27, 2019, which claims the benefit of U.S. Provisional Application No. 62/636,857, filed on Mar. 1, 2018 and entitled “SPINTRONICS-BASED PRIVATE AGGREGATABLE RANDOMIZED RESPONSE (SPARR) FOR CROWDSOURCED DATA COLLECTION AND ANALYSIS”, the contents of which are incorporated herein by reference.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The present invention relates to a data collection and analysis method and a related device thereof, and particularly to a method and a related device that can utilize a first noise step and a second noise step to de-identify identification information in an original data stream.
  • 2. Description of the Prior Art
  • In contemporary society, data is crucial for both institutions and individuals. However, they approach data differently. Institutions such as corporations and academic institutes wish to obtain useful information from aggregated user data to improve the pertinence of services or formulate development strategies. By contrast, individuals share their data with interested third parties to obtain various potential benefits but prefer to ensure that their private information, such as application (app) usage, locations visited, and web browsing history, is not revealed. People therefore face a dilemma between maximizing the quality of experience and minimizing the extent of privacy leakage.
  • The randomized response mechanism [17], which can address the aforementioned dilemma, has drawn considerable interest from the theory community. The concept of randomized response is to induce noise locally before sharing data with any curators. In contrast to centralized differential privacy mechanisms [6][7] or encryption-based privacy-preserving mechanisms [8][10], which need the assumption of a trusted third party or can be used only in a limited range of applications, randomized response mechanisms can provide a rigorous privacy guarantee by the definition of local differential privacy while having broader application scenarios. In particular, randomized response mechanisms can provide a rigorous privacy guarantee while satisfying differential privacy. In other words, individuals have "plausible deniability": attackers cannot infer any sensitive information with high confidence, regardless of the attackers' background knowledge.
  • Randomized response was proposed by Warner [17] in 1965 as a survey technique for sensitive questions. After more than 40 years, Dwork et al. proposed a robust and mathematically rigorous definition of privacy in [7] and formally named it; they also proposed the definition of differential privacy in [6]. The local model of private learning was first described by Kasiviswanathan et al. [16], who pioneered the practice of connecting randomized response to differential privacy. Later, Chan et al. [18] proved that randomized response has an optimal lower bound in the locally differentially private model, referred to as local differential privacy.
  • In recent years, the local model has received increasing attention because it does not require a trusted data curator [15]. In practical applications, people want to know which elements occur most frequently among all items, referred to as the heavy-hitters problem. Erlingsson et al. developed the randomized aggregatable privacy-preserving ordinal response (RAPPOR) [21], which uses the Bloom filter [5] to represent the true client-side string and releases an obfuscated version after a two-layer randomized response. One of the greatest contributions of RAPPOR is its delicate decoding framework for learning statistics, which can not only identify the heavy hitters but also rebuild the frequency distribution.
  • Since the development of RAPPOR, many studies of private learning have been conducted under the local model. An extended version of RAPPOR was proposed by Fanti et al. [11]. They presented a new decoding algorithm to address two problems in RAPPOR: (1) aggregators can only determine the marginal distribution and not joint distribution; (2) aggregators can only decode efficiently under a precise data dictionary. However, in targeting these two problems, they sacrifice the capability to accurately reconstruct data. After decoding, aggregators could only observe a few clients' strings that appear with high frequency.
  • Qin et al. [23] devised a two-phase mechanism named LDPMiner, which first uses part of the privacy budget ε [6] to generate a candidate set of heavy hitters and then uses the remaining part to refine the results. LDPMiner expanded the application scenario of RAPPOR, focusing on heavy hitter discovery in set-valued data instead of categorical data.
  • Wang et al. [20] introduced the OLH protocol to determine the optimal parameters for RAPPOR. However, OLH applies only to heavy hitter discovery when the domain of user values is small. By contrast, RAPPOR and the method of the present invention do not have this constraint. Sei and Ohsuga [22] proposed S2M and S2Mb and used mean square errors (MSEs) and Jensen-Shannon (JS) divergence to illustrate that both can achieve utility similar to RAPPOR. Although [20] and [22] were significant to the development of RAPPOR, the different evaluation indicators mean that the present invention cannot make a direct comparison with them.
  • In addition, other works that differ from RAPPOR also inspired the present invention. Bassily and Smith [14] gave protocols that produce a succinct histogram, that is, the heavy hitters with the number of times they appear, and showed that their protocols match the lower bounds for frequency estimation. Papernot et al. [13] demonstrated PATE, a machine learning strategy to preserve sensitive training data. PATE trains "teacher" models on disjoint subsets (e.g., different subsets of clients) of the sensitive data, and then a "student" model learns to predict an output chosen by noisy voting among all of the "teacher" models.
  • It is noteworthy that the randomness of randomized response mechanisms, such as [11][20][21], comes from coin flips controlled by pseudorandom number generators (PRNGs) or cryptographically secure pseudorandom number generators (CSPRNGs). The quality of the random number function greatly affects the degree of privacy protection, and the insecurity can be seen directly. More precisely, PRNGs/CSPRNGs are implemented in software and use deterministic algorithms, such as /dev/urandom [9], to generate a sequence of random numbers, which is safe for cryptographic use only if the seed is selected correctly.
  • SUMMARY OF THE INVENTION
  • An embodiment of the present invention provides a data collection and analysis method. The method includes applying a first noise step to an original data stream with an original character to generate a first data stream with a first character; and applying a second noise step to the first data stream to generate a second data stream with a second character, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.
  • Another embodiment of the present invention provides a data collection and analysis method. The method includes applying a first noise step to an original data stream with a featured distribution to generate a first data stream with a first distribution; and applying a second noise step to the first data stream to generate a second data stream with a second distribution, wherein a first variation between the featured distribution and the first distribution is greater than a second variation between the featured distribution and the second distribution.
  • Another embodiment of the present invention provides a data collection and analysis device. The device includes a first processor and a second processor. The first processor applies a first noise step to an original data stream with an original character to generate a first data stream with a first character; and the second processor applies a second noise step to the first data stream to generate a second data stream with a second character, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.
  • Another embodiment of the present invention provides a data collection and analysis device. The device includes a true random number generator and a processor unit. The true random number generator generates a plurality of random numbers without the need of a seed. The processor unit, based on the plurality of random numbers, de-identifies identification information in an original data stream with an original character and generates a second data stream with a second character. The second character is substantially similar to the original character.
  • In the present invention, true random number generators (TRNGs) should be considered as a fundamental building block of privacy-preserving mechanisms. TRNGs are implemented in hardware and generate a sequence of random numbers using a nondeterministic physical event, such as magnetization variation of a ferromagnetic layer or flux transmission in a single flux quantum device. The initial state in TRNGs is truly unknown, while the initial state in PRNGs/CSPRNGs must be manually kept secret. However, the major drawback of TRNGs is scalability, which is especially important for the Internet of Things (IoT), expected to handle a growing amount of data securely. A magnetic tunnel junction (MTJ) can be referred to as a spintronics-based TRNG. Binary random bits are generated by using the stochastic nature of spin-transfer-torque (STT) switching in MTJs. Owing to the scalability of spin-torque switching [12], the MTJ can operate as a scalable TRNG and can be integrated on a chip with high density and low power consumption.
  • For achieving the purpose of analyzing data with high accuracy and strong privacy, the intuitive way is to shuffle the primitive data while ensuring the randomness inside the algorithms through a series of elaborated encoding techniques and analysis mechanisms.
  • Motivated by this, the present invention establishes the spintronics-based private aggregatable randomized response (SPARR), an advanced data collection and analysis mechanism that conforms to the differential privacy requirements by using a multilayer randomized response based on a set of MTJs. To the best of our knowledge, the integration of multilayer randomized responses with spin-electronic physical events to enhance data utility and privacy for practical applications has not been developed.
  • The main contributions of the present invention include: 1) the present invention proposes the multilayer randomized response mechanism, which can significantly improve the accuracy of data analysis while satisfying the definition of local differential privacy; 2) the present invention leverages a set of MTJs as a TRNG to generate unpredictable random bits and designs an approach to convert the random bits into random numbers between 0 and 1, in which the TRNG can be seamlessly integrated with the multilayer randomized response mechanism and used to strengthen the randomness of the algorithm's outputs; 3) the present invention evaluates the method through a sequence of experiments in both simulated and real-world environments to verify that the method outperforms prior works.
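Contribution 2 above (converting TRNG output bits into random numbers between 0 and 1) can be sketched as follows. This is a hedged illustration, not the patent's circuit: `os.urandom` merely stands in for the MTJ bit source, and interpreting a fixed number of bits as a binary fraction is one common way to map raw bits into [0, 1).

```python
# Illustrative sketch: mapping a stream of raw random bits into
# uniform numbers in [0, 1). os.urandom is a stand-in bit source;
# a real device would read MTJ final states instead.

import os

def bits_from_bytes(data):
    """Unpack bytes into individual bits (most significant bit first)."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1

def bits_to_unit_interval(bits, precision=32):
    """Interpret `precision` consecutive bits as a binary fraction in [0, 1)."""
    value = 0
    for _ in range(precision):
        value = (value << 1) | next(bits)
    return value / (1 << precision)

bit_stream = bits_from_bytes(os.urandom(16))
r = bits_to_unit_interval(bit_stream)
assert 0.0 <= r < 1.0
```

Such a number can then drive a biased coin flip: the coin comes up heads with probability w exactly when r < w.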
  • These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a diagram illustrating a model of crowd sensing and collection for SPARR.
  • FIG. 2 is a diagram illustrating Hash encoding, the permanent randomized response (PRR), the instantaneous randomized response (IRR), and the synthetic randomized response (SRR).
  • FIG. 3 is a diagram illustrating the original data stream having the featured distribution, the first data stream having the first distribution, and the second data stream having the second distribution.
  • FIG. 4 is a diagram illustrating the multilayer randomized response from the perspective of conditional probability.
  • FIGS. 5A-5C are diagrams illustrating comparison of the false negative rate, total variation distance, and allocated mass in varying k, m, and N.
  • FIGS. 6A-6C are diagrams illustrating comparison of the false negative rate, total variation distance, and allocated mass in varying ε.
  • FIGS. 7A-7C are diagrams illustrating population of client-side strings reconstructed by SPARR and RAPPOR when using (a) normal distribution, (b) zipf1 distribution, and (c) exponential distribution at ε=4, respectively.
  • FIG. 8 is a diagram illustrating comparison of SPARR and RAPPOR when using Kosarak under different ε.
  • FIG. 9 is a diagram illustrating population of client-side strings reconstructed by SPARR and RAPPOR when using Kosarak dataset at ε=4.
  • DETAILED DESCRIPTION
  • In the present invention, SPARR mainly focuses on two aspects that are distinct from the aforementioned schemes: (1) the present invention employs a set of MTJs as the spintronics-based TRNG, which can provide rigorous privacy protection; (2) the present invention proposes a multilayer randomized response mechanism to protect data privacy and improve data utility, and uses the false negative rate, the total variation distance, and the allocated mass as metrics to prove that SPARR can achieve significantly more favorable performance than prior works.
  • In this section, the definition of SPARR, including the system model, the attack model, and notations are formulated and described in detail.
  • A. System Model
  • The present invention considers a model composed of unconditionally trusted clients 102 (data generation), semi-trusted storage servers 104 (data collection), and analysts 106 (data analysis), as shown in FIG. 1. Without loss of generality, the authorization between clients and analysts is appropriately conducted off-line or on-line. However, the authorization is out of the scope of the present invention. More details regarding the authorization of clients can be found in [19].
  • As depicted in FIG. 1, storage servers can collect sanitized values and strings transmitted from large numbers of clients. Moreover, analysts are permitted to compute statistics on these client-side sanitized values and strings, such as histograms, frequencies, and other statistical indicators, for finding their app preferences, historical activities, or other information. For any given sanitized value or string reported, SPARR can guarantee strong plausible deniability for the reporting client through a sequence of encoding steps, as measured by an ε-differential privacy bound. In doing so, SPARR strictly limits the private information leaked.
  • B. Attack Model
  • In crowd sensing and collection modes, client-side private data can be disclosed in many ways. Assuming that storage servers and analysts are honest-but-curious, they may leak private information unintentionally by publishing data analyses or may violate privacy intentionally by gathering sensitive data. There are several attack types; for example, an attacker may poach data stored on servers or attempt to eavesdrop on communication between clients and servers. To remedy these attacks, the present invention adopts a local privacy-preserving mechanism that is implemented on each client and sanitizes any information before it is outsourced by the client. A local privacy-preserving mechanism satisfying the definition of ε-differential privacy (called local differential privacy) can provide a rigorous privacy guarantee, regardless of the attackers' background knowledge.
  • C. Notations
  • Please refer to FIG. 2 . FIG. 2 is a diagram illustrating Hash encoding, permanent randomized response (PRR), instantaneous randomized response (IRR), and synthetic randomized response (SRR), wherein detailed descriptions of FIG. 2 are as follows.
  • In the present invention, N represents the number of reports. m represents the number of cohorts. h represents the number of hash functions (that is, Hash encoding), wherein a pre-processor receives an input data stream (e.g. the client-side private data stream) and applies Hash encoding to the input data stream to generate an original data stream with an original character (e.g. positions of "1" of the original data stream shown in FIG. 2), wherein the input data stream corresponds to a plurality of users. In addition, the pre-processor can be a field programmable gate array (FPGA) with the above mentioned functions of the pre-processor, or an application-specific integrated circuit (ASIC) with the above mentioned functions of the pre-processor, or a software module with the above mentioned functions of the pre-processor. k represents the size of the Bloom filter. p, q, and f represent probability parameters for the degree of data privacy. bi, bi′, si, and si′ represent the resultant bits of the Bloom filter, PRR, IRR, and SRR, respectively. ε represents the privacy budget of differential privacy. q′ represents the probability of generating 1 in the report si′ if the Bloom filter bit bi is set to 1. p′ represents the probability of generating 1 in the report si′ if the Bloom filter bit bi is set to 0. A represents the number of unique client-side strings.
  • Preliminaries
  • In this section, the present invention briefly describes the definitions of differential privacy and randomized response.
  • A. Differential Privacy and Randomized Response
  • The concept of differential privacy [6] ensures that the outputs of certain mechanisms have almost the same probability to appear. In other words, the presence or absence of any individual in the dataset will never significantly influence the outputs.
  • Suppose there is a universe 𝔻 that contains all the different elements. Conveniently, the present invention uses multisets of rows to represent a dataset D, which can be seen as a collection of elements in 𝔻 and is held by a trusted curator. Then the present invention can use the Hamming distance to measure the difference between any two datasets D1 and D2, which is denoted by H(D1; D2). If H(D1; D2) = 1, then D1 and D2 are called neighboring datasets.
  • Formally, a randomized algorithm M is ε-differentially private if for all SM⊂Range (M) and the neighboring datasets D1 and D2 (shown in equation (1)).

  • Pr[M(D1) ∈ SM] ≤ e^ε × Pr[M(D2) ∈ SM]  (1)
  • As shown in equation (1), the probability is over the coin flips of the mechanism M, and ε is called the privacy budget and determines the extent of privacy leakage. A smaller ε will provide better privacy at the price of lower accuracy.
  • The local model of differential privacy [16], namely local differential privacy, considers a situation in which there is no trusted curator. Individuals hold their own private data and release it to curators in a differentially private manner. In this case, the dataset D will evolve into a sequence of client strings d, and the neighboring datasets D1 and D2 will also evolve into two distinct strings d1 and d2. Therefore, a local randomized algorithm M is ε-differentially private if for all SM⊂Range (M) and every pair of distinct strings d1 and d2 (shown in equation (2)).

  • Pr[M(d1) ∈ SM] ≤ e^ε × Pr[M(d2) ∈ SM]  (2)
  • As shown in equation (2), the probability of outputs is taken in terms of the coin flips of the algorithm M.
  • Randomized response [17] is a technique developed long before differential privacy. It uses secret coin flips as random events to generate the answers to sensitive questions, such as "Are you a homosexual?" An individual would answer this truthfully only if the coin is tails. Otherwise, the individual will flip a second coin to determine the fake answer, and respond "Yes" if heads and "No" if tails. Randomized response is a proven efficient mechanism that satisfies local differential privacy [18].
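Warner's two-coin scheme described above, together with the standard unbiased estimator for the true "Yes" proportion, can be sketched as follows. This is a minimal illustration: `random.random()` stands in for the secret coin flips (the patent would replace such PRNG draws with the spintronics-based TRNG), and the 30% true rate is an arbitrary example.

```python
# Minimal sketch of Warner's randomized response with two fair coins.

import random

def randomized_response(truth):
    """Answer truthfully on tails; otherwise answer by a second fair coin."""
    if random.random() < 0.5:          # coin 1 tails: answer truthfully
        return truth
    return random.random() < 0.5       # coin 2: "Yes" on heads, "No" on tails

def estimate_true_rate(responses):
    """With fair coins, P(Yes) = 0.5*pi + 0.25, so pi = 2*P(Yes) - 0.5."""
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5

random.seed(1)
truths = [random.random() < 0.3 for _ in range(200_000)]   # ~30% true "Yes"
reports = [randomized_response(t) for t in truths]
pi_hat = estimate_true_rate(reports)   # close to 0.3 despite the noise
```

Each individual report is deniable, yet the aggregate rate is still recoverable, which is the property the multilayer mechanism below builds on.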
  • SPARR System
  • SPARR includes two key elements, namely multilayer randomized response and spintronics-based encoding, to provide a high-utility truly randomized response with a rigorous data privacy guarantee.
  • A. Multilayer Randomized Response
  • The present invention interprets SPARR from the perspective of coin flips. Initially, each client side is permanently assigned to one of m cohorts, and each cohort uses a different group of h hash functions. For simplicity, the present invention considers the case that m=1 in this section (i.e., all client sides use the same group of hash functions). Then, the present invention hashes the client-side string v onto the k-size Bloom filter B. In this sequence, each bit bi in B will be reported after four rounds of perturbation determined by flipping specific coins. The present invention depicts the weight of each coin in Table I, in which probability parameters fall into the range of 0 to 1.
  • The first layer is called the permanent randomized response (PRR), which is similar to the initial randomized response in Section III-A. The result of PRR, bi′, is generated by coins 1 and 2, where the first coin is an unfair coin that comes up as heads with probability f. If the result of a coin flip is head, bi′ will be determined by the second coin with fair probability.
    TABLE I
    COIN FLIPS IN SPARR, WHERE f ∈ [0, 1), p ∈ (0, 1), q ∈ (0, 1), AND p ≠ q

    Bit string              Coin     Head                               Tail
    Bloom filter bit (bi)   —        —                                  —
    PRR (bi′)               Coin 1   f                                  1 − f
                            Coin 2   1/2                                1/2
    IRR (si)                Coin 3   q^(bi′) · p^(1 − bi′)              (1 − q)^(bi′) · (1 − p)^(1 − bi′)
    SRR (si′)               Coin 4   (bi + bi′ + si)/3                  1 − (bi + bi′ + si)/3
  • Otherwise, the present invention will do nothing and let bi′ be the true value of bi. The second layer, the instantaneous randomized response (IRR), is created to protect longitudinal security [4], wherein a first processor can apply the permanent randomized response at least one time to the original data stream based on a first random number set generated by a true random number generator to generate a temporal data stream (shown in FIG. 2 ), and apply the instantaneous randomized response at least one time to the temporal data stream based on a second random number set generated by the true random number generator to generate a first data stream with a first character (shown in FIG. 2 ). In addition, the permanent randomized response and the instantaneous randomized response are included in a first noise step, and identification information in the original data stream is de-identified after the first processor applies the first noise step to the original data stream. In addition, the first processor can be a field programmable gate array with the above mentioned functions of the first processor, or an application-specific integrated circuit with the above mentioned functions of the first processor, or a software module with the above mentioned functions of the first processor.
  • The result of IRR, si, is generated by coin 3. Notably, bi′ will affect the weight of coin 3. If bi′=1, the probability of head is q; otherwise, the probability of head is p. In fact, these two layers can guarantee the data privacy but lose information so that later data analysis is inaccurate.
  • An intuitive way to improve the accuracy of data analysis is to retain more features from the primitive data without compromising data privacy. The last layer, the synthetic randomized response (SRR), is constructed in SPARR on the basis of PRR and IRR to strengthen the features of bi that are kept in si′ while preserving the randomness of the results, wherein a second processor applies the synthetic randomized response at least one time to the first data stream based on a third random number set generated by the true random number generator to generate a second data stream with a second character (shown in FIG. 2), and the synthetic randomized response is included in a second noise step. In addition, a first variation between the original character and the first character is greater than a second variation between the original character and the second character (shown in FIG. 2).
  • That is, the synthetic randomized response can recover and intensify the original character to make the second data stream approach the original data stream. For example, positions of "1" in the original data stream are similar to positions of "1" in the second data stream. The present invention therefore employs si′ to efficiently reconstruct the client-side strings, even when these strings have a low frequency. SRR operates the last coin, as shown in Table I, wherein a function of SRR is to reduce the shift caused by PRR and IRR. The present invention designs the weight of this coin through synthetic consideration of the values of bi, bi′, and si. The more frequently 1s occur in bi, bi′, and si, the higher the probability that the coin will be heads. For example, if two of the three are 1s, then the probability of heads will be ⅔. In addition, the hash function, PRR, IRR, and SRR are executed on the client sides. In addition, an output circuit can output the original data stream, the first data stream, and the second data stream to server(s) on the Internet.
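Under the coin weights of Table I, the per-bit PRR → IRR → SRR pipeline can be sketched as below. This is a simplified illustration, not the patent's implementation: `random.random()` stands in for the spintronics-based TRNG, and the parameter values f, p, q are arbitrary choices within the allowed ranges.

```python
# Illustrative per-bit SPARR perturbation following Table I's coin weights.

import random

def perturb_bit(b, f=0.5, p=0.25, q=0.75, rng=random.random):
    # PRR (coins 1 and 2): with probability f, replace b with a fair coin.
    if rng() < f:
        b_prime = 1 if rng() < 0.5 else 0
    else:
        b_prime = b
    # IRR (coin 3): report 1 with probability q if b' = 1, else p.
    s = 1 if rng() < (q if b_prime == 1 else p) else 0
    # SRR (coin 4): heads (report 1) with probability (b + b' + s) / 3.
    s_prime = 1 if rng() < (b + b_prime + s) / 3 else 0
    return s_prime

random.seed(7)
# For b = 1, the empirical rate of 1s should approach q' from Lemma 1,
# i.e. (1/3)[2 + q - (f/2)(1 - p + q)] ~= 0.792 for these parameters.
reports = [perturb_bit(1) for _ in range(100_000)]
```

Note how SRR reuses b alongside b′ and s, which is what pulls the final report back toward the original bit while every coin flip stays random.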
  • Please refer to FIG. 3 . FIG. 3 is a diagram illustrating the original data stream having the featured distribution, the first data stream having the first distribution, and the second data stream having the second distribution. In another embodiment of the present invention, after the pre-processor utilizes Hash encoding to input data streams (e.g. client-side private data streams) to generate the original data stream, the original data stream has the featured distribution; after the first processor applies the permanent randomized response to the original data stream and applies the instantaneous randomized response to the temporal data stream to generate the first data stream, the first data stream has the first distribution; and after the second processor applies the synthetic randomized response to the first data stream to generate the second data stream, the second data stream has the second distribution. Therefore, when the server(s) on the Internet receives original data streams, first data streams, and second data streams, the server(s) can plot FIG. 3 according to featured distribution of the original data streams, first distribution of the first data streams, and second distribution of the second data streams, wherein as shown in FIG. 3 , a third variation between the featured distribution and the first distribution is greater than a fourth variation between the featured distribution and the second distribution.
  • In addition, in another embodiment of the present invention, a data collection and analysis device includes a true random number generator, a processor unit, a pre-processor, and an output circuit, wherein the processor unit includes a first processor and a second processor. The true random number generator can generate a plurality of random number without the need of a seed (e.g. a first random number set, a second random number set, and a third random number set). The pre-processor receives an input data stream (e.g. the client-side private data stream) and utilizes Hash encoding to the input data stream to generate an original data stream with an original character (e.g. positions of “1” of the original data stream shown in FIG. 2 ). After the original data stream is generated, the first processor can apply the first noise step (PRR and IRR) to the original data stream based on the first random number set and the second random number set to de-identify the original data stream to generate a first data stream with a first character; and after the first data stream is generated, the second processor can apply the second noise step (SRR) to the first data stream based on the third random number set to generate a second data stream with a second character, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character. In addition, the output circuit can output the second data stream to a remote server (on the Internet).
  • In summary, the present invention can quantitatively interpret SPARR from the perspective of conditional probability, as shown in FIG. 4 . Each round operates under the conditions bi=1 and bi=0. As shown in FIG. 4 , for simplicity, {⋅} denotes probability P{si|bi}.
  • In addition, the present invention has lemmas 1 and 2 as follows:
  • LEMMA 1. When the Bloom filter bit bi is set to 1, the probability of generating 1 in the report si′ is given by equation (3)
  • q′ = P(si′ = 1 | bi = 1) = (1/3)[(f/2)(1 − p)] + (2/3)[(f/2)p + (1 − f/2)(1 − q)] + (1 − f/2)q = (1/3)[2 + q − (f/2)(1 − p + q)]  (3)
  • LEMMA 2. When the Bloom filter bit bi is set to 0, the probability of generating 1 in the report si′ is given by equation (4):
  • p = P(si′ = 1 | bi = 0) = (1/3)[(1 - f/2)p + (f/2)(1 - q)] + (2/3)[(f/2)q] = (1/3)[p + (f/2)(1 - p + q)]  (4)
  • Lemmas 1 and 2 can be verified from FIG. 4 .
  • To decode the collection si′ for aggregators, the number of times ti required for reconstructing the exact bit bi in the Bloom filter B must be estimated. Let ci be the number of times that each bit si′ is set in N reports. The expectation of ci is then given by equation (5):
  • E(ci) = q·ti + p(N - ti), where ti = (ci - pN)/(q - p).  (5)
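  • The decoding step in equation (5) can be sketched as a one-line estimator. This is a minimal illustration (the function name is hypothetical), assuming q and p are the report probabilities from Lemmas 1 and 2:

```python
def estimate_true_count(c_i: float, N: int, p: float, q: float) -> float:
    """Solve E(c_i) = q*t_i + p*(N - t_i) for t_i, using the observed count c_i
    as an estimate of E(c_i), per equation (5)."""
    return (c_i - p * N) / (q - p)
```

For example, with q = 0.75, p = 0.25, and N = 1,000 reports of which 400 truly have bi = 1, the expected count is 0.75·400 + 0.25·600 = 450, and the estimator recovers 400 from ci = 450.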
  • B. Spintronics-Based Encoding
  • As the present invention states in the formal definition of randomized response, the probability of the outputs is based on the coin flips in algorithm M. In other words, the results of the coin flips can be considered a random bit string in M. To guarantee the randomness of the bit strings, a traditional PRNG/CSPRNG can be superseded by a TRNG.
  • In the present invention, a set of MTJs is adopted as a TRNG, which is viewed as a spintronics-based TRNG. The operation of controlling an MTJ to generate random bits is as follows. There are two states for an MTJ [12][3]: anti-parallel (AP) and parallel (P), which are assigned to the binary values "0" and "1", respectively. Because the MTJ does not need an initial state (that is, a seed), it can generate random numbers without a seed, preventing privacy leaks caused by the periodicity problem of a seed. When a current pulse is injected into an MTJ to switch the magnetization in the free layer by spin-transfer torque, the free-layer magnetization of the MTJ is excited to a bifurcation point by the excitation pulse. At the bifurcation point, thermal agitation can cause a small random deviation of magnetization. The magnetization then relaxes to the AP or P state with the same probability of 50%. Eventually, the present invention determines whether the final state is AP or P by measuring the resistance, and thus a random bit is obtained.
  • To generate sufficient randomness for bit strings, the present invention leverages eight subsystems independently to generate random bit strings Ri using the stochastic spin-torque switching in MTJs. Subsequently, three rounds of exclusive OR operations are executed to generate the final result, shown in equation (6):
  • XOR3 = [(R1 ⊕ R2) ⊕ (R3 ⊕ R4)] ⊕ [(R5 ⊕ R6) ⊕ (R7 ⊕ R8)]  (6)
  • As shown in equation (6), XOR3 denotes the final result of the random bit string and ⊕ denotes an exclusive OR operation. Notably, the MTJ is an emerging magnetic device with properties of high endurance, low power consumption, and rapid access. Moreover, it is easily integrated into many devices, such as those in the IoT. In particular, the MTJ is a material used in Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM). STT-MRAM, a non-volatile memory, has the potential to become a leading storage technology, as it is a high-performance memory that can challenge DRAM, SRAM, and low-cost flash memory. These potential advantages of MTJs are the reasons the present invention adopts them as a component.
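  • The three rounds of exclusive OR in equation (6) can be sketched as follows; this is a minimal Python illustration, where the eight input bits stand in for one bit drawn from each MTJ subsystem:

```python
def xor3(r: list) -> int:
    """Combine eight independent random bits with three rounds of XOR, per equation (6)."""
    assert len(r) == 8
    round1 = [r[i] ^ r[i + 1] for i in range(0, 8, 2)]       # (R1^R2), (R3^R4), (R5^R6), (R7^R8)
    round2 = [round1[0] ^ round1[1], round1[2] ^ round1[3]]  # pairwise XOR of round-1 results
    return round2[0] ^ round2[1]                             # final XOR3 bit
```

Note that the three rounds are equivalent to taking the parity of all eight bits; XOR-ing independent sources in this way whitens any residual bias of an individual MTJ.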
  • As mentioned for the multilayer randomized response mechanism, one of the most critical procedures for data protection is IRR, which avoids the risk of privacy leakage under repeated data collection by generating a different report every time based on the fixed results of PRR. Therefore, the randomness of IRR determines the performance of longitudinal privacy protection. The present invention introduces a set of MTJs as TRNG( ), which is based on nondeterministic physical events. However, MTJs can only generate random bits. Therefore, the present invention must design an approach to convert random bits into random numbers between 0 and 1.
  • Algorithm 1 shows the process that uses a set of MTJs to generate random numbers. The length of the random bit sequence l should be carefully selected because it decides the granularity of the random numbers. First, the present invention initializes eight MTJs and operates them independently to generate l random bits, followed by executing three rounds of exclusive OR operations to generate a binary bit sequence x=XOR3 (step 2 of Algorithm 1). Finally, the binary bit sequence x is converted to a random number x* by the equation x*=float(x/(2^l - 1)) (step 3 of Algorithm 1).
  • ALGORITHM 1: TRNG( )
  • Input: length of random bit sequence l ∈ ℕ
    Output: random number x* ∈ [0, 1]
    1 Initialize MTJs and generate l random bits;
    2 Execute three rounds of exclusive OR operations to generate a binary bit sequence x = XOR3;
    3 Convert x into a random number x*:
      x* = float(x/(2^l - 1));
    4 Return x*
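  • Algorithm 1 can be sketched as follows; this is a minimal Python illustration in which random.getrandbits stands in for the spintronics hardware, and the three XOR rounds are computed as the parity of the eight bits:

```python
import random

def trng(l: int = 16, bit_source=None) -> float:
    """Algorithm 1 sketch: build an l-bit sequence x from XOR3 outputs,
    then convert it to a random number x* = x / (2^l - 1) in [0, 1]."""
    if bit_source is None:
        bit_source = lambda: random.getrandbits(1)  # stand-in for one MTJ read
    x = 0
    for _ in range(l):
        bits = [bit_source() for _ in range(8)]     # eight independent subsystems
        parity = bits[0]
        for b in bits[1:]:
            parity ^= b                             # equivalent to three XOR rounds
        x = (x << 1) | parity
    return x / (2 ** l - 1)
```

The length l controls the granularity: this sketch can only return multiples of 1/(2^l - 1), so a larger l yields finer-grained random numbers.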
  • Algorithm 2 demonstrates randomized data encoding via TRNG( ) in SPARR. For each bit bi′, TRNG( ) is employed to generate a random number x* (step 1 of Algorithm 2), and then x* is compared with the probability q^(bi′) p^(1-bi′) (step 2 of Algorithm 2). If x* is less than q^(bi′) p^(1-bi′), si is set to 1; otherwise, si is set to 0 (steps 2-5 of Algorithm 2).
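  • Because q^(bi′) p^(1-bi′) equals q when bi′ = 1 and p when bi′ = 0, the comparison in this encoding step reduces to a single threshold test. A minimal Python sketch (the function name is illustrative, and random.random stands in for TRNG( )):

```python
import random

def encode_bit(b_prime: int, p: float, q: float, rng=random.random) -> int:
    """IRR-style encoding of one PRR output bit b_i':
    report 1 with probability q if b_i' = 1, and with probability p if b_i' = 0."""
    threshold = q if b_prime == 1 else p   # q^(b_i') * p^(1 - b_i')
    return 1 if rng() < threshold else 0
```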
  • ALGORITHM 2: Data Randomized Encoding via TRNG( )
  • Input: a resultant bit of PRR bi′, and probability parameters p and q
    Output: an encoded bit si
    1 x* = TRNG( );
    2 if x* < q^(bi′) p^(1-bi′) then
    3   set si = 1;
    4 else
    5   set si = 0;
    6 end
    7 Return si
  • SYSTEM ANALYSIS
  • A. Differential Privacy Guarantee
  • THEOREM 1. SPARR is an ε-differential privacy algorithm, wherein the definition of ε is given by equation (7):
  • ε = h · ln[q(1 - p)/(p(1 - q))]  (7)
  • Without loss of generality, the present invention supposes that va and vb are two distinct client-side strings, and their Bloom filter bits are set by equation (8):
  • Ba = {b1 = 1, . . . , bh = 1, bh+1 = 0, . . . , bk = 0},
    Bb = {b1 = 0, . . . , bh = 0, bh+1 = 1, . . . , b2h = 1, b2h+1 = 0, . . . , bk = 0}.  (8)
  • According to Lemmas 1 and 2, the present invention knows that si′ is a random variable with Bernoulli distribution, and the probability mass functions under different conditions are determined by equations (9)-(12):
  • P(si′ | bi = 1) = q^(si′) (1 - q)^(1 - si′) = { q, si′ = 1; 1 - q, si′ = 0 }  (9)
    And P(si′ | bi = 0) = p^(si′) (1 - p)^(1 - si′) = { p, si′ = 1; 1 - p, si′ = 0 }  (10)
    Then, P(s′ = sa′ | B = Ba) = ∏_{i=1..h} q^(si′) (1 - q)^(1 - si′) · ∏_{i=h+1..k} p^(si′) (1 - p)^(1 - si′)  (11)
    And P(s′ = sa′ | B = Bb) = ∏_{i=1..h} p^(si′) (1 - p)^(1 - si′) · ∏_{i=h+1..2h} q^(si′) (1 - q)^(1 - si′) · ∏_{i=2h+1..k} p^(si′) (1 - p)^(1 - si′)  (12)
  • Let RP be the ratio of the two conditional probabilities and S be all possible outputs of s′. Using the conclusions drawn from Observation 1 in [21], RP can be derived by equation (13):
  • RP = P(s′ ∈ S | B = Ba)/P(s′ ∈ S | B = Bb) = [Σ_{s′∈S} P(s′ = si′ | B = Ba)]/[Σ_{s′∈S} P(s′ = si′ | B = Bb)] ≤ max_{s′∈S} P(s′ = si′ | B = Ba)/P(s′ = si′ | B = Bb) = max_{s′∈S} { [q(1 - p)]^(s1′ + . . . + sh′ - sh+1′ - . . . - s2h′) · [p(1 - q)]^(-s1′ - . . . - sh′ + sh+1′ + . . . + s2h′) } = [q(1 - p)/(p(1 - q))]^h  (13)
  • The maximum in equation (13) is attained when s1′ = . . . = sh′ = 1 and sh+1′ = . . . = s2h′ = 0.
  • To satisfy the definition of differential privacy, the ratio RP must be bounded by e^ε. Therefore, the present invention can calculate the privacy budget ε by equation (7).
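  • Equation (7) can be evaluated directly; a minimal sketch (the function name is illustrative):

```python
import math

def privacy_budget(h: int, p: float, q: float) -> float:
    """Equation (7): epsilon = h * ln[ q(1 - p) / (p(1 - q)) ]."""
    return h * math.log(q * (1 - p) / (p * (1 - q)))
```

For example, with h = 2 hash functions and p = 0.25, q = 0.75, the ratio inside the logarithm is 9, giving ε = 2 ln 9 ≈ 4.39; as p and q approach each other, ε approaches 0, corresponding to stronger privacy at the cost of utility.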
  • B. Randomness Analysis of Numbers Generated by MTJs
  • Good random numbers should meet the requirement of unpredictability, meaning that the random numbers should not be periodic. Good random bits should also meet the requirement of uniformity, which means 0 and 1 should occur with roughly equal frequency. Accordingly, after obtaining random bits by triggering eight MTJs, the present invention uses a statistical test suite (NIST-SP800 [2]) to test the random bits used in the system. NIST-SP800 provides several types of statistical tests, which are detailed in Section 2 of [2].
  • TABLE II
    NIST TESTING RESULTS USING BITS GENERATED BY MTJS
    Statistical Test      Proportion of passing sequences   Success/Fail
    Frequency              987/1000                         Success
    Block Frequency       1000/1000                         Success
    Cumulative Sums        994/1000                         Success
    Runs                   986/1000                         Success
    Longest Run           1000/1000                         Success
    FFT                    995/1000                         Success
    Approximate Entropy   1000/1000                         Success
    Serial                 995/1000                         Success
  • Given the empirical results for the eight statistical tests, the present invention computes the proportion of passing sequences, as shown in Table II, to indicate whether the random bits passed each test. In Table II, when 1000 sequences (100 bits/sequence) are used as the test target, the minimum passing rate for each statistical test, with the exception of the random excursion (variant) test, is approximately 0.986, passing the NIST statistical test.
  • EXPERIMENTAL EVALUATION
  • In this section, the present invention will make a detailed comparison between RAPPOR and SPARR. Though Fanti et al. [11] proposed an extended version of RAPPOR, it focuses on estimating client-side strings without explicit dictionary knowledge. However, the accuracy of estimation in [11] is similar to or lower than that of RAPPOR. Therefore, the present invention does not compare SPARR with [11].
  • In Section VI-A, the present invention will introduce three metrics that are used to evaluate the effects of RAPPOR and SPARR. In Sections VI-B and VI-C, the present invention is evaluated using three simulated examples and one real-world collected example, respectively. The three simulated examples use normal, zipf1, and exponential distributions to demonstrate the impacts of varying ε, k, m, and N on RAPPOR and SPARR. The real-world collected example, which is based on the Kosarak dataset [1], is used to demonstrate the impact of varying ε on RAPPOR and SPARR.
  • A. Resultant Metrics
  • Suppose A is the actual number of unique client-side strings, and ai (i∈{1, 2, . . . , A}) is the proportion of each client-side string. Let Rr and Rs be the number of unique client-side strings reconstructed by RAPPOR and SPARR, respectively, and rri and rsi be the proportion of each reconstructed string.
  • Here, the present invention uses the false negative rate to analyze the extent to which RAPPOR and SPARR failed to find certain strings. For simplicity, FNr and FNs denote the false negative rates for RAPPOR and SPARR, respectively. Formally, FNr and FNs are defined by equation (14):
  • FNr = (A - Rr)/A, FNs = (A - Rs)/A  (14)
  • The total variation distance is a distance measure for two probability distributions. Informally, it is the largest possible difference between the probabilities that the two probability distributions can assign to the same event. In a finite probability space, the total variation distance is related to the l1 norm by an identity. For simplicity, TVr and TVs denote the total variation distances for RAPPOR and SPARR, respectively. Formally, TVr and TVs are defined by equation (15):
  • TVr = (1/2) Σi |ai - rri|, TVs = (1/2) Σi |ai - rsi|  (15)
  • As shown in equation (15), the factor 1/2 is a normalization term that limits the total variation distance to between 0 and 1.
  • The allocated mass is the total proportion of reconstructed strings. For simplicity, the present invention uses AMr and AMs to denote the allocated masses for RAPPOR and SPARR, respectively. Formally, AMr and AMs are defined by equation (16):
  • AMr = Σi rri, AMs = Σi rsi  (16)
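  • The three resultant metrics in equations (14)-(16) can be computed together; a minimal sketch over proportion dictionaries (the function name and dict-based interface are illustrative assumptions):

```python
def metrics(actual: dict, reconstructed: dict):
    """Compute the false negative rate (14), total variation distance (15),
    and allocated mass (16) for one reconstruction.
    Keys are client-side strings; values are their proportions."""
    A = len(actual)
    fn = (A - len(reconstructed)) / A                           # equation (14)
    tv = 0.5 * sum(abs(actual[s] - reconstructed.get(s, 0.0))   # equation (15)
                   for s in actual)
    am = sum(reconstructed.values())                            # equation (16)
    return fn, tv, am
```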
  • B. Simulation Results
  • After clarifying the resultant metrics, the present invention first compares SPARR with RAPPOR over a sequence of simulations and separates the experiments into two parts.
  • In the first part, the present invention varies the parameters k, m, and N, which influence the accuracy of SPARR and RAPPOR but do not affect the degree of privacy protection. More specifically, the present invention fixes ε=4, which is relatively loose for both mechanisms. Therefore, the present invention can faithfully observe the impact of these parameters on the accuracy of estimation. In the second part, the present invention sets k=8, m=56, and N=1,000,000, which were shown to be an optimal case for SPARR and RAPPOR in the first part. Then, the present invention varies ε from 1 to 4 by tuning the parameters h, f, p, and q, and applies different distributions to observe the impacts of different privacy degrees.
  • 1) The impacts of varying k, m, and N: the test cases and their experimental results are shown in Table III and FIGS. 5A-5C. Due to space limitations, the present invention only shows the key results based on the normal distribution, but this does not compromise generality.
  • TABLE III
    RESULTANT METRICS (FALSE NEGATIVE RATE, TOTAL VARIATION DISTANCE, AND ALLOCATED MASS) FOR SIMULATIONS UNDER DIFFERENT k, m, AND N
    (a)
    Test Case     Rr      Rs      ΔR       FNr          FNs          ΔFN           TVr          TVs          ΔTV           AMr          AMs          ΔAM
    k = 4         25      85      60       0.75         0.15         −0.6          0.46         0.11         −0.35         0.53         0.99         0.46
    k = 8         53      89      36       0.47         0.11         −0.36         0.19         0.06         −0.14         0.87         0.99         0.12
    k = 16        62      92      30       0.38         0.18         −0.3          0.12         0.06         −0.07         0.92         0.98         0.06
    k = 24        61      98      37       0.39         0.02         −0.37         0.11         0.05         −0.06         0.93         0.98         0.05
    k = 32        66      90      24       0.34         0.1          −0.24         0.1          0.05         −0.05         0.94         0.97         0.02
    Mean          53 ± 2  91 ± 1  37 ± 43  0.47 ± 0.02  0.09 ± 0.01  −0.37 ± 0.03  0.2 ± 0.01   0.07 ± 0.01  −0.13 ± 0.02  0.84 ± 0.03  0.98 ± 0.04  0.14 ± 0.07
    (b)
    Test Case     Rr      Rs      ΔR      FNr          FNs          ΔFN           TVr          TVs          ΔTV           AMr          AMs          ΔAM
    m = 16        46      75      29      0.54         0.25         −0.29         0.23         0.14         −0.1          0.79         0.98         0.19
    m = 24        44      87      43      0.56         0.13         −0.43         0.25         0.09         −0.16         0.77         0.99         0.22
    m = 32        48      89      41      0.52         0.11         −0.41         0.23         0.08         −0.15         0.82         0.99         0.18
    m = 40        57      92      35      0.43         0.08         −0.35         0.16         0.07         −0.10         0.89         0.99         0.10
    m = 48        58      92      34      0.42         0.08         −0.34         0.17         0.06         −0.11         0.90         0.99         0.09
    m = 56        54      91      37      0.46         0.09         −0.37         0.20         0.05         −0.14         0.86         0.99         0.12
    m = 64        51      93      42      0.49         0.07         −0.42         0.20         0.05         −0.15         0.87         0.99         0.12
    Mean          51 ± 2  88 ± 1  37 ± 3  0.49 ± 0.02  0.12 ± 0.01  −0.37 ± 0.03  0.21 ± 0.01  0.08 ± 0.00  −0.13 ± 0.01  0.84 ± 0.03  0.99 ± 0.04  0.15 ± 0.07
    (c)
    Test Case     Rr      Rs      ΔR      FNr          FNs          ΔFN           TVr          TVs          ΔTV           AMr          AMs          ΔAM
    N = 100000    19      69      50      0.81         0.31         −0.50         0.44         0.11         −0.32         0.49         0.94         0.45
    N = 250000    29      80      51      0.71         0.20         −0.51         0.37         0.08         −0.29         0.70         0.97         0.27
    N = 500000    41      88      47      0.59         0.12         −0.47         0.28         0.06         −0.22         0.77         0.98         0.22
    N = 750000    46      88      42      0.54         0.12         −0.42         0.21         0.05         −0.16         0.80         0.98         0.18
    N = 1000000   55      90      35      0.45         0.10         −0.35         0.17         0.05         −0.12         0.89         0.99         0.10
    Mean          38 ± 3  83 ± 1  45 ± 4  0.62 ± 0.03  0.17 ± 0.01  −0.45 ± 0.04  0.29 ± 0.01  0.07 ± 0.00  −0.22 ± 0.01  0.73 ± 0.03  0.97 ± 0.04  0.24 ± 0.07
  • In case (a), the present invention sets k varying from 4 to 32. Compared to RAPPOR, SPARR can reduce the false negative rate and the total variation distance by around 37% and 13% on average, respectively. Also, SPARR can increase the allocated mass by around 14% on average. In particular, the advantages of SPARR become more apparent as k gradually decreases. This means that SPARR can still achieve good accuracy of data prediction in a harsh network with less bandwidth.
  • The number of cohorts m will impact the collision probability of two strings in the Bloom filter. To guarantee accuracy, there is a trade-off between N and m. In case (b), when m varies from 16 to 64, SPARR can significantly reduce the false negative rate and the total variation distance by around 37% and 13% on average, respectively, while maintaining the allocated mass at approximately 1.
  • Case (c) demonstrates the relationship between the number of reconstructed strings and the number of reports N. Compared to RAPPOR, SPARR can significantly reduce the false negative rate and the total variation distance by around 45% and 22% on average, respectively, and increase the allocated mass by around 24% on average. This shows that SPARR can use a small amount of data to accurately estimate the distribution of unique client-side strings. Specifically, SPARR can be applied to general platforms even for small collections.
  • 2) The impact of varying ε: The present invention demonstrates the impact of varying ε from 1 to 4 for different distributions in Table IV and FIGS. 6A-6C while setting k=8, m=56, and N=1,000,000. Compared to RAPPOR, SPARR can improve the false negative rate, the total variation distance, and the allocated mass by 51%, 20%, and 18% on average, respectively, if the underlying distribution of the strings' frequencies is a normal distribution. When the underlying distribution is a zipf1 distribution, SPARR can improve these metrics by 67%, 16%, and 17% on average, respectively. When the underlying distribution is an exponential distribution, SPARR can improve them by 55%, 17%, and 15% on average, respectively. Apparently, SPARR outperforms RAPPOR on these metrics, regardless of the distribution.
  • TABLE IV
    RESULTANT METRICS (FALSE NEGATIVE RATE, TOTAL VARIATION DISTANCE, AND ALLOCATED MASS) FOR SIMULATIONS UNDER DIFFERENT ε
    (a) Normal Distribution
    Test Case   Rr      Rs      ΔR      FNr          FNs          ΔFN           TVr          TVs          ΔTV           AMr          AMs          ΔAM
    ε = 1       1       69      68      0.99         0.31         −0.68         0.56         0.17         −0.39         0.14         0.67         0.54
    ε = 1.5     5       71      66      0.95         0.29         −0.66         0.56         0.25         −0.31         0.27         0.50         0.23
    ε = 2       29      79      50      0.71         0.21         −0.50         0.36         0.20         −0.16         0.58         0.62         0.05
    ε = 2.5     34      84      50      0.66         0.16         −0.50         0.28         0.14         −0.14         0.69         0.75         0.06
    ε = 3       42      83      41      0.58         0.17         −0.41         0.22         0.10         −0.13         0.75         0.85         0.11
    ε = 3.5     41      91      50      0.59         0.09         −0.50         0.23         0.07         −0.15         0.74         0.93         0.18
    ε = 4       56      89      33      0.44         0.11         −0.33         0.19         0.06         −0.12         0.90         0.99         0.10
    Mean        30 ± 3  81 ± 1  51 ± 4  0.70 ± 0.03  0.19 ± 0.01  −0.51 ± 0.04  0.34 ± 0.01  0.14 ± 0.01  −0.20 ± 0.02  0.58 ± 0.02  0.76 ± 0.03  0.18 ± 0.05
    (b) Zipf1 Distribution
    Test Case   Rr      Rs      ΔR      FNr          FNs          ΔFN           TVr          TVs          ΔTV           AMr          AMs          ΔAM
    ε = 1       0       68      68      1.00         0.32         −0.68         0.50         0.18         −0.32         0.00         0.66         0.66
    ε = 1.5     5       79      74      0.95         0.21         −0.74         0.46         0.26         −0.21         0.37         0.50         0.13
    ε = 2       23      83      60      0.77         0.17         −0.60         0.27         0.20         −0.08         0.71         0.63         −0.08
    ε = 2.5     19      85      66      0.81         0.15         −0.66         0.28         0.16         −0.13         0.69         0.75         0.06
    ε = 3       28      95      67      0.72         0.05         −0.67         0.25         0.11         −0.14         0.77         0.86         0.08
    ε = 3.5     25      94      69      0.75         0.06         −0.69         0.23         0.09         −0.14         0.77         0.93         0.16
    ε = 4       31      93      62      0.69         0.07         −0.62         0.19         0.09         −0.10         0.79         0.99         0.20
    Mean        19 ± 3  85 ± 1  66 ± 4  0.81 ± 0.03  0.15 ± 0.01  −0.67 ± 0.04  0.31 ± 0.01  0.15 ± 0.01  −0.16 ± 0.02  0.59 ± 0.02  0.76 ± 0.03  0.17 ± 0.05
    (c) Exponential Distribution
    Test Case   Rr      Rs      ΔR      FNr          FNs          ΔFN           TVr          TVs          ΔTV           AMr          AMs          ΔAM
    ε = 1       1       63      62      0.99         0.37         −0.62         0.55         0.18         −0.37         0.10         0.68         0.58
    ε = 1.5     5       70      65      0.95         0.30         −0.65         0.54         0.26         −0.28         0.33         0.50         0.17
    ε = 2       24      80      56      0.76         0.20         −0.56         0.29         0.19         −0.10         0.69         0.64         −0.06
    ε = 2.5     27      79      52      0.73         0.21         −0.52         0.30         0.15         −0.15         0.70         0.74         0.05
    ε = 3       39      87      48      0.61         0.13         −0.48         0.22         0.12         −0.10         0.82         0.85         0.03
    ε = 3.5     36      86      50      0.64         0.14         −0.50         0.18         0.08         −0.09         0.79         0.93         0.14
    ε = 4       38      87      49      0.62         0.13         −0.49         0.16         0.07         −0.08         0.82         0.99         0.18
    Mean        24 ± 2  79 ± 1  55 ± 4  0.76 ± 0.03  0.21 ± 0.01  −0.55 ± 0.04  0.32 ± 0.01  0.15 ± 0.01  −0.17 ± 0.02  0.61 ± 0.03  0.76 ± 0.03  0.15 ± 0.06
  • More intuitively, the present invention evaluates the population of the client-side strings in three distributions (i.e., normal, zipf1, and exponential), with their true frequencies on the vertical axis, by comparing SPARR with RAPPOR, as demonstrated in FIGS. 7A-7C with k=32, m=64, and N=1,000,000 at ε=4. Notably, for fairly comparing SPARR with RAPPOR, the present invention selects ε=4, which is the optimal case for SPARR and RAPPOR in this evaluation. According to FIGS. 7A-7C, SPARR significantly improves the detection of client-side strings with low frequencies compared with RAPPOR, while maintaining high reconstruction of collected strings.
  • C. Real-World Results
  • In addition to the simulated data, the present invention also runs SPARR and RAPPOR on a real-world dataset. Specifically, this dataset is from the "Frequent Itemset Mining Dataset Repository", called Kosarak, which is provided by Ferenc Bodon [1]. Kosarak records about 990,000 reports of click actions involving 41,270 different pages, and webmasters may want to know the popularity of each page through the estimation of clicks. Without loss of generality, the present invention only considers the 100 most visited pages. Similar to the settings in the previous section, the present invention fixes k=8 and m=56, and varies ε from 1 to 4, experimenting at intervals of 0.5.
  • Due to the limitations of randomized response and statistical inference, the present invention still needs a large number of reports to find the unique pages and their clicks. This is also the trade-off between privacy and utility mentioned in the related literature [7][13]. However, as demonstrated later, the present invention can achieve better privacy while recovering more pages that have a lower click-through rate (CTR).
  • The experimental results are shown in Table V and plotted in FIG. 8 . It is clearly seen that under the same ε, SPARR has a lower false negative rate and total variation distance than RAPPOR while only slightly sacrificing allocated mass. The advantages of SPARR become more apparent as ε decreases. This is even more apparent in FIG. 9 , which shows the population of client-side strings reconstructed by SPARR and RAPPOR when using the Kosarak dataset at ε=4. It is worth noting that while focusing on high-CTR pages, the present invention should not overlook websites that have vital meaning but are in the long tail, such as those for specialized topics or for specific groups of people. It can be seen that SPARR is better than RAPPOR in fairness, since it can recover almost all the pages independent of CTR.
  • TABLE V
    RESULTANT METRICS (FALSE NEGATIVE RATE, TOTAL VARIATION DISTANCE, AND ALLOCATED MASS) FOR REAL-WORLD CASE UNDER DIFFERENT ε
    Test Case   Rr      Rs      ΔR      FNr          FNs          ΔFN           TVr          TVs          ΔTV           AMr          AMs          ΔAM
    ε = 1       3       89      86      0.97         0.11         −0.86         0.34         0.26         −0.08         0.31         0.47         0.16
    ε = 1.5     8       94      86      0.92         0.06         −0.86         0.30         0.25         −0.05         0.54         0.51         −0.03
    ε = 2       26      95      69      0.74         0.05         −0.69         0.18         0.18         0.00          0.78         0.64         −0.14
    ε = 2.5     32      97      65      0.68         0.03         −0.65         0.20         0.14         −0.06         0.80         0.76         −0.04
    ε = 3       43      97      54      0.57         0.03         −0.54         0.15         0.11         −0.04         0.86         0.85         −0.01
    ε = 3.5     39      97      58      0.61         0.03         −0.58         0.17         0.09         −0.08         0.84         0.92         0.08
    ε = 4       53      95      42      0.47         0.05         −0.42         0.12         0.09         −0.03         0.91         0.98         0.07
    Mean        29 ± 3  95 ± 1  66 ± 4  0.71 ± 0.02  0.05 ± 0.01  −0.66 ± 0.03  0.21 ± 0.01  0.16 ± 0.01  −0.05 ± 0.02  0.72 ± 0.03  0.73 ± 0.03  0.01 ± 0.06
  • CONCLUSIONS
  • SPARR is a practical data protection mechanism, based on physical events from MTJs, for crowdsourced data collection with a high-utility and mathematically rigorous privacy guarantee. It employs a set of MTJs as a spintronics-based TRNG to derive true random numbers. With the spintronics-based TRNG and the design of four coin flips, SPARR can preserve privacy, crowdsource population statistics on data collected from individuals, and accurately decode this data. In future work, the present invention will apply deep learning techniques for in-memory computing to improve the efficiency and accuracy of data analysis, and will be designed to adapt to most data analysis applications.
  • REFERENCES
    • [1] Kosarak. Available at http://fimi.ua.ac.be/data/.
    • [2] A. Rukhin, J. Soto, J. Nechvatal, M. Smid, E. Barker, S. Leigh, M. Levenson, M. Vangel, D. Banks, A. Heckert, J. Dray, and S. Vo, “A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications,” National Institute of Standards and Technology (NIST), Special Publication 800-22 Revision 1. Available at http://csrc.nist.gov/publications/PubsSPs.html, 2008.
    • [3] A. Fukushima, T. Seki, K. Yakushiji, H. Kubota, H. Imamura, S. Yuasa, and K. Ando, “Spindice: A Scalable Truly Random Number Generator Based on Spintronics,” in Journal of Applied Physics Express, vol. 7, no. 8, pp. 083001, 2014.
    • [4] B. Edwards, S. Hofmeyr, S. Forrest, and M. V. Eeten, “Analyzing and Modeling Longitudinal Security Data: Promise and Pitfalls,” in Proceedings of the 31st Annual Computer Security Applications Conference, pp. 391-400, 2015.
    • [5] B. H. Bloom, “Space/Time Trade-offs in Hash Coding with Allowable Errors,” Communications of the ACM, vol. 13, no. 7, pp. 422-426, 1970.
    • [6] C. Dwork, “Differential Privacy,” in Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, pp. 1-12, 2006.
    • [7] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating Noise to Sensitivity in Private Data Analysis,” in 3rd Theory of Cryptography Conference, pp. 265-284, 2006.
    • [8] C. Wang, K. Ren, S. Yu, and K. M. R. Urs, “Achieving Usable and Privacy assured Similarity Search over Outsourced Cloud Data,” in Proceedings of IEEE International Conference on Computer Communications, pp. 451-459, 2012.
    • [9] D. J. Bernstein, “ChaCha, a Variant of Salsa20.” Available at http://cr.yp.to/chacha.html, 2008.
    • [10] E. Stefanov, C. Papamanthou, and E. Shi, “Practical Dynamic Searchable Encryption with Small Leakage,” in Proceedings of Network Distribution System Security Symposium, 832-848, 2014.
    • [11] G. Fanti, V. Pihur, U. Erlingsson, “Building a RAPPOR with the Unknown: Privacy-Preserving Learning of Associations and Data Dictionaries,” in Proceedings on Privacy Enhancing Technologies, pp. 41-61, 2016.
    • [12] J. D. Harms, F. Ebrahimi, X. Yao, and J.-P. Wang, “SPICE Macromodel of Spin-Torque-Transfer-Operated Magnetic Tunnel Junctions,” in IEEE Transactions on Electron Devices, vol. 57, no. 7, pp. 1425-1430, 2010.
    • [13] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-Supervised Knowledge Transfer for Deep Learning from Private Training Data,” In Proceedings of the 5th International Conference on Learning Representations, to appear, 2017.
    • [14] R. Bassily, and A. Smith, “Local, Private, Efficient Protocols for Succinct Histograms,” in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 127-135, 2015.
    • [15] R. Chen, A. Reznichenko, P. Francis, and J. Gehrke, “Towards Statistical Queries over Distributed Private User Data,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp. 169-182, 2012.
    • [16] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What Can We Learn Privately?,” in SIAM Journal of Computing, vol. 40, no. 3, pp. 793-826, 2011.
    • [17] S. Warner, “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias,” in Journal of the American Statistical Association, vol. 60, no. 309, pp. 63-69, 1965.
    • [18] T-H. Chan, E. Shi, and D. Song, “Optimal Lower Bound for Differentially Private Multi-Party Aggregation,” in Proceedings of the 20th Annual European conference on Algorithms, pp. 277-288, 2012.
    • [19] T. Jung, X.-Y. Li, Z. Wan, and M. Wan, “Privacy preserving cloud data access with multi-authorities,” in Proceedings of IEEE International Conference on Computer Communications, pp. 2625-2633, 2013.
    • [20] T. Wang, J. Blocki, N. Li, and S. Jha, “Optimizing Locally Differentially Private Protocols,” in 26th USENIX Security Symposium, to appear, 2017.
    • [21] U. Erlingsson, V. Pihur, and A. Korolova, “RAPPOR: Randomized aggregatable privacy-preserving ordinal response,” In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pp. 1054-1067, 2014.
    • [22] Y. Sei and A. Ohsuga, “Differential Private Data Collection and Analysis Based on Randomized Multiple Dummies for Untrusted Mobile Crowdsensing,” in IEEE Transactions on Information Forensics and Security, vol. 12, no. 4, pp. 926-939, 2017.
    • [23] Z. Qin, Y. Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren, “Heavy Hitter Estimation over Set-Valued Data with Local Differential Privacy,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pp. 192-203, 2016.
  • Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.

Claims (15)

What is claimed is:
1. A data collection and analysis system comprising:
an input unit for receiving an original data stream with an original character;
an integrated circuit coupled to the input unit, for receiving the original data stream, generating a first data stream with a first character by applying a first noise to the original data stream, and generating a second data stream with a second character by applying a second noise to the first data stream,
wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.
2. The data collection and analysis system of claim 1, further comprising:
a true random number generator for generating a first random number set and a second random number set;
wherein the integrated circuit applies a permanent randomized response (PRR) to the original data stream based on the first random number set to generate a temporal data stream, and applies an instantaneous randomized response (IRR) to the temporal data stream based on the second random number set.
3. The data collection and analysis system of claim 2, wherein the true random number generator generates a third random number set and the integrated circuit applies a synthetic randomized response (SRR) to the first data stream based on the third random number set to generate the second data stream.
4. The data collection and analysis system of claim 1, wherein identification information in the original data stream is de-identified after applying the first noise to the original data stream.
5. The data collection and analysis system of claim 1, further comprising:
a pre-processor for receiving an input data stream and utilizing Hash encoding to the input data stream to generate the original data stream with the original character to the input unit.
6. A data collection and analysis device comprising:
a first processor applying a first noise step to an original data stream with an original character to generate a first data stream with a first character; and
a second processor applying a second noise step to the first data stream to generate a second data stream with a second character, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.
7. The data collection and analysis device of claim 6, further comprising:
a true random number generator generating a first random number set, a second random number set, and a third random number set;
wherein the first processor applies a permanent randomized response to the original data stream based on the first random number set to generate a temporal data stream, and applies an instantaneous randomized response to the temporal data stream based on the second random number set to generate the first data stream.
8. The data collection and analysis device of claim 7, wherein the second processor applies a synthetic randomized response to the first data stream based on the third random number set to generate the second data stream.
9. The data collection and analysis device of claim 6, wherein identification information in the original data stream is de-identified after the first processor applies the first noise step to the original data stream.
10. The data collection and analysis device of claim 6, further comprising:
a pre-processor receiving an input data stream and applying Hash encoding to the input data stream to generate the original data stream with the original character; and
an output circuit outputting the second data stream.
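The two-stage noise mechanism recited in claims 6-8 resembles RAPPOR's permanent and instantaneous randomized responses (see the Erlingsson citation below). A minimal sketch under that assumption; the parameter names `f`, `p`, `q` follow RAPPOR's conventions rather than the patent's, and Python's seeded `random.Random` stands in for the claimed true random number generator:

```python
import random

def permanent_rr(bits, f, rng):
    """Permanent randomized response (first noise step): each bit is forced
    to 1 with probability f/2, forced to 0 with probability f/2, and kept
    with probability 1 - f."""
    out = []
    for b in bits:
        r = rng.random()
        if r < f / 2:
            out.append(1)
        elif r < f:
            out.append(0)
        else:
            out.append(b)
    return out

def instantaneous_rr(prr_bits, p, q, rng):
    """Instantaneous randomized response (second stage of the first noise
    step): report 1 with probability q when the permanent bit is 1, and
    with probability p when it is 0."""
    return [1 if rng.random() < (q if b else p) else 0 for b in prr_bits]

rng = random.Random(0)  # stand-in for the first and second random number sets
original = [1, 0, 1, 0, 1, 0, 1, 0]
temporal = permanent_rr(original, f=0.5, rng=rng)          # "temporal data stream"
report = instantaneous_rr(temporal, p=0.25, q=0.75, rng=rng)  # "first data stream"
```

In RAPPOR the permanent response is memoized per value so repeated collections cannot average the noise away; the instantaneous response is drawn fresh on every report.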
11. A data collection and analysis device comprising:
a true random number generator generating a plurality of random numbers without the need for a seed; and
a processor unit, based on the plurality of random numbers, de-identifying identification information in an original data stream with an original character and generating a second data stream with a second character;
wherein the second character is substantially similar to the original character.
12. The data collection and analysis device of claim 11, further comprising:
a pre-processor receiving an input data stream and applying Hash encoding to the input data stream to generate the original data stream with the original character; and
an output circuit outputting the second data stream to a remote server.
13. The data collection and analysis device of claim 11, wherein the plurality of random numbers comprises a first random number set and a second random number set, and the processor unit comprises a first processor; wherein the first processor applies a permanent randomized response to the original data stream based on the first random number set to generate a temporal data stream, and the first processor further applies an instantaneous randomized response to the temporal data stream based on the second random number set to generate a first data stream with a first character.
14. The data collection and analysis device of claim 13, wherein the plurality of random numbers further comprises a third random number set, and the processor unit further comprises a second processor; wherein the second processor applies a synthetic randomized response to the first data stream based on the third random number set to generate the second data stream with the second character.
15. The data collection and analysis device of claim 14, wherein a first variation between the original character and the first character is greater than a second variation between the original character and the second character.
US17/969,447 2018-03-01 2022-10-19 Data collection and analysis system and device Pending US20230060864A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/969,447 US20230060864A1 (en) 2018-03-01 2022-10-19 Data collection and analysis system and device

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862636857P 2018-03-01 2018-03-01
US16/286,627 US11514189B2 (en) 2018-03-01 2019-02-27 Data collection and analysis method and related device thereof
US17/969,447 US20230060864A1 (en) 2018-03-01 2022-10-19 Data collection and analysis system and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/286,627 Continuation US11514189B2 (en) 2018-03-01 2019-02-27 Data collection and analysis method and related device thereof

Publications (1)

Publication Number Publication Date
US20230060864A1 true US20230060864A1 (en) 2023-03-02

Family

ID=67767694

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/286,627 Active 2041-03-01 US11514189B2 (en) 2018-03-01 2019-02-27 Data collection and analysis method and related device thereof
US17/969,447 Pending US20230060864A1 (en) 2018-03-01 2022-10-19 Data collection and analysis system and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US16/286,627 Active 2041-03-01 US11514189B2 (en) 2018-03-01 2019-02-27 Data collection and analysis method and related device thereof

Country Status (3)

Country Link
US (2) US11514189B2 (en)
CN (2) CN110221809B (en)
TW (2) TWI799722B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110995736B (en) * 2019-12-13 2021-11-30 中国兵器装备集团自动化研究所有限公司 Universal industrial Internet of things equipment management system
US11676160B2 (en) 2020-02-11 2023-06-13 The Nielsen Company (Us), Llc Methods and apparatus to estimate cardinality of users represented in arbitrarily distributed bloom filters
US11741068B2 (en) * 2020-06-30 2023-08-29 The Nielsen Company (Us), Llc Methods and apparatus to estimate cardinality of users represented across multiple bloom filter arrays
CN112016047A (en) * 2020-07-24 2020-12-01 浙江工业大学 Heuristic data acquisition method and device based on evolutionary game, computer equipment and application thereof
US11755545B2 (en) 2020-07-31 2023-09-12 The Nielsen Company (Us), Llc Methods and apparatus to estimate audience measurement metrics based on users represented in bloom filter arrays
US11552724B1 (en) 2020-09-16 2023-01-10 Wells Fargo Bank, N.A. Artificial multispectral metadata generator
US11929992B2 (en) * 2021-03-31 2024-03-12 Sophos Limited Encrypted cache protection
WO2022225302A1 (en) * 2021-04-19 2022-10-27 서울대학교산학협력단 Method and server for estimating frequency distribution for location data
US20230017374A1 (en) * 2021-06-24 2023-01-19 Sap Se Secure multi-party computation of differentially private heavy hitters
US11854030B2 (en) 2021-06-29 2023-12-26 The Nielsen Company (Us), Llc Methods and apparatus to estimate cardinality across multiple datasets represented using bloom filter arrays
CN114614974B (en) * 2022-03-28 2023-01-03 云南电网有限责任公司信息中心 Privacy set intersection method, system and device for power grid data cross-industry sharing
TWI824927B (en) * 2023-01-17 2023-12-01 中華電信股份有限公司 Data synthesis system with differential privacy protection, method and computer readable medium thereof

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9020873B1 (en) * 2012-05-24 2015-04-28 The Travelers Indemnity Company Decision engine using a finite state machine for conducting randomized experiments
US9575160B1 (en) * 2016-04-25 2017-02-21 Uhnder, Inc. Vehicular radar sensing system utilizing high rate true random number generator
US20170243028A1 (en) * 2013-11-01 2017-08-24 Anonos Inc. Systems and Methods for Enhancing Data Protection by Anonosizing Structured and Unstructured Data and Incorporating Machine Learning and Artificial Intelligence in Classical and Quantum Computing Environments
US20180004978A1 (en) * 2016-06-29 2018-01-04 Sap Se Anonymization techniques to protect data
US20180189164A1 (en) * 2017-01-05 2018-07-05 Microsoft Technology Licensing, Llc Collection of sensitive data--such as software usage data or other telemetry data--over repeated collection cycles in satisfaction of privacy guarantees
US20190050599A1 (en) * 2016-02-09 2019-02-14 Orange Method and device for anonymizing data stored in a database
US20190236306A1 (en) * 2018-02-01 2019-08-01 Microsoft Technology Licensing, Llc Remote testing analysis for software optimization based on client-side local differential privacy-based data
US10691829B2 (en) * 2017-04-13 2020-06-23 Fujitsu Limited Privacy preservation

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI415315B (en) * 2009-04-07 2013-11-11 Univ Nat Changhua Education Racetrack nonvolatile memory manufacturing method and structure thereof
US8694856B2 (en) * 2009-08-14 2014-04-08 Intrinsic Id B.V. Physically unclonable function with tamper prevention and anti-aging system
US9778912B2 (en) * 2011-05-27 2017-10-03 Cassy Holdings Llc Stochastic processing of an information stream by a processing architecture generated by operation of non-deterministic data used to select data processing modules
TWI528217B (en) * 2014-07-02 2016-04-01 柯呈翰 A method and system for adding dynamic labels to a file and encrypting the file
CN105306194B (en) * 2014-07-22 2018-04-17 柯呈翰 For encrypted file and/or the multiple encryption method and system of communications protocol
CN104867138A (en) * 2015-05-07 2015-08-26 天津大学 Principal component analysis (PCA) and genetic algorithm (GA)-extreme learning machine (ELM)-based three-dimensional image quality objective evaluation method
IL239880B (en) * 2015-07-09 2018-08-30 Kaluzhny Uri Simplified montgomery multiplication
US10390220B2 (en) * 2016-06-02 2019-08-20 The Regents Of The University Of California Privacy-preserving stream analytics
US10229282B2 (en) * 2016-06-12 2019-03-12 Apple Inc. Efficient implementation for differential privacy using cryptographic functions
US10778633B2 (en) * 2016-09-23 2020-09-15 Apple Inc. Differential privacy for message text content mining
US10599867B2 (en) * 2017-06-04 2020-03-24 Apple Inc. User experience using privatized crowdsourced data
CN107358115B (en) * 2017-06-26 2019-09-20 浙江大学 It is a kind of consider practicability multiattribute data go privacy methods


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Úlfar Erlingsson et al., "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response," 25 Aug 2014, pp. 1-14 (Year: 2014) *

Also Published As

Publication number Publication date
US11514189B2 (en) 2022-11-29
TWI799722B (en) 2023-04-21
CN110221809A (en) 2019-09-10
TWI702505B (en) 2020-08-21
US20190272388A1 (en) 2019-09-05
TW202328939A (en) 2023-07-16
CN110221809B (en) 2023-12-29
TW202046138A (en) 2020-12-16
CN117724679A (en) 2024-03-19
TW201937389A (en) 2019-09-16

Similar Documents

Publication Publication Date Title
US20230060864A1 (en) Data collection and analysis system and device
McGregor et al. The limits of two-party differential privacy
Corrigan-Gibbs et al. Riposte: An anonymous messaging system handling millions of users
Chen Using algebraic signatures to check data possession in cloud storage
Ullman Answering n {2+ o (1)} counting queries with differential privacy is hard
Chen et al. Pseudorandom Number Generator Based on Three Kinds of Four‐Wing Memristive Hyperchaotic System and Its Application in Image Encryption
Yu et al. Chaos‐Based Engineering Applications with a 6D Memristive Multistable Hyperchaotic System and a 2D SF‐SIMM Hyperchaotic Map
Chatterjee et al. Theory and application of delay constraints in arbiter PUF
Huang et al. A new two‐dimensional mutual coupled logistic map and its application for pseudorandom number generator
Bitansky et al. On the complexity of collision resistant hash functions: New and old black-box separations
Bun et al. Separating computational and statistical differential privacy in the client-server model
Boneh et al. Arithmetic sketching
Arockiasamy et al. Beyond Statistical Analysis in Chaos‐Based CSPRNG Design
Watanabe et al. Bit security as computational cost for winning games with high probability
Tsou et al. SPARR: Spintronics-based private aggregatable randomized response for crowdsourced data collection and analysis
Lin et al. PPDCA: privacy-preserving crowdsensing data collection and analysis with randomized response
Rao et al. Secure two-party feature selection
Brunetta et al. Code-based zero knowledge PRF arguments
TWI840155B (en) Data collection and analysis device and method thereof
Zhang et al. Efficient Cloud-Based Private Set Intersection Protocol with Hidden Access Attribute and Integrity Verification.
Frisch et al. A Practical Approach to Estimate the Min-Entropy in PUFs
Yuan et al. Application of Blockchain Based on Fabric Consensus Network Model in Secure Encryption of Educational Information
Park et al. A lightweight BCH code corrector of trng with measurable dependence
Wang et al. Efficient transfer learning on modeling physical unclonable functions
Almishari et al. Privacy-preserving matching of community-contributed content

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER