CN115270176A - Radix estimation method, system, computing device and computer storage medium - Google Patents
- Publication number: CN115270176A
- Application number: CN202210866709.4A
- Authority
- CN
- China
- Prior art keywords
- data
- calculators
- determines
- estimation result
- data structure
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/06—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
- H04L9/0643—Hash functions, e.g. MD5, SHA, HMAC or f9 MAC
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0816—Key establishment, i.e. cryptographic processes or cryptographic protocols whereby a shared secret becomes available to two or more parties, for subsequent use
- H04L9/085—Secret sharing or secret splitting, e.g. threshold schemes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L2209/00—Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
- H04L2209/46—Secure multiparty computation, e.g. millionaire problem
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Bioethics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Data Mining & Analysis (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Computer Hardware Design (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Medical Informatics (AREA)
- Power Engineering (AREA)
- Storage Device Security (AREA)
Abstract
The embodiment of the invention discloses a cardinality estimation method, a cardinality estimation system, a computing device and a computer storage medium. The method comprises the following steps: multiple data providers send their respectively established local probabilistic data structures and their respectively generated random noise to multiple calculators; each of the calculators determines a target probabilistic data structure based on all the received local probabilistic data structures, and determines an initial estimation result based on the target probabilistic data structure and all the received random noise; any one of the calculators determines a target cardinality estimation result based on the initial estimation result. In the embodiment of the invention, multiple data providers provide random noise, and each calculator computes the cardinality estimation result based on the probabilistic data structure and all the random noise, so that no attacker can infer from the cardinality estimation result whether a given individual is in a set, thereby achieving differential privacy protection in cardinality estimation.
Description
Technical Field
The embodiment of the invention relates to the technical field of privacy-preserving computation, and in particular to a cardinality estimation method, a cardinality estimation system, a computing device and a computer storage medium.
Background
Distributed cardinality calculation, namely counting the number of distinct elements in the union of multiple data sets, is a fundamental and important problem. With the rapid development of the internet, privacy protection has received increasing attention.
In the related art, the cardinality is usually calculated based on a Bloom filter, but this scheme does not provide differential privacy: an attacker can infer from the result whether a certain user is in a set, so private data is leaked.
Therefore, how to achieve differential privacy protection in cardinality estimation is a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed in order to provide a cardinality estimation method, system, computing device and computer storage medium that overcome or at least partially solve the above-mentioned problems.
According to an aspect of an embodiment of the present invention, there is provided a cardinality estimation method including:
the multiple data providers send the respective established local probability data structures and the respective generated random noise to the multiple calculators;
each of the calculators determines a target probability data structure based on all the received local probability data structures, and determines an initial estimation result based on the target probability data structure and all the received random noise;
any one of the calculators determines a target cardinality estimation result based on the initial estimation result.
In an alternative, for each of the data providers, the local probabilistic data structure is determined by:
the data provider collects statistical data for cardinality estimation, and performs hash mapping on the statistical data to obtain a random bit string;
the data provider determines a data position corresponding to the established zero bit value probability data structure based on the random bit string;
and the data provider sets the data of the data position corresponding to the zero bit value probability data structure to be 1 to obtain the local probability data structure.
In an optional manner, for each data provider, the hash mapping the statistical data to obtain a random bit string includes:
and the data provider acquires a hash key, and performs hash mapping on the statistical data and the hash key to obtain the random bit string.
In an alternative, the local probabilistic data structure contains a set number of bit strings, each of which contains a one-dimensional bit vector of a set length.
In an alternative, the process of, for each of the data providers, the data provider sending a local probabilistic data structure and random noise to a plurality of the calculators comprises:
the data provider acquires data to be sent and a finite field, wherein the data to be sent is random noise or a single bit in the local probability data structure;
the data provider determines a plurality of secret sharing values corresponding to the data to be sent based on the finite field, wherein the number of the secret sharing values corresponding to the data to be sent is the same as the number of the computing parties;
and the data provider sends each secret sharing value to each calculator, wherein the secret sharing values received by different calculators are different.
In an optional manner, the determining, by the data provider, a plurality of secret sharing values corresponding to the data to be sent based on the finite field includes:
a plurality of computing parties respectively generate random number share values corresponding to the data to be sent;
the data provider receives each random number share value, determines a random number corresponding to the data to be sent based on each random number share value, determines a difference value between the data to be sent and the random number corresponding to the data to be sent, and determines a plurality of secret sharing values corresponding to the difference value based on the finite field.
In an optional manner, the determining, by any one of the calculators, a target cardinality estimation result based on the initial estimation result includes:
each computing party obtains a global key share corresponding to each computing party, determines a first verification result corresponding to the initial estimation result based on the initial estimation result and the global key share, and broadcasts the first verification result;
any one of the calculators receives first verification results of other calculators, and all the first verification results are added to obtain a second verification result;
if the second verification result is not equal to 0, all the data providers and all the calculators stop the cardinality estimation;
and if the second verification result is equal to 0, any one of the calculators determines a target cardinality estimation result based on the initial estimation result.
According to another aspect of the embodiments of the present invention, there is provided a cardinality estimation system including a plurality of data providers and a plurality of calculators;
the data providers are used for sending the respectively established local probability data structures and the respectively generated random noise to the calculators;
each of the calculators, configured to determine a target probability data structure based on all the received local probability data structures, and determine an initial estimation result based on the target probability data structure and all the received random noise;
any one of the calculators is further configured to determine a target cardinality estimation result based on the initial estimation result.
According to still another aspect of an embodiment of the present invention, there is provided a computing device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the cardinality estimation method.
According to yet another aspect of the embodiments of the present invention, there is provided a computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the above-mentioned cardinality estimation method.
According to the cardinality estimation method, the cardinality estimation system, the computing device and the computer storage medium provided by the embodiment of the invention, a plurality of data providers provide random noise, and each computing party computes a cardinality estimation result based on a probability data structure and all received random noise, so that any attacker has no way to deduce whether a certain individual is in a set or not through the cardinality estimation result, and differential privacy protection in cardinality estimation is realized.
The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the description, and so that the foregoing and other objects, features and advantages are more readily apparent, specific embodiments of the present invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a flow diagram of a cardinality estimation method according to one embodiment of the invention;
FIG. 2 illustrates a block diagram of a probabilistic data structure and a random bit string according to one embodiment of the present invention;
FIG. 3 shows a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In the solution provided by the embodiment of the present invention, each data provider collects the data to be counted by the calculators, such as IP addresses accessing certain websites, IDs of users who click an advertisement, or traffic information collected by a radio frequency identification system, and stores or transmits the data in the form of a probabilistic data structure, thereby reducing storage and resource consumption. In this embodiment of the present invention, a calculator may be a device with large computing resources, such as a cloud computing server, and a data provider may be a device with large data capacity, such as a data warehouse device; this is not limited here.
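FIG. 1 shows a flow diagram of a cardinality estimation method according to one embodiment of the invention. The method comprises the following steps:
Step 101: the multiple data providers send the respectively established local probabilistic data structures and the respectively generated random noise to the multiple calculators.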
In this embodiment, each data provider collects a local set of statistical data to build a local FMS sketch (the local probabilistic data structure) and secret-shares its local FMS sketch with all calculators. Each data provider also generates a discrete Gaussian random variable (the random noise) used to achieve differential privacy protection.
Illustratively, the present embodiment employs a distributed discrete Gaussian noise mechanism. Specifically, each data provider j generates a random number $N_j$ (i.e., the random noise) that follows the discrete Gaussian distribution $N_{\mathbb{Z}}(0, \sigma^2)$, where $\mathbb{Z}$ is the set of integers and $\sigma^2$ is the variance. Each data provider secret-shares its random noise with all calculators, and the calculators achieve differential privacy protection based on the noise sum $\sum_{j=1}^{c} N_j$ of the c data providers. Experiments show that the distributed discrete Gaussian noise mechanism achieves $\frac{1}{2}\varepsilon^2$-concentrated differential privacy, where $\varepsilon^2$ indicates the level of privacy protection.
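The noise generation described above can be illustrated with a minimal sketch. It assumes a simple truncated sampler (weights proportional to exp(-k^2 / (2σ^2)) on a bounded integer range) and Python's non-cryptographic `random` module; the function names `sample_discrete_gaussian` and `total_noise` and the `tail` cutoff are hypothetical and are not taken from the embodiment. A deployment would need an exact sampler and a secure randomness source.

```python
import math
import random


def sample_discrete_gaussian(sigma: float, rng: random.Random, tail: int = 12) -> int:
    """Draw one integer k with probability proportional to exp(-k^2 / (2 sigma^2)).

    Assumption: the support is truncated to |k| <= ceil(tail * sigma), so this is
    only an approximation of the discrete Gaussian over all integers.
    """
    bound = max(1, math.ceil(tail * sigma))
    support = list(range(-bound, bound + 1))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in support]
    return rng.choices(support, weights=weights, k=1)[0]


def total_noise(num_providers: int, sigma: float, seed: int = 0) -> int:
    """Sum of one independent noise value per data provider (in the clear)."""
    rng = random.Random(seed)
    return sum(sample_discrete_gaussian(sigma, rng) for _ in range(num_providers))
```

In the scheme itself the individual noise values are never summed in the clear; each one is secret-shared and the calculators add the shares.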
Optionally, in an embodiment, for each of the data providers, the local probabilistic data structure is determined by:
the data provider collects statistical data used for cardinality estimation, and performs hash mapping on the statistical data to obtain a random bit string;
the data provider determines a data position corresponding to the established zero bit value probability data structure based on the random bit string;
and the data provider sets the data of the data position corresponding to the zero bit value probability data structure to be 1 to obtain the local probability data structure.
It should be noted that the FMS sketch is a probabilistic data structure first proposed by the embodiments of the present invention. It can be used to count the cardinality of very large data sets, requires only a few kB of memory, and has a per-item update time complexity of only O(1), which is about 3 orders of magnitude lower than the O(m) of the FM sketch (m is usually in the thousands).
Optionally, the local probabilistic data structure includes a set number of bit strings, and each bit string includes a one-dimensional bit vector of a set length.
Illustratively, an FMS sketch $\{B_i, h\}_{i=1,\dots,m}$ contains m bit strings $B_1, \dots, B_m$, and h is the hash function used to update the FMS sketch, where $m = 2^r$ and r is a positive integer. Each bit string is a one-dimensional bit vector of length w, and the values of m and w can be negotiated among the data providers and calculators. Every bit of each data provider's local FMS sketch is initialized to 0. In addition, all data providers use the same hash function h() for their FMS sketches, which maps an element to a random bit string of length (r + w - 1).
The random bit string is used to update the FMS sketch. In the FMS sketch shown in FIG. 2, the parameters r and w are equal to 3 and 8, respectively; the variables $z_1$ to $z_m$ represent the number of 0 bits in each of the 8 bit strings numbered 0 to 7; h(x) denotes the hash mapping of data x; and 01011100111 is the generated random bit string. The rightmost r (i.e., 3) bits of the random bit string, binary 111, are converted to decimal to determine the number (i.e., 7) of the bit string to be updated in the FMS sketch. The index (i.e., 2) to be updated within that bit string is determined by the number of consecutive 0s at the right end of the remaining bits (here 01011100). This determines the data position in the zero-valued probabilistic data structure. If the bit at that data position in the FMS sketch is 0, it is set to 1; if it is already 1, it is left unchanged. This completes the update of the FMS sketch.
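The update rule of the FIG. 2 example can be sketched as follows. The helper names `locate` and `update` are hypothetical, and clamping the index to the last position when the remaining bits are all 0 is an assumption that the text does not spell out.

```python
from typing import List, Tuple


def locate(bits: str, r: int) -> Tuple[int, int]:
    """Return (bit-string number, index inside that bit string) for a random bit string."""
    string_no = int(bits[-r:], 2)                 # rightmost r bits, read as a binary number
    rest = bits[:-r]                              # bits left after removing the rightmost r
    zeros = len(rest) - len(rest.rstrip("0"))     # consecutive 0s at the right end of the rest
    return string_no, zeros


def update(sketch: List[List[int]], bits: str, r: int) -> None:
    """Set the addressed bit of the sketch to 1 (no change if it is already 1)."""
    w = len(sketch[0])
    i, l = locate(bits, r)
    l = min(l, w - 1)   # assumption: clamp when the remaining bits are all 0
    sketch[i][l] = 1
```

With r = 3 and the bit string 01011100111 from FIG. 2, `locate` yields bit string 7 and index 2, so an all-zero sketch of 8 bit strings of length 8 ends up with exactly one bit set.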
Optionally, for each data provider, the hash mapping the statistical data to obtain a random bit string includes:
and the data provider acquires a hash key, and performs hash mapping on the statistical data and the hash key to obtain the random bit string.
Illustratively, all data providers select a keyed hash and execute a verifiable group key protocol to obtain the hash key k. The hash mapping operation used to update the probabilistic data structure can then be implemented by the hash function $h(e) = H(k \,\|\, e)$, where $\|$ is the string concatenation operator and e represents a single element of the statistical data.
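A minimal sketch of the keyed mapping $H(k \,\|\, e)$ is given below. SHA-256 is an assumed stand-in for H (the embodiment does not name a concrete hash), and `keyed_bits` is a hypothetical helper that keeps the low (r + w - 1) bits of the digest as the random bit string.

```python
import hashlib


def keyed_bits(key: bytes, element: bytes, r: int, w: int) -> str:
    """Map an element to an (r + w - 1)-bit string using H(k || e) with H = SHA-256 (assumed)."""
    digest = hashlib.sha256(key + element).digest()   # H(k || e): hash of key followed by element
    n_bits = r + w - 1
    value = int.from_bytes(digest, "big") & ((1 << n_bits) - 1)   # keep the low n_bits bits
    return format(value, "0{}b".format(n_bits))       # zero-padded binary string
```

Because every data provider uses the same key k, equal elements held by different providers map to the same sketch position, which is what makes the union computation meaningful. Python's standard `hmac` module is a common alternative construction for keyed hashing.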
Optionally, in an embodiment, for each of the data providers, the process of sending the local probabilistic data structure and the random noise to the plurality of the calculators includes:
the data provider acquires data to be sent and a finite field, wherein the data to be sent is random noise or a single bit in the local probability data structure;
the data provider determines a plurality of secret sharing values corresponding to the data to be sent based on the finite field, wherein the number of the secret sharing values corresponding to the data to be sent is the same as the number of the calculators;
and the data provider sends each secret sharing value to each calculator, wherein the secret sharing values received by different calculators are different.
Illustratively, the data provider and the calculator jointly negotiate a finite field $\mathbb{F}_p$, where the modulus p is a (λ + τ)-bit prime, λ (for example 40) is the statistical security parameter, and τ (for example 32) is determined by the length of the plaintext field.
The setting of the secret-shared values corresponding to the data to be sent is explained by taking a single bit of the local probabilistic data structure as an example. Each secret-shared value of that bit belongs to the finite field $\mathbb{F}_p$, i.e., takes a value from 0 to p - 1, and the sum of all the secret-shared values modulo p equals the value of the bit.
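The additive sharing of a single bit described above can be sketched as follows. The modulus 2^61 - 1 is a toy stand-in chosen only because it is a well-known prime; it is not the (λ + τ)-bit prime of the embodiment, and `share` and `reconstruct` are hypothetical helper names.

```python
import secrets
from typing import List

P = 2**61 - 1  # toy prime modulus (assumption), standing in for the negotiated p


def share(value: int, n_parties: int, p: int = P) -> List[int]:
    """Split value into n_parties additive shares in [0, p-1] that sum to value mod p."""
    shares = [secrets.randbelow(p) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % p)   # last share fixes the sum
    return shares


def reconstruct(shares: List[int], p: int = P) -> int:
    """Recover the shared value as the sum of all shares modulo p."""
    return sum(shares) % p
```

Any subset of fewer than all shares is uniformly distributed and therefore reveals nothing about the bit.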
Further, after generating the local FMS sketch and the random noise, the data provider secret-shares the data to be sent with all the calculators in the form of secret-shared values. The secret sharing proceeds as follows:
First, under the SPDZ framework, an integer x is represented in shared form within a finite field (e.g., the finite field constructed from a modulus p, where p is a very large prime), as follows:
$$\langle x \rangle = \big((x_1, \dots, x_c),\ (\gamma(x)_1, \dots, \gamma(x)_c),\ (\Delta_1, \dots, \Delta_c)\big)$$
Here the three components of the triple are the additive shares of the data x to be sent, of its MAC, and of the MAC key, respectively. The MAC and the MAC key are used to verify the authenticity of x: if an adversary tampers with the data x to be sent, the tampering can be detected using the MAC and the MAC key.
Optionally, for each data provider, the determining, by the data provider, multiple secret sharing values corresponding to the data to be sent based on the finite field includes:
a plurality of computing parties respectively generate random number share values corresponding to the data to be sent;
the data provider receives each random number share value, determines a random number corresponding to the data to be sent based on each random number share value, determines a difference value between the data to be sent and the random number corresponding to the data to be sent, and determines a plurality of secret sharing values corresponding to the difference value based on the finite field.
It will be appreciated that the respective data providers and calculators may pre-negotiate key length, number of bit strings, and finite field size according to their requirements for security factor, privacy budget, and estimation accuracy.
Illustratively, all of the computing parties execute an offline protocol to pre-generate a sufficient number of random numbers for subsequent use. Each secret sharing uses a previously unused random number to prevent an adversary from breaking the encryption, and each computing party holds only one secret share of the random number, i.e., a random number share value. An adversary can obtain the true value of the random number only if it controls all the calculators.
To share data x of a data provider with the calculators, this embodiment adopts a secret sharing subprotocol. Specifically, all c computing parties jointly open a random number $\langle a \rangle = \big((a_1, \dots, a_c), (\gamma(a)_1, \dots, \gamma(a)_c), (\Delta_1, \dots, \Delta_c)\big)$ to the data provider, where $\gamma(a)_1, \dots, \gamma(a)_c$ and $\Delta_1, \dots, \Delta_c$ are the shares of the MAC and of the MAC key corresponding to the random number a, respectively. To do so, each computing party sends its own share $a_j$ to the data provider, which can then compute the actual value of a. After obtaining a, the data provider computes the difference (x - a) and broadcasts it to all the calculators.
The protocols used for generating random numbers in the offline phase may include Rand(), Rand2(), Triple() and RandExp(), which may be implemented by the SPDZ protocol.
Specifically, Rand() generates the secret representation $\langle a \rangle$ of a random integer, which is used in the addition of two secret numbers and in the addition of a secret number and a plaintext. Rand2() generates a secret representation of 0 or 1, where each of 0 and 1 has more than one possible secret representation. Triple() generates three secret numbers $\langle a \rangle$, $\langle b \rangle$, $\langle c \rangle$ with c = a × b, and may be used for the multiplication of secret numbers. RandExp() generates a series of secret numbers used in the zero-test protocol.
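How a triple with c = a × b supports the multiplication of secret numbers can be illustrated with the standard Beaver-triple technique. This is a sketch on additive shares only: MACs are omitted, all parties are simulated in one process, the modulus is a toy prime, and the helper names are hypothetical rather than taken from the embodiment.

```python
import secrets
from typing import List

P = 2**61 - 1  # toy prime modulus (assumption)


def share(value: int, n_parties: int) -> List[int]:
    """Additive shares of value modulo P."""
    parts = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    parts.append((value - sum(parts)) % P)
    return parts


def open_value(shares: List[int]) -> int:
    """Publicly reconstruct a shared value."""
    return sum(shares) % P


def beaver_multiply(x_sh, y_sh, a_sh, b_sh, c_sh) -> List[int]:
    """Return shares of x*y given shares of x, y and of a triple (a, b, c = a*b)."""
    d = open_value([(x - a) % P for x, a in zip(x_sh, a_sh)])   # d = x - a, made public
    e = open_value([(y - b) % P for y, b in zip(y_sh, b_sh)])   # e = y - b, made public
    # x*y = c + d*b + e*a + d*e; the first three terms are linear in the shares
    z_sh = [(c + d * b + e * a) % P for a, b, c in zip(a_sh, b_sh, c_sh)]
    z_sh[0] = (z_sh[0] + d * e) % P   # the public term d*e is added by one party only
    return z_sh
```

Since a and b are uniformly random and used once, the opened values d and e reveal nothing about x and y.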
As a possible implementation, the rule for sharing the data x to be sent to the calculator j is as follows:
It will be appreciated that the sharing process is secure, since the true value of the random number a is unknown to an adversary that controls at most c - 1 computing parties. To avoid possible attacks, the random number a is discarded once it has been used, so each secret sharing requires a fresh random number. In this way, each data provider can secret-share each bit of its local FMS sketch with all calculators based on random numbers jointly determined by the calculators.
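The masked input sharing can be sketched as below, following the usual SPDZ input step. The exact sharing rule of the embodiment is not reproduced here, so two details are assumptions: the first computing party adds the public difference to its share, and every party adjusts its MAC share by its key share times the difference. All names are hypothetical and all parties are simulated in one process.

```python
import secrets
from typing import List, Tuple

P = 2**61 - 1  # toy prime modulus (assumption)


def share(value: int, n_parties: int) -> List[int]:
    """Additive shares of value modulo P."""
    parts = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    parts.append((value - sum(parts)) % P)
    return parts


def share_input(x: int, a_shares: List[int], mac_a_shares: List[int],
                key_shares: List[int]) -> Tuple[int, List[int], List[int]]:
    """Turn shares of a one-time random mask a into shares of the input x."""
    a = sum(a_shares) % P                      # data provider reconstructs the mask a
    diff = (x - a) % P                         # broadcast value (x - a)
    x_shares = list(a_shares)
    x_shares[0] = (x_shares[0] + diff) % P     # assumption: one party adds the public diff
    mac_x = [(g + d * diff) % P                # MAC share of x = MAC share of a + key share * diff
             for g, d in zip(mac_a_shares, key_shares)]
    return diff, x_shares, mac_x
```

The broadcast value x - a is uniformly distributed as long as a stays hidden, which is why a must be discarded after a single use.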
Step 102: each calculator determines a target probabilistic data structure based on all the received local probabilistic data structures, and determines an initial estimation result based on the target probabilistic data structure and all the received random noise.
As one possible implementation, each calculator computes the FMS sketch of the union of all statistical data as follows:
Assume that each FMS sketch consists of m bit strings $\{B_i\}_{i=1,\dots,m}$ and that the data providers hold d local FMS sketches. Let $B_i^{k}[l]$ denote the l-th bit of the i-th bit string in the k-th local FMS sketch, and define $B_i^*[l]$ as:
$$B_i^*[l] = \sum_{k=1}^{d} B_i^{k}[l]$$
Clearly, when $B_i^*[l] > 0$, $B_i[l] = 1$, and when $B_i^*[l] = 0$, $B_i[l] = 0$, where $B_i[l]$ is the l-th bit of the i-th bit string in the FMS sketch of the union.
In the SPDZ protocol, for data x and y secret-shared to computing party j (where the global key share of computing party j is $\Delta_j$, and $\gamma(x)_j$ and $\gamma(y)_j$ are the MAC shares for x and y), the addition of secret numbers is defined share-wise:
$$(x + y)_j = x_j + y_j, \qquad \gamma(x + y)_j = \gamma(x)_j + \gamma(y)_j$$
Thus, given the definitions of $B_i^*[l]$ and $B_i[l]$ above, to obtain $\langle B_i[l] \rangle$ (the secret representation of $B_i[l]$) from the received secret-shared values of all local FMS sketches, each calculator first computes $\langle B_i^*[l] \rangle$, the secret representation of $B_i^*[l]$, from the secret representations $\langle B_i^{k}[l] \rangle$, and then derives the value of $\langle B_i[l] \rangle$ according to whether the result is 0.
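In the clear, the union rule above (sum the corresponding bits, then test for zero) reduces to a bitwise OR of the local sketches. The following sketch shows only this plaintext logic; the secure version operates on secret representations, and the function name `merge_sketches` is hypothetical.

```python
from typing import List

Sketch = List[List[int]]  # m bit strings, each a list of w bits


def merge_sketches(local_sketches: List[Sketch]) -> Sketch:
    """Plaintext union: B_i[l] = 1 if the sum of B_i^k[l] over all local sketches is > 0."""
    m = len(local_sketches[0])
    w = len(local_sketches[0][0])
    merged = [[0] * w for _ in range(m)]
    for i in range(m):
        for l in range(w):
            b_star = sum(s[i][l] for s in local_sketches)   # B_i^*[l]
            merged[i][l] = 1 if b_star > 0 else 0           # zero test
    return merged
```

Using a sum followed by a zero test, rather than a direct OR, matters in the secret-shared setting because addition is a local operation on shares.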
In this embodiment, each computing party computes the FMS sketch of the union of all statistical data to obtain the target probabilistic data structure, computes the Z variable of the target probabilistic data structure, and finally securely adds the random noise to the Z variable to obtain the initial estimation result, thereby achieving differential privacy protection.
The Z variable represents the number of bits in the target probabilistic data structure that are still 0, i.e.:
$$Z = \sum_{i=1}^{m} \sum_{x} \big(1 - B_i[x]\big) = mw - \sum_{i=1}^{m} \sum_{x} B_i[x]$$
where $B_i[x]$ is the x-th bit of the i-th bit string, x ranges over the w positions of a bit string, and the FMS sketch consists of m bit strings, each of length w.
Illustratively, each computing party j holds a share of each noise variable $\langle N_k \rangle$, k = 1, …, d, where d is the number of data providers, and computing party j adds the noise variables: $\langle N \rangle = \sum_{k=1}^{d} \langle N_k \rangle$, where N is the noise that ultimately needs to be added.
Given $\langle Z \rangle$ and $\langle N \rangle$, each computing party j computes the initial estimation result $\langle x \rangle = \langle Z \rangle + \langle N \rangle$, where the FMS sketch consists of m bit strings, each of length w, and c is the number of data providers.
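Because Z and the noise sum are linear in the shared values, each computing party can obtain its share of the initial estimation result locally. The sketch below assumes plain additive shares over a toy prime field, omits MACs, and lets the first party add the public constant m·w; these choices and all helper names are assumptions for illustration.

```python
import secrets
from typing import List

P = 2**61 - 1  # toy prime modulus (assumption)


def share(value: int, n_parties: int) -> List[int]:
    """Additive shares of value modulo P (negative values wrap around)."""
    parts = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    parts.append((value - sum(parts)) % P)
    return parts


def estimate_share(j: int, bit_shares_j: List[int], noise_shares_j: List[int],
                   m: int, w: int) -> int:
    """Party j's share of x = (m*w - sum of union bits) + sum of noise values."""
    const = m * w if j == 0 else 0   # public constant added by one party (assumption)
    return (const - sum(bit_shares_j) + sum(noise_shares_j)) % P


def to_signed(v: int) -> int:
    """Interpret a field element as a signed integer."""
    return v - P if v > P // 2 else v


def demo() -> int:
    m, w, parties = 2, 3, 3
    union_bits = [1, 0, 0, 1, 1, 0]          # flattened union sketch, Z = 3
    noises = [2, -1]                         # one noise value per data provider
    bit_sh = [share(b, parties) for b in union_bits]
    noise_sh = [share(n, parties) for n in noises]
    x_sh = [estimate_share(j, [s[j] for s in bit_sh], [s[j] for s in noise_sh], m, w)
            for j in range(parties)]
    return to_signed(sum(x_sh) % P)          # opened initial estimation result
```

In this toy run the opened value equals Z plus the total noise, and no party ever sees Z or an individual noise value on its own.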
Step 103: any one of the calculators determines a target cardinality estimation result based on the initial estimation result.
As a possible implementation, assuming that FMS sketch consists of m bit strings, each bit string length w, the expectation of Z is:
wherein:
Therefore, when the cardinality is estimated, the computed sum of Z and all the random noise (i.e., the initial estimation result) is taken as E(Z), and the target cardinality estimation result n is obtained by Newton's iteration or by bisection.
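The inversion of E(Z) by bisection can be sketched as follows. The expectation model in `expected_zeros` is an assumption made for illustration (uniform choice among the m bit strings and geometrically decaying position probabilities, with the last position absorbing the remainder); it is not the formula of the embodiment. Only the bisection itself is the technique named in the text, and it works for any expectation that decreases monotonically in n.

```python
def expected_zeros(n: float, m: int, w: int) -> float:
    """Assumed model of E(Z): expected number of bits still 0 after n distinct elements."""
    total = 0.0
    for l in range(w):
        # assumed probability that an element lands on position l of its bit string
        p_pos = 2.0 ** -(l + 1) if l < w - 1 else 2.0 ** -(w - 1)
        total += (1.0 - p_pos / m) ** n
    return m * total


def estimate_cardinality(observed: float, m: int, w: int,
                         hi: float = 1e12, iters: int = 200) -> float:
    """Solve expected_zeros(n) = observed for n by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_zeros(mid, m, w) > observed:
            lo = mid    # more zeros expected than observed: the cardinality is larger
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Feeding the model its own output, for example `estimate_cardinality(expected_zeros(1000, 8, 8), 8, 8)`, recovers a value very close to 1000; with a noisy observation the result is an estimate rather than an exact inverse.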
Optionally, in an embodiment, the determining, by any one of the computing parties, a target cardinality estimation result based on the initial estimation result includes:
each computing party obtains a global key share corresponding to each computing party, determines a first verification result corresponding to the initial estimation result based on the initial estimation result and the global key share, and broadcasts the first verification result;
any one of the calculators receives first verification results of other calculators, and all the first verification results are added to obtain a second verification result;
if the second verification result is not equal to 0, stopping cardinality estimation by all the data providers and all the calculators;
and if the second verification result is equal to 0, any one of the calculators determines a target cardinality estimation result based on the initial estimation result.
As a possible implementation, all the computing parties execute the initialization phase of the SPDZ protocol to determine a global key and each obtain a secret share of the global key, thereby obtaining a global key share.
Illustratively, assume there are c data providers in total, N_k is the noise variable of data provider k, and Z is the number of 0 bits in the probabilistic data structure of the union. In the SPDZ framework, before the initial estimation result x (the sum of Z and all N_k) is disclosed, the authenticity of x is first checked. The check is as follows:
Each computing party calculates its first verification result and then broadcasts it publicly. The computing parties check whether the second verification result, obtained by adding all the first verification results, is 0. If the result is not 0, the calculation result has been tampered with and the program stops. If the result is 0, the calculation is correct, and the computing party discloses x in order to calculate and output the target cardinality estimation result.
Here, each computing party j can determine its first verification result from the initial estimation result x and its global key share Δ_j: before disclosing x, it calculates x·Δ_j from the currently stored initial estimation result x, and the result is then used to verify whether the calculation result has been tampered with.
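The verification above can be sketched as follows. This is a minimal illustration in the style of the standard SPDZ MAC check, not necessarily the exact expressions of this embodiment: it assumes that each computing party j holds an additive share of an authentication value equal to Δ·x in addition to its key share Δ_j, and that arithmetic is done modulo a prime. The modulus `P` and the names `share`, `first_verification` and `mac_check` are hypothetical.

```python
import secrets

P = 2**61 - 1  # assumed prime modulus of the finite field


def share(value, parties):
    """Split value into additive shares modulo P (one per computing party)."""
    parts = [secrets.randbelow(P) for _ in range(parties - 1)]
    parts.append((value - sum(parts)) % P)
    return parts


def first_verification(mac_share, key_share, x):
    """First verification result of one party: its MAC share minus key_share * x."""
    return (mac_share - key_share * x) % P


def mac_check(mac_shares, key_shares, x):
    """Second verification result: the sum of all first results must be 0."""
    sigmas = [
        first_verification(g, d, x) for g, d in zip(mac_shares, key_shares)
    ]
    return sum(sigmas) % P == 0
```

If x is unchanged, the first verification results sum to Δ·x − Δ·x = 0. If x was altered to x', the sum is Δ·(x − x'), which is nonzero unless the adversary guesses Δ.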
The cardinality estimation method provided in the above embodiment is described below using disease-category statistics in a medical system as an example usage scenario. Here, a plurality of hospitals serve as data providers and count the patient information (including patient identity information and disease category) recorded in their local medical systems. Each hospital uses the patient information recorded in its local medical system to construct a local FMS sketch, which reduces the resources consumed in counting patient information and therefore the computing resources required of the data provider. Each hospital also generates random noise and sends its local FMS sketch and the random noise to all the computing parties. Each computing party determines, from the local FMS sketches received from all hospitals, a target FMS sketch that represents the union of all patient information, obtains from the target FMS sketch an estimator that can be used to solve for the number of disease categories, and estimates the number of disease categories after adding all the received random noise to that estimator. As a result, an attacker cannot infer the patient information of any patient from the estimation result, and differential privacy protection of patient information is achieved in the disease-category estimation (that is, cardinality estimation).
It can be understood that the cardinality estimation method provided by the above embodiment uses an original FMS sketch, which needs only a few kB of space to count billions of data items. In addition, only one hash operation is needed to update the sketch for each element in the data stream, whereas the traditional scheme needs several thousand hash operations per update, so the method requires very few computing resources from the data provider and collects data quickly. Furthermore, the embodiment of the present invention achieves differential privacy by having a plurality of data providers supply Gaussian noise, which addresses the difficulty traditional algorithms have in guaranteeing differential privacy. Finally, owing to the structure of the FMS sketch used in the embodiment of the present invention, the sensitivity of the calculated variable Z is very low, so only very little noise needs to be added to achieve differential privacy.
Specifically, the sensitivity of an algorithm f(x) is defined as: Δf = max |f(x) − f(y)|.
Where x and y are arbitrary data sets that differ by only one datum. For example, x = { a, b, c, d, e }, y = { a, b, c, d }.
Based on the above sensitivity definition, the sensitivity of the variable Z used by FMS sketch to estimate cardinality is 1.
As for the amount of noise to be added, if a Gaussian mechanism is used to implement differential privacy, the noise to be added is Gaussian noise with mean μ and variance σ², where δ is a privacy parameter and ε indicates the level of privacy protection.
With the other parameters fixed, the variance of the noise that the FMS sketch needs to add to achieve differential privacy is very small compared with that of the FM sketch (sensitivity w·m) and the HLL sketch (sensitivity w + 1).
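To make the effect of sensitivity on the noise concrete, the following sketch uses the classic Gaussian-mechanism calibration σ = Δf·sqrt(2·ln(1.25/δ))/ε, which is valid for 0 < ε < 1. This calibration is an assumption chosen for illustration and may differ from the distribution used in this embodiment. Splitting the variance evenly across the c data providers is likewise an assumption, and the function names are hypothetical.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Standard deviation of the classic Gaussian mechanism (0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def provider_noise(sigma, c):
    """Noise drawn by one of c data providers (assumed even split).

    The sum of c independent N(0, sigma^2 / c) samples has variance sigma^2.
    """
    return random.gauss(0.0, sigma / math.sqrt(c))


# Compare the required standard deviation for the three sensitivities
# named in the text, with example parameters m = 1024, w = 32.
m, w = 1024, 32
epsilon, delta = 0.5, 1e-5
sigma_fms = gaussian_sigma(1, epsilon, delta)      # FMS sketch
sigma_hll = gaussian_sigma(w + 1, epsilon, delta)  # HLL sketch
sigma_fm = gaussian_sigma(w * m, epsilon, delta)   # FM sketch
```

Because σ scales linearly with the sensitivity, the variance required for a sensitivity of 1 is smaller by a factor of (w + 1)² than for the HLL sketch and (w·m)² than for the FM sketch under this calibration.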
The embodiment of the present invention also provides a cardinality estimation system, which comprises a plurality of data providers and a plurality of calculators;
the data providers are used for sending the respectively established local probability data structures and the respectively generated random noise to the calculators;
each of the calculators is configured to determine a target probability data structure based on all the received local probability data structures, and determine an initial estimation result based on the target probability data structure and all the received random noise;
any one of the calculators is further configured to determine a target cardinality estimation result based on the initial estimation result.
In an optional manner, the data provider is further configured to: collect the statistical data used for cardinality estimation, and perform hash mapping on the statistical data to obtain a random bit string; determine, based on the random bit string, the corresponding data position in an established probabilistic data structure whose bits are all zero; and set the data at that position of the all-zero probabilistic data structure to 1 to obtain the local probabilistic data structure.
In an optional manner, the data provider is further configured to obtain a hash key, and perform hash mapping on the statistical data and the hash key to obtain the random bit string.
In an alternative, the local probabilistic data structure contains a set number of bit strings, each of which contains a one-dimensional bit vector of a set length.
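The construction of the local probabilistic data structure described above can be sketched as follows. The rule that maps the random bit string to a data position is an assumption made for illustration (one bit string chosen uniformly, bit position chosen geometrically from the trailing zero bits), as is the use of HMAC-SHA256 as the keyed hash. The function names are hypothetical, and the merge is shown on plaintext sketches, whereas the computing parties of this embodiment operate on secret-shared data.

```python
import hashlib
import hmac


def new_sketch(m, w):
    """All-zero structure: m bit strings, each a bit vector of length w."""
    return [[0] * w for _ in range(m)]


def add_element(sketch, element, key):
    """Hash the element with the hash key and set exactly one bit to 1."""
    m, w = len(sketch), len(sketch[0])
    digest = hmac.new(key, element, hashlib.sha256).digest()
    value = int.from_bytes(digest, "big")
    row = value % m      # which bit string (assumed uniform choice)
    rest = value // m
    pos = 0              # assumed geometric choice: count trailing zero bits
    while pos < w - 1 and (rest >> pos) & 1 == 0:
        pos += 1
    sketch[row][pos] = 1


def merge(sketches):
    """Bitwise OR of several sketches, representing the union of their data."""
    return [[max(bits) for bits in zip(*rows)] for rows in zip(*sketches)]


def zero_bits(sketch):
    """Z: the number of bits that are still 0."""
    return sum(row.count(0) for row in sketch)
```

Because every data provider uses the same hash key, an element held by several providers sets the same bit in each local sketch, so duplicates do not change the merged result.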
In an optional manner, the data provider is further configured to: obtain data to be sent and a finite field, where the data to be sent is random noise or a single bit in the local probabilistic data structure; determine, based on the finite field, a plurality of secret sharing values corresponding to the data to be sent, where the number of secret sharing values corresponding to the data to be sent is the same as the number of computing parties; and send one secret sharing value to each computing party, where the secret sharing values received by different computing parties are different.
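The sharing step can be illustrated with additive secret sharing over a prime field. This is a minimal sketch under assumptions: the text only requires a finite field, so the modulus `P` and the names `make_shares` and `reconstruct` are hypothetical, and real-valued noise is assumed to have been rounded to an integer before sharing.

```python
import secrets

P = 2**61 - 1  # assumed prime modulus defining the finite field


def make_shares(value, num_parties):
    """Split value into num_parties additive shares that sum to value mod P."""
    shares = [secrets.randbelow(P) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares


def reconstruct(shares):
    """Recover the shared value by adding all shares modulo P."""
    return sum(shares) % P


# A single sketch bit and a (rounded, possibly negative) noise value.
bit_shares = make_shares(1, 3)
noise_shares = make_shares(-5 % P, 3)
```

Any subset of fewer than all shares is uniformly random and reveals nothing about the value. Shares are also additive, so each computing party can add its shares of several values locally to obtain a share of their sum.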
In an optional manner, the plurality of computing parties are further configured to generate random number share values corresponding to the data to be sent, respectively;
the data provider is further configured to receive each random number share value, determine a random number corresponding to the data to be sent based on each random number share value, determine a difference between the data to be sent and the random number corresponding to the data to be sent, and determine a plurality of secret sharing values corresponding to the difference based on the finite field.
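The masked variant above can be sketched as follows. How each computing party combines its random number share with the share of the difference is not stated in this passage, so the final addition in `party_share` is an assumption, as are the modulus `P` and all function names.

```python
import secrets

P = 2**61 - 1  # assumed prime modulus defining the finite field


def additive_shares(value, num_parties):
    """Split value into additive shares modulo P."""
    shares = [secrets.randbelow(P) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares


def share_difference(x, random_shares):
    """Data provider side: rebuild the random number r, then share x - r."""
    r = sum(random_shares) % P
    diff = (x - r) % P
    return additive_shares(diff, len(random_shares))


def party_share(random_share, diff_share):
    """Computing party side (assumed): own random share plus difference share."""
    return (random_share + diff_share) % P


# Three computing parties each generate one random number share value.
random_shares = [secrets.randbelow(P) for _ in range(3)]
diff_shares = share_difference(42, random_shares)
final_shares = [party_share(r, d) for r, d in zip(random_shares, diff_shares)]
```

The final shares add up to r + (x − r) = x, so the computing parties end up holding a sharing of the data to be sent without any of them learning it.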
In an optional manner, each of the computing parties is further configured to obtain a respective corresponding global key share, determine, based on the initial estimation result and the global key share, a first verification result corresponding to the initial estimation result, and broadcast the first verification result;
any one of the calculators is also used for receiving first verification results of other calculators and adding all the first verification results to obtain a second verification result;
if the second verification result is not equal to 0, stopping cardinality estimation by all the data providers and all the calculators;
if the second verification result is equal to 0, any one of the calculators is further configured to determine a target cardinality estimation result based on the initial estimation result.
The descriptions of the modules refer to the corresponding descriptions in the method embodiments, and are not repeated herein.
The embodiment of the present invention also provides a non-volatile computer storage medium, wherein the computer storage medium stores at least one executable instruction, and the executable instruction can execute the cardinality estimation method in any of the above method embodiments.
Fig. 3 is a schematic structural diagram of a computing device according to an embodiment of the present invention; the specific embodiments of the present invention do not limit the specific implementation of the computing device. As shown in Fig. 3, the computing device may include a processor 1002, a communication interface 1004, a memory 1006, and a communication bus 1008, wherein:
the processor 1002, communication interface 1004, and memory 1006 communicate with each other via a communication bus 1008.
A communication interface 1004 for communicating with network elements of other devices, such as clients or other servers. The processor 1002 is configured to execute the program 1010, and may specifically perform the relevant steps in the above embodiments of the cardinality estimation method.
In particular, the program 1010 may include program code that includes computer operating instructions. The processor 1002 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 1006 is used for storing the program 1010. The memory 1006 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
The program 1010 may be specifically configured to cause the processor 1002 to execute the cardinality estimation method in any of the method embodiments described above. For the specific implementation of each step in the program 1010, reference may be made to the corresponding steps and descriptions in the foregoing embodiments of the cardinality estimation method, which are not repeated here. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not repeated here.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose preferred embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Claims (10)
1. A cardinality estimation method, comprising:
the multiple data providers send the respective established local probability data structures and the respective generated random noise to the multiple calculators;
each of the calculators determines a target probability data structure based on all the received local probability data structures, and determines an initial estimation result based on the target probability data structure and all the received random noise;
any one of the calculators determines a target cardinality estimation result based on the initial estimation result.
2. The method of claim 1, wherein for each of the data providers, the local probabilistic data structure is determined by:
the data provider collects statistical data used for cardinality estimation, and performs hash mapping on the statistical data to obtain a random bit string;
the data provider determines a data position corresponding to the established zero bit value probability data structure based on the random bit string;
and the data provider sets the data of the data position corresponding to the zero bit value probability data structure to be 1 to obtain the local probability data structure.
3. The method of claim 2, wherein for each of the data providers, said hash mapping the statistical data to obtain a random bit string comprises:
and the data provider acquires a hash key, and performs hash mapping on the statistical data and the hash key to obtain the random bit string.
4. The method of claim 2, wherein the local probabilistic data structure comprises a set number of bit strings, each of the bit strings comprising a one-dimensional bit vector of a set length.
5. The method of claim 1, wherein for each of said data providers, said data provider sends a local probabilistic data structure and random noise to a plurality of said calculators, comprising:
the data provider acquires data to be sent and a finite field, wherein the data to be sent is random noise or a single bit in the local probability data structure;
the data provider determines a plurality of secret sharing values corresponding to the data to be sent based on the finite field, wherein the number of the secret sharing values corresponding to the data to be sent is the same as the number of the calculators;
and the data provider sends each secret sharing value to each calculator, wherein the secret sharing values received by different calculators are different.
6. The method of claim 5, wherein for each of the data providers, the data provider determines a plurality of secret sharing values corresponding to the data to be sent based on the finite field, and the determining comprises:
a plurality of computing parties respectively generate random number share values corresponding to the data to be sent;
the data provider receives each random number share value, determines a random number corresponding to the data to be sent based on each random number share value, determines a difference value between the data to be sent and the random number corresponding to the data to be sent, and determines a plurality of secret sharing values corresponding to the difference value based on the finite field.
7. The method of any of claims 1 to 6, wherein determining, by any of the calculators, a target cardinality estimate based on the initial estimate comprises:
each computing party obtains a global key share corresponding to each computing party, determines a first verification result corresponding to the initial estimation result based on the initial estimation result and the global key share, and broadcasts the first verification result;
any one of the calculators receives first verification results of other calculators, and adds all the first verification results to obtain a second verification result;
if the second verification result is not equal to 0, stopping the cardinality estimation of all the data providers and all the calculators;
and if the second verification result is equal to 0, any one of the calculators determines a target cardinality estimation result based on the initial estimation result.
8. A cardinality estimation system comprising a plurality of data providers and a plurality of calculators;
the data providers are used for sending the respectively established local probability data structures and the respectively generated random noise to the calculators;
each of the calculators is configured to determine a target probability data structure based on all the received local probability data structures, and determine an initial estimation result based on the target probability data structure and all the received random noise;
any one of the calculators is further configured to determine a target cardinality estimation result based on the initial estimation result.
9. A computing device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the cardinality estimation method of any one of claims 1-7.
10. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the cardinality estimation method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210866709.4A CN115270176A (en) | 2022-07-22 | 2022-07-22 | Radix estimation method, system, computing device and computer storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115270176A (en) | 2022-11-01 |
Family
ID=83767773
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210866709.4A (Pending; CN115270176A (en)) | 2022-07-22 | 2022-07-22 | Radix estimation method, system, computing device and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115270176A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116800637A (en) * | 2023-08-28 | 2023-09-22 | 北京傲星科技有限公司 | Method for estimating base number of data item in data stream and related equipment |
CN116800637B (en) * | 2023-08-28 | 2023-10-24 | 北京傲星科技有限公司 | Method for estimating base number of data item in data stream and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||