CN114153319A - User multi-data scene oriented frequent character string mining method - Google Patents

User multi-data scene oriented frequent character string mining method Download PDF

Info

Publication number
CN114153319A
CN114153319A (application CN202111488643.1A; granted as CN114153319B)
Authority
CN
China
Prior art keywords
data
user
frequency
character string
character strings
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111488643.1A
Other languages
Chinese (zh)
Other versions
CN114153319B (en)
Inventor
刘晓琳
王宁
石恬
石佳鹭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202111488643.1A priority Critical patent/CN114153319B/en
Publication of CN114153319A publication Critical patent/CN114153319A/en
Application granted granted Critical
Publication of CN114153319B publication Critical patent/CN114153319B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233Character input methods
    • G06F3/0237Character input methods using prediction or retrieval techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods


Abstract

The invention provides a frequent character string mining method oriented to user multi-data scenarios, comprising the following steps: 1) dividing users: split the users into two parts according to a truncation ratio, one part used to construct an adaptive prefix tree and the other to enhance the consistency of node support values; 2) initialize the root node, construct the adaptive prefix tree from top to bottom, perturb data with the wheel mechanism, and estimate the frequency of the prefix corresponding to each node that is not '&' (the end-of-string symbol); 3) add all character strings corresponding to leaf nodes labelled '&' to an alternative set; 4) apply the wheel mechanism to the data of the second part of users, who did not participate in prefix-tree construction, to obtain a frequency estimate for each character string in the alternative set; 5) obtain more accurate frequency estimates of the alternative-set character strings through calculation, sort the character strings by frequency estimate, and finally select the top k most frequent character strings.

Description

User multi-data scene oriented frequent character string mining method
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a user multi-data scene oriented frequent character string mining method.
Background
Most smartphones now rely on software keyboards for text input, and to make typing easier, mobile operating systems often suggest common words to users. Providing accurate word suggestions requires building a dictionary of words the user is likely to use; building such a dictionary requires mining frequent character strings from the user side. Ideally, the mobile operating system would collect string-usage frequency information from users and submit it to a collector; the collector would then filter out the top k character strings most frequently used, include them in the mobile keyboard dictionary, and push updates to users' devices. Clearly, such direct collection of string-usage data would violate user privacy.
Local Differential Privacy (LDP) has become a powerful privacy standard for collecting sensitive user information without a trusted third party. In the LDP model, each user first perturbs their own data locally and then sends the processed data to a collector; the collector performs statistical analysis on the collected data, obtaining useful analysis results while ensuring that no individual's private information is revealed. Frequent character string mining techniques protect user privacy through the LDP model.
Normally, a user owns more than one character string. At present, algorithms such as IBSL and PrivTrie exist for LDP-compliant frequent string mining. However, because of limitations of their noise-addition methods, for a scenario in which a user owns multiple character strings these algorithms usually extract one string at random from each user's strings for analysis, and this random extraction inevitably introduces error, reducing accuracy.
In addition, in real life the character strings owned by a single user often repeat, and different character strings may share the same prefix; each mining round of PrivTrie collects frequent prefixes of the users' strings, so a user's data repeats. If a prefix or string repeats more often, its frequency should be correspondingly greater.
Disclosure of Invention
The invention aims to solve the above problems in the prior art, and provides a frequent character string mining method for user multi-data scenarios that adopts the wheel mechanism, instead of the optimized random response mechanism, as its noise-addition method, enabling the server to obtain an unbiased distribution over the data domain and thereby improving accuracy.
The purpose of the invention is realized by the following technical scheme: the frequent character string mining method oriented to user multi-data scenarios comprises the following steps:
1) dividing users: split the users into two parts according to a truncation ratio, one part used to construct an adaptive prefix tree and the other to enhance the consistency of node support values; randomly divide the first part of the users into several equal-size groups, used to construct the adaptive prefix tree in each round;
2) initializing a root node, and constructing a self-adaptive prefix tree from top to bottom:
(1) a user side:
a. setting the number of all non-accessed non-leaf nodes in the prefix tree as d, and numbering the nodes from 0; the path from the child node of the root node to each of these nodes constitutes a prefix, whereby the number of prefixes corresponding to the above nodes is also d;
b. establishing a null array for each user to store the data of the user; traversing all the character strings owned by the user, if the prefix of a certain character string is just one of the d prefixes, adding the serial number of the corresponding node of the prefix into a data array established for the user;
c. the data of the user is subjected to noise addition by using a randomizer of a wheel mechanism, and the randomizer randomly extracts one or more samples as output;
d. the user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of a group of users are extracted as samples; obtaining frequency estimation of the whole data domain, namely frequency distribution of the prefixes according to statistical data through a decoder of a wheel mechanism so as to facilitate subsequent pruning and next round construction;
3) if a node is not '&' (the end-of-string symbol) and the frequency estimate of its corresponding prefix satisfies c'(v) ≥ θ, where θ is a threshold
[threshold formula given only as an image in the original],
mark the unvisited non-leaf node and expand it;
4) repeat the above operations; when all groups of users in the first part have participated in construction or no nodes remain to be expanded, the construction of the prefix tree ends; add all character strings corresponding to leaf nodes labelled '&' to an alternative set;
5) and applying a wheel mechanism to the user data of which the second part does not participate in the prefix tree construction to obtain the frequency estimation of each character string in the alternative set:
(1) a user side:
a. let the number of character strings in the alternative set be d, and number the character strings in the alternative set from 0;
b. establishing a data array for each user; traversing each character string owned by the user, and if a certain character string is just the character string in the alternative set, adding the serial number of the character string in the alternative set into the data of the user;
c. the user data is subjected to noise addition by using a randomizer of a wheel mechanism, and the randomizer randomly extracts one or more samples as output;
d. the user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of the user are extracted as samples; obtaining frequency estimation of the whole data domain through a decoder of a wheel mechanism according to the statistical data, namely frequency distribution of each character string in the alternative set;
6) obtain more accurate frequency estimates of the alternative-set character strings through calculation, sort the character strings by frequency estimate, and finally select the top k most frequent character strings.
In the method for mining frequent character strings for user multi-data scenarios, the noise-addition process is completed by means of a random function f of the wheel mechanism, where f satisfies ε-LDP if and only if, for any two input values t1 and t2 and any output value t*, the constraint Pr[f(t1) = t*] ≤ e^ε · Pr[f(t2) = t*] holds;
wherein ε is the privacy budget, representing the strength of privacy protection; ε-LDP guarantees that, from the perturbed tuple t*, the data collector cannot infer with a probability ratio exceeding e^ε whether the original tuple was t1 or t2; this means that the smaller ε is, the stronger the privacy protection.
The wheel mechanism maps data to one or more points on the cycle wheel and designs a calibrated probability distribution satisfying ε-LDP based on those points; according to the resulting probability distribution, a value is sampled from the wheel as output data. The mechanism requires only O(log2(e^ε + 1)) communication; since the mapping process is done by a user-specific hash function, the computational overhead is O(m), where m is the number of character strings owned by each user. Through the designed perturbation mechanism, the server can obtain an unbiased estimate within an optimized error range.
In the method for mining frequent character strings for user multi-data scenarios, in step 2), (1), c, the result of the randomizer is an array of length d; if i is extracted as a sample, the ith element of the array is 1, otherwise it is 0.
In the method for mining frequent character strings for user multi-data scenarios, in step 5), (1), c, the result of the randomizer is an array of length d; if i is extracted as a sample, the ith element of the array is 1, otherwise it is 0.
In the method for mining frequent character strings for user multi-data scenarios, in step 5), (2), for the estimated frequency count c(v) of each character string, the final frequency count c'(v) is calculated as follows:
[formula given only as an image in the original]
where λ is the truncation ratio of the users.
In the mining method for the frequent character strings of the multiple data scenes of the user, an encoding strategy for the repeated data is constructed:
(1) a user side:
for a group of user data, encoding is carried out according to the maximum number of repetitions in a single user's data, with multiple occurrences of the same data regarded as different data; let the maximum repetition count of a single datum be μ, the size of the original data domain C be d, and the new data domain be C'; in C', with i and j counted from 0, the (i·μ+j)th data represents the jth occurrence of the ith data in C; the size of C' is therefore d·μ; for each datum of the user, if it is the jth occurrence of the ith data, change it from i to (i·μ+j); the encoded user data is processed by the randomizer of the wheel mechanism, and the result is submitted to the server;
(2) the server side:
the frequency estimates obtained by the decoder of the wheel mechanism are restored to frequency estimates for the original data domain; let c'(v) be the frequency estimates derived by the decoder and c(v_i) the frequency estimate of the ith data in the original data domain; then
c(v_i) = Σ_{j=0}^{μ−1} c'(v_{i·μ+j})
Therefore, the frequency estimation of the original data domain can be obtained, and the next operation is carried out.
Compared with the prior art, the mining method for the frequent character strings of the user multi-data scene has the following beneficial effects:
1. on the basis of the PrivTrie frequent character string mining technology, an optimized random response mechanism is replaced by a wheel mechanism, and consistency implementation based on the optimized random response mechanism is removed. The PrivTrie with the wheel mechanism as the noise adding method can complete the frequent character string mining of a plurality of character string scenes of a user and has higher accuracy.
2. The communication cost and the calculation overhead of the wheel mechanism are small, and by designing a disturbance mechanism, no matter the data of the user is set value data or classified data, the server side can obtain unbiased estimation in an optimized error range, and the method is suitable for a scene that the user has a plurality of character strings.
3. A coding strategy for repeated data is designed, the strategy solves the problem that the character strings owned by the user are repeated, and the strategy is suitable for both the process of constructing the self-adaptive prefix tree and the process of estimating the frequency of the character strings in the alternative set.
Drawings
FIG. 1 is a PrivTrie flow chart based on the wheel mechanism of the present invention.
Fig. 2 is a flow chart of a user terminal in the process of constructing a prefix tree in the present invention.
Fig. 3 is a flow chart of a server in the prefix tree construction process of the present invention.
Fig. 4 is a flow chart of the user end of the process of estimating the frequency of the character strings of the alternative set in the present invention.
FIG. 5 is a flow chart of a server side of the process of estimating the frequency of the character strings of the alternative set in the present invention.
Fig. 6 is a table of an example of a single user data implementation encoding strategy in the present invention.
FIG. 7 is a table illustrating an example of frequency estimates for restoring the original data field in accordance with the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
as shown in fig. 1, the present invention introduces the wheel mechanism as its noise-addition method. The wheel mechanism maps the data to one or more points on the wheel and then designs a calibrated probability distribution satisfying ε-LDP based on these points. It then samples one or more values from the wheel as output according to that probability distribution. Through this carefully designed random distribution, the server can derive an unbiased distribution over the data domain. Hence, even for a scenario where the user owns multiple data items, the wheel mechanism still allows the server to obtain an unbiased estimate.
The mining method for the frequent character strings of the user multi-data scene comprises the following steps:
1) dividing users: split the users into two parts according to a truncation ratio, one part used to construct an adaptive prefix tree and the other to enhance the consistency of node support values; randomly divide the first part of the users into several equal-size groups, used to construct the adaptive prefix tree in each round;
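As an illustrative sketch only (the patent gives no code), step 1) can be written as follows; the function name, the `num_groups` parameter, and the convention that the truncation ratio λ is the fraction assigned to tree construction are our assumptions:

```python
import random

def divide_users(users, truncation_ratio, num_groups, seed=0):
    # Shuffle, then cut the population at the truncation ratio (lambda):
    # the first part builds the adaptive prefix tree, the second part
    # later refines the alternative-set frequency estimates.
    rng = random.Random(seed)
    shuffled = list(users)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * truncation_ratio)
    first_part, second_part = shuffled[:cut], shuffled[cut:]
    # The first part is split into equal-size groups, one group per round.
    group_size = len(first_part) // num_groups
    groups = [first_part[i * group_size:(i + 1) * group_size]
              for i in range(num_groups)]
    return groups, second_part
```

For example, with 100 users, λ = 0.6 and 3 construction rounds, this yields three groups of 20 users and a second part of 40 users.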
2) initializing a root node, and constructing a self-adaptive prefix tree from top to bottom:
(1) a user side:
a. setting the number of all non-accessed non-leaf nodes in the prefix tree as d, and numbering the nodes from 0; the path from the child node of the root node to each of these nodes constitutes a prefix, whereby the number of prefixes corresponding to the above nodes is also d;
b. establishing a null array for each user to store the data of the user; traversing all the character strings owned by the user, if the prefix of a certain character string is just one of the d prefixes, adding the serial number of the corresponding node of the prefix into a data array established for the user;
c. the user's data is perturbed by the randomizer of the wheel mechanism, which randomly extracts one or more samples as output; the result of the randomizer is an array of length d, with the ith element being 1 if i is extracted as a sample and 0 otherwise.
d. The user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of a group of users are extracted as samples; obtaining frequency estimation of the whole data domain, namely frequency distribution of the prefixes according to statistical data through a decoder of a wheel mechanism so as to facilitate subsequent pruning and next round construction;
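Steps a–b of the user side amount to matching each owned string (with '&' appended as the end-of-string symbol) against the d numbered frontier prefixes; a minimal sketch under that reading, with a hypothetical helper name:

```python
def build_user_data(user_strings, numbered_prefixes):
    # numbered_prefixes: the d prefixes of the unvisited non-leaf nodes,
    # where the index in the list is the node number (numbered from 0).
    data = []
    for s in user_strings:
        terminated = s + '&'  # '&' marks the end of a complete string
        for number, prefix in enumerate(numbered_prefixes):
            if terminated.startswith(prefix):
                data.append(number)
    return data

prefixes = ['ca', 'co', 'do']  # frontier nodes 0, 1, 2
print(build_user_data(['cat', 'cow', 'dog'], prefixes))  # [0, 1, 2]
```

In an actual round the frontier prefixes do not overlap, so each string contributes to at most one node; the resulting array is what the randomizer then perturbs.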
3) if a node is not '&' (the end-of-string symbol) and the frequency estimate of its corresponding prefix satisfies c'(v) ≥ θ, where θ is a threshold
[threshold formula given only as an image in the original],
mark the unvisited non-leaf nodes and expand them;
4) circulating the above operations, and finishing the construction of the prefix tree after all the groups of users in the first part participate in the construction or no nodes needing to be expanded; adding all character strings corresponding to the leaf nodes with the value of '&' into an alternative set;
5) and applying a wheel mechanism to the user data of which the second part does not participate in the prefix tree construction to obtain the frequency estimation of each character string in the alternative set:
(1) a user side:
a. let the number of character strings in the alternative set be d, and number the character strings in the alternative set from 0;
b. establishing a data array for each user; traversing each character string owned by the user, and if a certain character string is just the character string in the alternative set, adding the serial number of the character string in the alternative set into the data of the user;
c. the user data is perturbed by the randomizer of the wheel mechanism, which randomly extracts one or more samples as output; the result of the randomizer is an array of length d, with the ith element being 1 if i is extracted as a sample and 0 otherwise.
d. The user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of the user are extracted as samples; obtaining frequency estimation of the whole data domain through a decoder of a wheel mechanism according to the statistical data, namely frequency distribution of each character string in the alternative set;
for the estimated frequency count c(v) of each character string, the more accurate frequency estimate c'(v) of the alternative-set character strings is calculated as follows:
[formula given only as an image in the original]
where λ is the truncation ratio of the users.
6) Sort the character strings by their frequency estimates and finally select the top k most frequent character strings.
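Steps 5)(1)b and 6) reduce to mapping each owned string to its number in the alternative set and, after decoding and correction, keeping the k entries with the largest estimated frequency; a sketch under those assumptions (helper names are ours):

```python
def user_candidate_data(user_strings, alternative_set):
    # Step 5) (1) b: each owned string contributes its alternative-set
    # number once per occurrence; strings outside the set are dropped.
    index = {s: i for i, s in enumerate(alternative_set)}
    return [index[s] for s in user_strings if s in index]

def top_k(frequency_estimates, k):
    # Step 6): sort by estimated frequency, keep the k most frequent.
    return sorted(frequency_estimates, key=frequency_estimates.get,
                  reverse=True)[:k]

candidates = ['the', 'cat', 'dog', 'hello']
print(user_candidate_data(['cat', 'cat', 'fish'], candidates))  # [1, 1]
print(top_k({'the': 90.2, 'cat': 12.5, 'dog': 40.1, 'hello': 3.3}, 2))
# ['the', 'dog']
```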
The noise addition is done by means of a random function f of the wheel mechanism, where f satisfies ε-LDP if and only if, for any two input values t1 and t2 and any output value t*, the constraint Pr[f(t1) = t*] ≤ e^ε · Pr[f(t2) = t*] holds;
wherein ε is the privacy budget, representing the strength of privacy protection; ε-LDP guarantees that, from the perturbed tuple t*, the data collector cannot infer with a probability ratio exceeding e^ε whether the original tuple was t1 or t2; this means that the smaller ε is, the stronger the privacy protection.
The wheel mechanism maps data to one or more points on the cycle wheel and designs a calibrated probability distribution satisfying ε-LDP based on those points; according to the resulting probability distribution, a value is sampled from the wheel as output data. The mechanism requires only O(log2(e^ε + 1)) communication; since the mapping process is done by a user-specific hash function, the computational overhead is O(m), where m is the number of character strings owned by each user. Through the designed perturbation mechanism, the server can obtain an unbiased estimate within an optimized error range.
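The patent does not spell out the wheel mechanism's sampler, so as an illustration of the ε-LDP constraint itself we use a simpler stand-in, k-ary (generalized) randomized response: keep the true value with probability p = e^ε/(e^ε + d − 1), otherwise report a uniformly random other value. This is not the wheel mechanism, only another mechanism satisfying the same Pr[f(t1)=t*] ≤ e^ε·Pr[f(t2)=t*] bound:

```python
import math
import random

def k_rr(value, domain_size, epsilon, rng=random):
    # Generalized randomized response over the domain {0, ..., d-1}.
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + domain_size - 1)
    if rng.random() < p_keep:
        return value
    other = rng.randrange(domain_size - 1)  # uniform over the d-1 other values
    return other if other < value else other + 1

# The ratio of the largest to the smallest output probability is exactly
# e^epsilon, so the epsilon-LDP constraint holds with equality at worst.
eps, d = 1.0, 16
p = math.exp(eps) / (math.exp(eps) + d - 1)
q = (1 - p) / (d - 1)
assert p / q <= math.exp(eps) + 1e-9
```

The same ratio argument is what any ε-LDP randomizer, including the wheel mechanism's calibrated distribution, must satisfy.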
The wheel mechanism targets categorical data or set-valued data, so the frequency advantage of repeated data cannot be reflected directly. Therefore, an encoding strategy for the repeated data is adopted:
(1) a user side:
for a group of user data, encoding is carried out according to the maximum number of repetitions in a single user's data, with multiple occurrences of the same data regarded as different data; let the maximum repetition count of a single datum be μ, the size of the original data domain C be d, and the new data domain be C'; in C', with i and j counted from 0, the (i·μ+j)th data represents the jth occurrence of the ith data in C; the size of C' is therefore d·μ; for each datum of the user, if it is the jth occurrence of the ith data, change it from i to (i·μ+j); the encoded user data is processed by the randomizer of the wheel mechanism, and the result is submitted to the server.
In the example shown in FIG. 6, assume the original data domain size d is 4, the maximum number of repetitions μ is 6, and the user data is {1, 1, 2, 2, 2, 2, 3, 3}. With i_j denoting the jth occurrence of data i, the encoded data of this user can be represented as {1_0, 1_1, 2_0, 2_1, 2_2, 2_3, 3_0, 3_1}, i.e., encoded as {6, 7, 12, 13, 14, 15, 18, 19}.
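The FIG. 6 encoding can be reproduced directly; this sketch (function name ours) maps the jth occurrence of value i to slot i·μ + j:

```python
from collections import Counter

def encode_repeats(user_data, mu):
    # The j-th occurrence of value i (j counted from 0) becomes the
    # distinct element i * mu + j of the enlarged domain C'.
    seen = Counter()
    encoded = []
    for i in user_data:
        j = seen[i]
        assert j < mu, "more repetitions than the declared maximum mu"
        encoded.append(i * mu + j)
        seen[i] += 1
    return encoded

print(encode_repeats([1, 1, 2, 2, 2, 2, 3, 3], mu=6))
# [6, 7, 12, 13, 14, 15, 18, 19]
```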
(2) The server side:
the frequency estimates obtained by the decoder of the wheel mechanism are restored to frequency estimates for the original data domain; let c'(v) be the frequency estimates derived by the decoder and c(v_i) the frequency estimate of the ith data in the original data domain; then
c(v_i) = Σ_{j=0}^{μ−1} c'(v_{i·μ+j})
Therefore, the frequency estimation of the original data domain can be obtained, and the next operation is carried out.
The example shown in fig. 7 reduces the frequency estimates derived by the decoder back to frequency estimates for the original data domain. Let μ = 3 and d = 5, and let the frequency estimates from the decoder be {55.8611, 13.8547, 5.4535, -2.94781, 5.45346, 22.256, 22.256, 39.0585, 13.8547, 13.8547, 30.6573, -2.94781, -2.94781, -2.94781, -2.94781}. By the above calculation, the frequencies of the original data domain are estimated as {75.1693, 24.76165, 75.1692, 41.56419, -8.84343}.
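The restoration in FIG. 7 is consistent with summing each original value's μ consecutive slots; a sketch (function name ours) that reproduces the listed numbers:

```python
def restore_original_domain(slot_estimates, d, mu):
    # Collapse the d * mu encoded-slot frequency estimates back to the
    # d original values by summing each value's mu slots.
    assert len(slot_estimates) == d * mu
    return [sum(slot_estimates[i * mu:(i + 1) * mu]) for i in range(d)]

decoder_out = [55.8611, 13.8547, 5.4535, -2.94781, 5.45346, 22.256,
               22.256, 39.0585, 13.8547, 13.8547, 30.6573, -2.94781,
               -2.94781, -2.94781, -2.94781]
print(restore_original_domain(decoder_out, d=5, mu=3))
# approximately [75.1693, 24.76165, 75.1692, 41.56419, -8.84343]
```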
Compared with the prior art, the mining method for the frequent character strings of the user multi-data scene has the following beneficial effects:
1. on the basis of the PrivTrie frequent character string mining technology, an optimized random response mechanism is replaced by a wheel mechanism, and consistency implementation based on the optimized random response mechanism is removed. The PrivTrie with the wheel mechanism as the noise adding method can complete the frequent character string mining of a plurality of character string scenes of a user and has higher accuracy.
2. The communication cost and the calculation overhead of the wheel mechanism are small, and by designing a disturbance mechanism, no matter the data of the user is set value data or classified data, the server side can obtain unbiased estimation in an optimized error range, and the method is suitable for a scene that the user has a plurality of character strings.
3. A coding strategy for repeated data is designed, the strategy solves the problem that the character strings owned by the user are repeated, and the strategy is suitable for both the process of constructing the self-adaptive prefix tree and the process of estimating the frequency of the character strings in the alternative set.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (6)

1. The mining method of the frequent character strings facing to the multi-data scene of the user is characterized by comprising the following steps:
1) dividing users: split the users into two parts according to a truncation ratio, one part used to construct an adaptive prefix tree and the other to enhance the consistency of node support values; then randomly divide the first part of the users into several equal-size groups, used to construct the adaptive prefix tree in each round;
2) initializing a root node, and constructing a self-adaptive prefix tree from top to bottom, wherein each round of the process of constructing the prefix tree is as follows:
(1) a user side:
a. setting the number of all non-accessed non-leaf nodes in the prefix tree as d, and numbering the nodes from 0; the path from the child node of the root node to each of these nodes constitutes a prefix, whereby the number of prefixes corresponding to the above nodes is also d;
b. establishing a null array for each user to store the data of the user; traversing all the character strings owned by the user, if the prefix of a certain character string is just one of the d prefixes, adding the serial number of the corresponding node of the prefix into a data array established for the user;
c. the data of the user is subjected to noise addition by using a randomizer of a wheel mechanism, and the randomizer randomly extracts one or more samples as output;
d. the user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of a group of users are extracted as samples; the decoder of the wheel mechanism obtains the frequency estimate of the whole data domain, namely the frequency distribution of the prefixes, from the statistics. If a node is not '&' and the frequency estimate of its corresponding prefix satisfies c'(v) ≥ θ, where θ is a threshold
[threshold formula given only as an image in the original],
mark the unvisited non-leaf nodes and expand them;
3) loop the operations in 2); when all groups of users in the first part have participated in construction or no nodes remain to be expanded, the construction of the prefix tree ends, and all character strings corresponding to leaf nodes labelled '&' are added to an alternative set;
4) and applying a wheel mechanism to the user data of which the second part does not participate in the prefix tree construction to obtain the frequency estimation of each character string in the alternative set:
(1) a user side:
a. let the number of character strings in the alternative set be d, and number the character strings in the alternative set from 0;
b. establishing a data array for each user; traversing each character string owned by the user, and if a certain character string is just the character string in the alternative set, adding the serial number of the character string in the alternative set into the data of the user;
c. the user data is subjected to noise addition by using a randomizer of a wheel mechanism, and the randomizer randomly extracts one or more samples as output;
d. the user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of the user are extracted as samples; obtaining frequency estimation of the whole data domain through a decoder of a wheel mechanism according to the statistical data, namely frequency distribution of each character string in the alternative set;
5) for the estimated frequency count c(v) of each character string, calculate the final alternative-set frequency estimate c'(v) as follows:
[formula given only as an image in the original]
where λ is the truncation ratio of the user.
6) Sort the character strings by their frequency estimates and finally select the k most frequent character strings.
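Step 6) is a sort-and-truncate over the corrected estimates; a minimal sketch:

```python
def top_k_strings(freq_estimates, k):
    """Step 6): sort the candidate strings by their corrected frequency
    estimates c'(v), descending, and keep the k most frequent."""
    ranked = sorted(freq_estimates.items(), key=lambda kv: -kv[1])
    return [s for s, _ in ranked[:k]]
```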
2. The method for mining frequent character strings for user multi-data scenarios as claimed in claim 1, wherein the method satisfies the localized differential privacy (LDP) model; its noise-addition process is carried out by a random function f, and f satisfies ε-LDP if and only if, for any two input values t1 and t2 and any output value t*, the constraint Pr[f(t1) = t*] ≤ e^ε · Pr[f(t2) = t*] holds;
wherein ε is the privacy budget and represents the strength of privacy protection; ε-LDP guarantees that, from the noised tuple t*, the data collector cannot infer with a probability ratio exceeding e^ε whether the original tuple is t1 or t2; the method uses a wheel mechanism, which maps the data onto one or more points of a cycle wheel, calibrates from those points a probability distribution satisfying ε-LDP, and samples a value from the wheel according to the resulting distribution as the output data.
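The ε-LDP constraint can be checked mechanically for any mechanism given as a table of output probabilities. The sketch below verifies it for binary randomized response, a standard illustrative mechanism used here for concreteness; it is not the patent's wheel mechanism.

```python
import math

def satisfies_ldp(prob, eps, tol=1e-9):
    """Check Pr[f(t1)=t*] <= e^eps * Pr[f(t2)=t*] for all inputs t1, t2
    and all outputs t*, where prob[t][t_star] tabulates the mechanism.
    A small tolerance absorbs floating-point rounding."""
    inputs = list(prob)
    outputs = {o for t in inputs for o in prob[t]}
    return all(prob[t1].get(o, 0.0) <= math.exp(eps) * prob[t2].get(o, 0.0) + tol
               for t1 in inputs for t2 in inputs for o in outputs)

# Binary randomized response: keep the true bit with probability
# e^eps / (e^eps + 1), flip it otherwise -- this meets eps-LDP with equality.
eps = math.log(3.0)                        # e^eps = 3
p = math.exp(eps) / (math.exp(eps) + 1)    # keep-probability 0.75
rr = {0: {0: p, 1: 1 - p}, 1: {0: 1 - p, 1: p}}
```

A deterministic mechanism such as `{0: {0: 1.0}, 1: {1: 1.0}}` fails the check for any finite ε, since a zero-probability output on one input cannot bound a certain output on the other.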
3. The method as claimed in claim 1, wherein in step 2), (1) c, the output of said randomizer is an array of length d; if i is drawn as a sample, the i-th element of the array is 1, and otherwise it is 0.
4. The method as claimed in claim 1, wherein in step 4), (1) c, the output of said randomizer is an array of length d; if i is drawn as a sample, the i-th element of the array is 1, and otherwise it is 0.
5. The method as claimed in claim 1, wherein in step 4), (2), the estimated frequency c(v) of each character string is corrected as follows to obtain the final frequency estimate c'(v):
Figure FDA0003397591330000031
where λ is the truncation ratio for the user.
6. The mining method of frequent character strings for user multiple data scenarios according to claim 1, wherein an encoding strategy for repeated data is constructed:
(1) a user side:
for a group of user data, encoding is performed according to the maximum number of repetitions of a single user's data, and multiple occurrences of the same data are treated as distinct data; let the maximum repetition count of a single datum be μ, the size of the original data domain C be d, and the new data domain be C'; in C', with i and j counted from 0, the (i·μ+j)-th datum represents the j-th occurrence of the i-th datum of C; the size of C' is therefore d·μ; for each datum of the user, if it is the j-th occurrence of the i-th datum, it is recoded from i to (i·μ+j); the encoded user data is processed by the randomizer of the wheel mechanism, and the result is submitted to the server;
(2) the server side:
the frequency estimates obtained by the decoder of the wheel mechanism are restored to frequency estimates for the original data domain; let the frequency estimate derived from the decoder be c(v'), and let c(vi) denote the frequency estimate of the i-th datum in the original data domain; then
Figure FDA0003397591330000041
In this way the frequency estimates for the original data domain are obtained, and the subsequent operations proceed.
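Claim 6's recoding can be sketched as follows, under the stated assumptions that values are integers 0..d-1 and that the omitted restoration formula sums, for each original value, the estimates of its μ occurrence slots (the natural reading of the surrounding text, not a confirmed reconstruction).

```python
def encode_repeats(user_data, mu):
    """User side (claim 6): the j-th occurrence (j counted from 0) of
    value i is recoded as i*mu + j, so repeats of one value become
    distinct points in the enlarged domain C' of size d*mu."""
    seen = {}
    encoded = []
    for i in user_data:
        j = seen.get(i, 0)        # how many times value i was seen so far
        encoded.append(i * mu + j)
        seen[i] = j + 1
    return encoded

def restore_frequencies(c_prime, d, mu):
    """Server side (claim 6): fold the d*mu decoder estimates back onto
    the original domain by summing, for each value i, the estimates of
    its mu occurrence slots (assumed reading of the omitted formula)."""
    return [sum(c_prime[i * mu + j] for j in range(mu)) for i in range(d)]
```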
CN202111488643.1A 2021-12-07 2021-12-07 Method for mining frequent character strings facing to user multi-data scene Active CN114153319B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111488643.1A CN114153319B (en) 2021-12-07 2021-12-07 Method for mining frequent character strings facing to user multi-data scene


Publications (2)

Publication Number Publication Date
CN114153319A true CN114153319A (en) 2022-03-08
CN114153319B CN114153319B (en) 2024-06-21

Family

ID=80453234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111488643.1A Active CN114153319B (en) 2021-12-07 2021-12-07 Method for mining frequent character strings facing to user multi-data scene

Country Status (1)

Country Link
CN (1) CN114153319B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100328115A1 (en) * 2009-06-28 2010-12-30 Carsten Binnig Dictionary-based order-preserving string compression for main memory column stores
CN107273526A (en) * 2017-06-26 2017-10-20 云南大学 The greatly sub- frequently co location mode excavation methods in space
CN108475292A (en) * 2018-03-20 2018-08-31 深圳大学 Mining Frequent Itemsets, device, equipment and the medium of large-scale dataset
CN110471957A (en) * 2019-08-16 2019-11-19 安徽大学 Localization difference secret protection Mining Frequent Itemsets based on frequent pattern tree (fp tree)
CN113569286A (en) * 2021-03-26 2021-10-29 东南大学 Frequent item set mining method based on localized differential privacy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lu Guoqing; Zhang Xiaojian; Ding Liping; Li Yanfeng; Liao Xin: "A Frequent Sequential Pattern Mining Method under Differential Privacy", Journal of Computer Research and Development, no. 12, 15 December 2015 (2015-12-15) *
Gong Yi; Hu Yong; Fang Yong; Liu Liang; Pu Wei: "Feature String Mining Technology for Application Software", Information Security and Communications Privacy, no. 12, 10 December 2012 (2012-12-10) *

Also Published As

Publication number Publication date
CN114153319B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN111783875B (en) Abnormal user detection method, device, equipment and medium based on cluster analysis
CN106096066B (en) Text Clustering Method based on random neighbor insertion
CN110362997B (en) Malicious URL (Uniform resource locator) oversampling method based on generation countermeasure network
CN101561813B (en) Method for analyzing similarity of character string under Web environment
Heckman et al. Comparing the shapes of regression functions
CN104750798B (en) Recommendation method and device for application program
US20130167216A1 (en) Cloud identification processing and verification
KR101850993B1 (en) Method and apparatus for extracting keyword based on cluster
CN111866196B (en) Domain name traffic characteristic extraction method, device and equipment and readable storage medium
WO2017211150A1 (en) Processing method and device for storing fingerprint data in library
CN113569286B (en) Frequent item set mining method based on localized differential privacy
CN108280366B (en) Batch linear query method based on differential privacy
CN111814189B (en) Distributed learning privacy protection method based on differential privacy
CN113422695B (en) Optimization method for improving robustness of topological structure of Internet of things
CN110807547A (en) Method and system for predicting family population structure
CN114662157A (en) Block compressed sensing indistinguishable protection method and device for social text data stream
CN113407986A (en) Singular value decomposition-based frequent item set mining method for local differential privacy protection
CN109801073A (en) Risk subscribers recognition methods, device, computer equipment and storage medium
CN113609763A (en) Uncertainty-based satellite component layout temperature field prediction method
CN106339293B (en) A kind of log event extracting method based on signature
Wang et al. Locally Private Set-valued Data Analyses: Distribution and Heavy Hitters Estimation
CN103164533B (en) Complex network community detection method based on information theory
CN114153319B (en) Method for mining frequent character strings facing to user multi-data scene
CN117119535A (en) Data distribution method and system for mobile terminal cluster hot spot sharing
CN112348041A (en) Log classification and log classification training method and device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant