CN114153319A - User multi-data scene oriented frequent character string mining method - Google Patents

User multi-data scene oriented frequent character string mining method Download PDF

Info

Publication number
CN114153319A
CN114153319A (application CN202111488643.1A; granted as CN114153319B)
Authority
CN
China
Prior art keywords
data
user
frequency
character string
character strings
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111488643.1A
Other languages
Chinese (zh)
Other versions
CN114153319B (en)
Inventor
刘晓琳
王宁
石恬
石佳鹭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202111488643.1A priority Critical patent/CN114153319B/en
Publication of CN114153319A publication Critical patent/CN114153319A/en
Application granted granted Critical
Publication of CN114153319B publication Critical patent/CN114153319B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233Character input methods
    • G06F3/0237Character input methods using prediction or retrieval techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods


Abstract

The invention provides a frequent character string mining method oriented to user multi-data scenarios, comprising the following steps: 1) dividing users: split the users into two parts according to a truncation ratio, one part used to construct an adaptive prefix tree and the other to enhance the consistency of node support values; 2) initialize the root node, construct the adaptive prefix tree from top to bottom, perturb data with the wheel mechanism, and estimate the frequency of the prefix corresponding to each node that is not '&' (the end-of-string symbol); 3) add all character strings corresponding to leaf nodes labelled '&' to an alternative set; 4) apply the wheel mechanism to the data of the second part of users, who did not participate in prefix-tree construction, to obtain a frequency estimate for each character string in the alternative set; 5) obtain more accurate frequency estimates of the alternative-set character strings through calculation, sort the character strings by frequency estimate, and finally select the top k most frequent character strings.

Description

User multi-data scene oriented frequent character string mining method
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a user multi-data scene oriented frequent character string mining method.
Background
Most smartphones now rely on software keyboards for text input, and to make typing easier, mobile operating systems often suggest common words to users. Providing accurate word suggestions requires building a dictionary of words the user is likely to use; building such a dictionary requires mining frequent character strings from the user side. Ideally, the mobile operating system would collect string-usage frequency information from users and submit it to a collector; the collector would then filter out the top k character strings most frequently used, include them in the mobile keyboard dictionary, and push updates to users' devices. Clearly, such direct collection of string-usage data would violate user privacy.
Local Differential Privacy (LDP) has become a powerful privacy standard for collecting sensitive user information without a trusted third party. In the LDP model, each user first perturbs their own data locally and then sends the processed data to a collector; the collector performs statistical analysis on the collected data, obtaining useful analysis results while ensuring that no individual's private information is revealed. Frequent character string mining techniques protect user privacy through the LDP model.
Normally, a user owns more than one character string. At present, algorithms such as IBSL and PrivTrie exist for LDP-compliant frequent string mining. However, because of limitations of their noise-addition methods, for a scenario in which a user owns multiple character strings these algorithms usually extract one string at random from each user's strings for analysis, and this random extraction inevitably introduces error, reducing accuracy.
In addition, in real life the character strings owned by a single user often repeat, and different character strings may share the same prefix; each mining round of PrivTrie collects frequent prefixes of the users' strings, so a user's data repeats. If a prefix or string repeats more often, its frequency should be correspondingly greater.
Disclosure of Invention
The invention aims to solve the above problems in the prior art, and provides a frequent character string mining method for user multi-data scenarios that adopts the wheel mechanism, instead of the optimized random response mechanism, as its noise-addition method, enabling the server to obtain an unbiased distribution over the data domain and thereby improving accuracy.
The purpose of the invention is realized by the following technical scheme: the frequent character string mining method oriented to user multi-data scenarios comprises the following steps:
1) dividing users: split the users into two parts according to a truncation ratio, one part used to construct an adaptive prefix tree and the other to enhance the consistency of node support values; randomly divide the first part of the users into several equal-size groups, used to construct the adaptive prefix tree in each round;
2) initializing a root node, and constructing a self-adaptive prefix tree from top to bottom:
(1) a user side:
a. setting the number of all non-accessed non-leaf nodes in the prefix tree as d, and numbering the nodes from 0; the path from the child node of the root node to each of these nodes constitutes a prefix, whereby the number of prefixes corresponding to the above nodes is also d;
b. establishing a null array for each user to store the data of the user; traversing all the character strings owned by the user, if the prefix of a certain character string is just one of the d prefixes, adding the serial number of the corresponding node of the prefix into a data array established for the user;
c. the data of the user is subjected to noise addition by using a randomizer of a wheel mechanism, and the randomizer randomly extracts one or more samples as output;
d. the user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of a group of users are extracted as samples; obtaining frequency estimation of the whole data domain, namely frequency distribution of the prefixes according to statistical data through a decoder of a wheel mechanism so as to facilitate subsequent pruning and next round construction;
3) if a node is not '&' (the end-of-string symbol) and the frequency estimate of its corresponding prefix satisfies c'(v) ≥ θ, where θ is a threshold
[threshold formula given only as an image in the original],
mark the unvisited non-leaf node and expand it;
4) repeat the above operations; when all groups of users in the first part have participated in construction or no nodes remain to be expanded, the construction of the prefix tree ends; add all character strings corresponding to leaf nodes labelled '&' to an alternative set;
5) and applying a wheel mechanism to the user data of which the second part does not participate in the prefix tree construction to obtain the frequency estimation of each character string in the alternative set:
(1) a user side:
a. let the number of character strings in the alternative set be d, and number the character strings in the alternative set from 0;
b. establishing a data array for each user; traversing each character string owned by the user, and if a certain character string is just the character string in the alternative set, adding the serial number of the character string in the alternative set into the data of the user;
c. the user data is subjected to noise addition by using a randomizer of a wheel mechanism, and the randomizer randomly extracts one or more samples as output;
d. the user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of the user are extracted as samples; obtaining frequency estimation of the whole data domain through a decoder of a wheel mechanism according to the statistical data, namely frequency distribution of each character string in the alternative set;
6) obtain more accurate frequency estimates of the alternative-set character strings through calculation, sort the character strings by frequency estimate, and finally select the top k most frequent character strings.
In the method for mining frequent character strings for user multi-data scenarios, the noise-addition process is completed by means of a random function f of the wheel mechanism, where f satisfies ε-LDP if and only if, for any two input values t1 and t2 and any output value t*, the constraint Pr[f(t1) = t*] ≤ e^ε · Pr[f(t2) = t*] holds;
wherein ε is the privacy budget, representing the strength of privacy protection; ε-LDP guarantees that, from the perturbed tuple t*, the data collector cannot infer with a probability ratio exceeding e^ε whether the original tuple was t1 or t2; this means that the smaller ε is, the stronger the privacy protection.
The wheel mechanism maps data to one or more points on the cycle wheel and designs a calibrated probability distribution satisfying ε-LDP based on those points; according to the resulting probability distribution, a value is sampled from the wheel as output data. The mechanism requires only O(log2(e^ε + 1)) communication; since the mapping process is done by a user-specific hash function, the computational overhead is O(m), where m is the number of character strings owned by each user. Through the designed perturbation mechanism, the server can obtain an unbiased estimate within an optimized error range.
In the method for mining frequent character strings for user multi-data scenarios, in step 2), (1), c, the result of the randomizer is an array of length d; if i is extracted as a sample, the ith element of the array is 1, otherwise it is 0.
In the method for mining frequent character strings for user multi-data scenarios, in step 5), (1), c, the result of the randomizer is an array of length d; if i is extracted as a sample, the ith element of the array is 1, otherwise it is 0.
In the method for mining frequent character strings for user multi-data scenarios, in step 5), (2), for the estimated frequency count c(v) of each character string, the final frequency count c'(v) is calculated as follows:
[formula given only as an image in the original]
where λ is the truncation ratio of the users.
In the mining method for the frequent character strings of the multiple data scenes of the user, an encoding strategy for the repeated data is constructed:
(1) a user side:
for a group of user data, encoding is carried out according to the maximum number of repetitions in a single user's data, with multiple occurrences of the same data regarded as different data; let the maximum repetition count of a single datum be μ, the size of the original data domain C be d, and the new data domain be C'; in C', with i and j counted from 0, the (i·μ+j)th data represents the jth occurrence of the ith data in C; the size of C' is therefore d·μ; for each datum of the user, if it is the jth occurrence of the ith data, change it from i to (i·μ+j); the encoded user data is processed by the randomizer of the wheel mechanism, and the result is submitted to the server;
(2) the server side:
the frequency estimates obtained by the decoder of the wheel mechanism are restored to frequency estimates for the original data domain; let c'(v) be the frequency estimates derived by the decoder and c(v_i) the frequency estimate of the ith data in the original data domain; then
c(v_i) = Σ_{j=0}^{μ−1} c'(v_{i·μ+j})
Therefore, the frequency estimation of the original data domain can be obtained, and the next operation is carried out.
Compared with the prior art, the mining method for the frequent character strings of the user multi-data scene has the following beneficial effects:
1. on the basis of the PrivTrie frequent character string mining technology, an optimized random response mechanism is replaced by a wheel mechanism, and consistency implementation based on the optimized random response mechanism is removed. The PrivTrie with the wheel mechanism as the noise adding method can complete the frequent character string mining of a plurality of character string scenes of a user and has higher accuracy.
2. The communication cost and the calculation overhead of the wheel mechanism are small, and by designing a disturbance mechanism, no matter the data of the user is set value data or classified data, the server side can obtain unbiased estimation in an optimized error range, and the method is suitable for a scene that the user has a plurality of character strings.
3. A coding strategy for repeated data is designed, the strategy solves the problem that the character strings owned by the user are repeated, and the strategy is suitable for both the process of constructing the self-adaptive prefix tree and the process of estimating the frequency of the character strings in the alternative set.
Drawings
FIG. 1 is a PrivTrie flow chart based on the wheel mechanism of the present invention.
Fig. 2 is a flow chart of a user terminal in the process of constructing a prefix tree in the present invention.
Fig. 3 is a flow chart of a server in the prefix tree construction process of the present invention.
Fig. 4 is a flow chart of the user end of the process of estimating the frequency of the character strings of the alternative set in the present invention.
FIG. 5 is a flow chart of a server side of the process of estimating the frequency of the character strings of the alternative set in the present invention.
Fig. 6 is a table of an example of a single user data implementation encoding strategy in the present invention.
FIG. 7 is a table illustrating an example of frequency estimates for restoring the original data field in accordance with the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
as shown in fig. 1, the present invention introduces the wheel mechanism as its noise-addition method. The wheel mechanism maps the data to one or more points on the wheel and then designs a calibrated probability distribution satisfying ε-LDP based on these points. It then samples one or more values from the wheel as output according to that probability distribution. Through this carefully designed random distribution, the server can derive an unbiased distribution over the data domain. Hence, even for a scenario where the user owns multiple data items, the wheel mechanism still allows the server to obtain an unbiased estimate.
The mining method for the frequent character strings of the user multi-data scene comprises the following steps:
1) dividing users: split the users into two parts according to a truncation ratio, one part used to construct an adaptive prefix tree and the other to enhance the consistency of node support values; randomly divide the first part of the users into several equal-size groups, used to construct the adaptive prefix tree in each round;
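As an illustrative sketch only (the patent gives no code), step 1) can be written as follows; the function name, the `num_groups` parameter, and the convention that the truncation ratio λ is the fraction assigned to tree construction are our assumptions:

```python
import random

def divide_users(users, truncation_ratio, num_groups, seed=0):
    # Shuffle, then cut the population at the truncation ratio (lambda):
    # the first part builds the adaptive prefix tree, the second part
    # later refines the alternative-set frequency estimates.
    rng = random.Random(seed)
    shuffled = list(users)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * truncation_ratio)
    first_part, second_part = shuffled[:cut], shuffled[cut:]
    # The first part is split into equal-size groups, one group per round.
    group_size = len(first_part) // num_groups
    groups = [first_part[i * group_size:(i + 1) * group_size]
              for i in range(num_groups)]
    return groups, second_part
```

For example, with 100 users, λ = 0.6 and 3 construction rounds, this yields three groups of 20 users and a second part of 40 users.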
2) initializing a root node, and constructing a self-adaptive prefix tree from top to bottom:
(1) a user side:
a. setting the number of all non-accessed non-leaf nodes in the prefix tree as d, and numbering the nodes from 0; the path from the child node of the root node to each of these nodes constitutes a prefix, whereby the number of prefixes corresponding to the above nodes is also d;
b. establishing a null array for each user to store the data of the user; traversing all the character strings owned by the user, if the prefix of a certain character string is just one of the d prefixes, adding the serial number of the corresponding node of the prefix into a data array established for the user;
c. the user's data is perturbed by the randomizer of the wheel mechanism, which randomly extracts one or more samples as output; the result of the randomizer is an array of length d, with the ith element being 1 if i is extracted as a sample and 0 otherwise.
d. The user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of a group of users are extracted as samples; obtaining frequency estimation of the whole data domain, namely frequency distribution of the prefixes according to statistical data through a decoder of a wheel mechanism so as to facilitate subsequent pruning and next round construction;
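Steps a–b of the user side amount to matching each owned string (with '&' appended as the end-of-string symbol) against the d numbered frontier prefixes; a minimal sketch under that reading, with a hypothetical helper name:

```python
def build_user_data(user_strings, numbered_prefixes):
    # numbered_prefixes: the d prefixes of the unvisited non-leaf nodes,
    # where the index in the list is the node number (numbered from 0).
    data = []
    for s in user_strings:
        terminated = s + '&'  # '&' marks the end of a complete string
        for number, prefix in enumerate(numbered_prefixes):
            if terminated.startswith(prefix):
                data.append(number)
    return data

prefixes = ['ca', 'co', 'do']  # frontier nodes 0, 1, 2
print(build_user_data(['cat', 'cow', 'dog'], prefixes))  # [0, 1, 2]
```

In an actual round the frontier prefixes do not overlap, so each string contributes to at most one node; the resulting array is what the randomizer then perturbs.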
3) if a node is not '&' (the end-of-string symbol) and the frequency estimate of its corresponding prefix satisfies c'(v) ≥ θ, where θ is a threshold
[threshold formula given only as an image in the original],
mark the unvisited non-leaf nodes and expand them;
4) circulating the above operations, and finishing the construction of the prefix tree after all the groups of users in the first part participate in the construction or no nodes needing to be expanded; adding all character strings corresponding to the leaf nodes with the value of '&' into an alternative set;
5) and applying a wheel mechanism to the user data of which the second part does not participate in the prefix tree construction to obtain the frequency estimation of each character string in the alternative set:
(1) a user side:
a. let the number of character strings in the alternative set be d, and number the character strings in the alternative set from 0;
b. establishing a data array for each user; traversing each character string owned by the user, and if a certain character string is just the character string in the alternative set, adding the serial number of the character string in the alternative set into the data of the user;
c. the user data is perturbed by the randomizer of the wheel mechanism, which randomly extracts one or more samples as output; the result of the randomizer is an array of length d, with the ith element being 1 if i is extracted as a sample and 0 otherwise.
d. The user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of the user are extracted as samples; obtaining frequency estimation of the whole data domain through a decoder of a wheel mechanism according to the statistical data, namely frequency distribution of each character string in the alternative set;
for the estimated frequency count c(v) of each character string, the more accurate frequency estimate c'(v) of the alternative-set character strings is calculated as follows:
[formula given only as an image in the original]
where λ is the truncation ratio of the users.
6) Sort the character strings by their frequency estimates and finally select the top k most frequent character strings.
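Steps 5)(1)b and 6) reduce to mapping each owned string to its number in the alternative set and, after decoding and correction, keeping the k entries with the largest estimated frequency; a sketch under those assumptions (helper names are ours):

```python
def user_candidate_data(user_strings, alternative_set):
    # Step 5) (1) b: each owned string contributes its alternative-set
    # number once per occurrence; strings outside the set are dropped.
    index = {s: i for i, s in enumerate(alternative_set)}
    return [index[s] for s in user_strings if s in index]

def top_k(frequency_estimates, k):
    # Step 6): sort by estimated frequency, keep the k most frequent.
    return sorted(frequency_estimates, key=frequency_estimates.get,
                  reverse=True)[:k]

candidates = ['the', 'cat', 'dog', 'hello']
print(user_candidate_data(['cat', 'cat', 'fish'], candidates))  # [1, 1]
print(top_k({'the': 90.2, 'cat': 12.5, 'dog': 40.1, 'hello': 3.3}, 2))
# ['the', 'dog']
```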
The noise addition is done by means of a random function f of the wheel mechanism, where f satisfies ε-LDP if and only if, for any two input values t1 and t2 and any output value t*, the constraint Pr[f(t1) = t*] ≤ e^ε · Pr[f(t2) = t*] holds;
wherein ε is the privacy budget, representing the strength of privacy protection; ε-LDP guarantees that, from the perturbed tuple t*, the data collector cannot infer with a probability ratio exceeding e^ε whether the original tuple was t1 or t2; this means that the smaller ε is, the stronger the privacy protection.
The wheel mechanism maps data to one or more points on the cycle wheel and designs a calibrated probability distribution satisfying ε-LDP based on those points; according to the resulting probability distribution, a value is sampled from the wheel as output data. The mechanism requires only O(log2(e^ε + 1)) communication; since the mapping process is done by a user-specific hash function, the computational overhead is O(m), where m is the number of character strings owned by each user. Through the designed perturbation mechanism, the server can obtain an unbiased estimate within an optimized error range.
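The patent does not spell out the wheel mechanism's sampler, so as an illustration of the ε-LDP constraint itself we use a simpler stand-in, k-ary (generalized) randomized response: keep the true value with probability p = e^ε/(e^ε + d − 1), otherwise report a uniformly random other value. This is not the wheel mechanism, only another mechanism satisfying the same Pr[f(t1)=t*] ≤ e^ε·Pr[f(t2)=t*] bound:

```python
import math
import random

def k_rr(value, domain_size, epsilon, rng=random):
    # Generalized randomized response over the domain {0, ..., d-1}.
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + domain_size - 1)
    if rng.random() < p_keep:
        return value
    other = rng.randrange(domain_size - 1)  # uniform over the d-1 other values
    return other if other < value else other + 1

# The ratio of the largest to the smallest output probability is exactly
# e^epsilon, so the epsilon-LDP constraint holds with equality at worst.
eps, d = 1.0, 16
p = math.exp(eps) / (math.exp(eps) + d - 1)
q = (1 - p) / (d - 1)
assert p / q <= math.exp(eps) + 1e-9
```

The same ratio argument is what any ε-LDP randomizer, including the wheel mechanism's calibrated distribution, must satisfy.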
The wheel mechanism targets categorical data or set-valued data, so the frequency advantage of repeated data cannot be reflected directly. Therefore, an encoding strategy for the repeated data is adopted:
(1) a user side:
for a group of user data, encoding is carried out according to the maximum number of repetitions in a single user's data, with multiple occurrences of the same data regarded as different data; let the maximum repetition count of a single datum be μ, the size of the original data domain C be d, and the new data domain be C'; in C', with i and j counted from 0, the (i·μ+j)th data represents the jth occurrence of the ith data in C; the size of C' is therefore d·μ; for each datum of the user, if it is the jth occurrence of the ith data, change it from i to (i·μ+j); the encoded user data is processed by the randomizer of the wheel mechanism, and the result is submitted to the server.
In the example shown in FIG. 6, assume the original data domain size d is 4, the maximum number of repetitions μ is 6, and the user data is {1, 1, 2, 2, 2, 2, 3, 3}. With i_j denoting the jth occurrence of data i, the encoded data of this user can be represented as {1_0, 1_1, 2_0, 2_1, 2_2, 2_3, 3_0, 3_1}, i.e., encoded as {6, 7, 12, 13, 14, 15, 18, 19}.
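The FIG. 6 encoding can be reproduced directly; this sketch (function name ours) maps the jth occurrence of value i to slot i·μ + j:

```python
from collections import Counter

def encode_repeats(user_data, mu):
    # The j-th occurrence of value i (j counted from 0) becomes the
    # distinct element i * mu + j of the enlarged domain C'.
    seen = Counter()
    encoded = []
    for i in user_data:
        j = seen[i]
        assert j < mu, "more repetitions than the declared maximum mu"
        encoded.append(i * mu + j)
        seen[i] += 1
    return encoded

print(encode_repeats([1, 1, 2, 2, 2, 2, 3, 3], mu=6))
# [6, 7, 12, 13, 14, 15, 18, 19]
```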
(2) The server side:
the frequency estimates obtained by the decoder of the wheel mechanism are restored to frequency estimates for the original data domain; let c'(v) be the frequency estimates derived by the decoder and c(v_i) the frequency estimate of the ith data in the original data domain; then
c(v_i) = Σ_{j=0}^{μ−1} c'(v_{i·μ+j})
Therefore, the frequency estimation of the original data domain can be obtained, and the next operation is carried out.
The example shown in fig. 7 reduces the frequency estimates derived by the decoder back to frequency estimates for the original data domain. Let μ = 3 and d = 5, and let the frequency estimates from the decoder be {55.8611, 13.8547, 5.4535, -2.94781, 5.45346, 22.256, 22.256, 39.0585, 13.8547, 13.8547, 30.6573, -2.94781, -2.94781, -2.94781, -2.94781}. By the above calculation, the frequencies of the original data domain are estimated as {75.1693, 24.76165, 75.1692, 41.56419, -8.84343}.
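The restoration in FIG. 7 is consistent with summing each original value's μ consecutive slots; a sketch (function name ours) that reproduces the listed numbers:

```python
def restore_original_domain(slot_estimates, d, mu):
    # Collapse the d * mu encoded-slot frequency estimates back to the
    # d original values by summing each value's mu slots.
    assert len(slot_estimates) == d * mu
    return [sum(slot_estimates[i * mu:(i + 1) * mu]) for i in range(d)]

decoder_out = [55.8611, 13.8547, 5.4535, -2.94781, 5.45346, 22.256,
               22.256, 39.0585, 13.8547, 13.8547, 30.6573, -2.94781,
               -2.94781, -2.94781, -2.94781]
print(restore_original_domain(decoder_out, d=5, mu=3))
# approximately [75.1693, 24.76165, 75.1692, 41.56419, -8.84343]
```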
Compared with the prior art, the mining method for the frequent character strings of the user multi-data scene has the following beneficial effects:
1. on the basis of the PrivTrie frequent character string mining technology, an optimized random response mechanism is replaced by a wheel mechanism, and consistency implementation based on the optimized random response mechanism is removed. The PrivTrie with the wheel mechanism as the noise adding method can complete the frequent character string mining of a plurality of character string scenes of a user and has higher accuracy.
2. The communication cost and the calculation overhead of the wheel mechanism are small, and by designing a disturbance mechanism, no matter the data of the user is set value data or classified data, the server side can obtain unbiased estimation in an optimized error range, and the method is suitable for a scene that the user has a plurality of character strings.
3. A coding strategy for repeated data is designed, the strategy solves the problem that the character strings owned by the user are repeated, and the strategy is suitable for both the process of constructing the self-adaptive prefix tree and the process of estimating the frequency of the character strings in the alternative set.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (6)

1. The mining method of the frequent character strings facing to the multi-data scene of the user is characterized by comprising the following steps:
1) dividing users: split the users into two parts according to a truncation ratio, one part used to construct an adaptive prefix tree and the other to enhance the consistency of node support values; then randomly divide the first part of the users into several equal-size groups, used to construct the adaptive prefix tree in each round;
2) initializing a root node, and constructing a self-adaptive prefix tree from top to bottom, wherein each round of the process of constructing the prefix tree is as follows:
(1) a user side:
a. setting the number of all non-accessed non-leaf nodes in the prefix tree as d, and numbering the nodes from 0; the path from the child node of the root node to each of these nodes constitutes a prefix, whereby the number of prefixes corresponding to the above nodes is also d;
b. establishing a null array for each user to store the data of the user; traversing all the character strings owned by the user, if the prefix of a certain character string is just one of the d prefixes, adding the serial number of the corresponding node of the prefix into a data array established for the user;
c. the data of the user is subjected to noise addition by using a randomizer of a wheel mechanism, and the randomizer randomly extracts one or more samples as output;
d. the user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of a group of users are extracted as samples; the decoder of the wheel mechanism obtains the frequency estimate of the whole data domain, namely the frequency distribution of the prefixes, from the statistics. If a node is not '&' and the frequency estimate of its corresponding prefix satisfies c'(v) ≥ θ, where θ is a threshold
[threshold formula given only as an image in the original],
mark the unvisited non-leaf nodes and expand them;
3) loop the operations in 2); when all groups of users in the first part have participated in construction or no nodes remain to be expanded, the construction of the prefix tree ends, and all character strings corresponding to leaf nodes labelled '&' are added to an alternative set;
4) and applying a wheel mechanism to the user data of which the second part does not participate in the prefix tree construction to obtain the frequency estimation of each character string in the alternative set:
(1) a user side:
a. let the number of character strings in the alternative set be d, and number the character strings in the alternative set from 0;
b. establishing a data array for each user; traversing each character string owned by the user, and if a certain character string is just the character string in the alternative set, adding the serial number of the character string in the alternative set into the data of the user;
c. the user data is subjected to noise addition by using a randomizer of a wheel mechanism, and the randomizer randomly extracts one or more samples as output;
d. the user submits the output result of the randomizer to the server;
(2) the server side:
the server side counts the times that all elements in the output of the user are extracted as samples; obtaining frequency estimation of the whole data domain through a decoder of a wheel mechanism according to the statistical data, namely frequency distribution of each character string in the alternative set;
5) for the estimated frequency count c(v) of each character string, calculate the final alternative-set frequency estimate c'(v) as follows:
[formula given only as an image in the original]
where λ is the truncation ratio of the user.
6) Sort the character strings by their frequency estimates and finally select the k most frequent character strings.
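Step 6) is a sort-and-truncate over the corrected estimates; a minimal sketch:

```python
def top_k_strings(freq_estimates, k):
    """Step 6): sort the candidate strings by their corrected frequency
    estimates c'(v), descending, and keep the k most frequent."""
    ranked = sorted(freq_estimates.items(), key=lambda kv: -kv[1])
    return [s for s, _ in ranked[:k]]
```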
2. The method for mining frequent character strings for user multi-data scenarios as claimed in claim 1, wherein the method satisfies the localized differential privacy (LDP) model; its noise-addition process is carried out by a random function f, and f satisfies ε-LDP if and only if, for any two input values t1 and t2 and any output value t*, the constraint Pr[f(t1) = t*] ≤ e^ε · Pr[f(t2) = t*] holds;
wherein ε is the privacy budget and represents the strength of privacy protection; ε-LDP guarantees that, from the noised tuple t*, the data collector cannot infer with a probability ratio exceeding e^ε whether the original tuple is t1 or t2; the method uses a wheel mechanism, which maps the data onto one or more points of a cycle wheel, calibrates from those points a probability distribution satisfying ε-LDP, and samples a value from the wheel according to the resulting distribution as the output data.
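The ε-LDP constraint can be checked mechanically for any mechanism given as a table of output probabilities. The sketch below verifies it for binary randomized response, a standard illustrative mechanism used here for concreteness; it is not the patent's wheel mechanism.

```python
import math

def satisfies_ldp(prob, eps, tol=1e-9):
    """Check Pr[f(t1)=t*] <= e^eps * Pr[f(t2)=t*] for all inputs t1, t2
    and all outputs t*, where prob[t][t_star] tabulates the mechanism.
    A small tolerance absorbs floating-point rounding."""
    inputs = list(prob)
    outputs = {o for t in inputs for o in prob[t]}
    return all(prob[t1].get(o, 0.0) <= math.exp(eps) * prob[t2].get(o, 0.0) + tol
               for t1 in inputs for t2 in inputs for o in outputs)

# Binary randomized response: keep the true bit with probability
# e^eps / (e^eps + 1), flip it otherwise -- this meets eps-LDP with equality.
eps = math.log(3.0)                        # e^eps = 3
p = math.exp(eps) / (math.exp(eps) + 1)    # keep-probability 0.75
rr = {0: {0: p, 1: 1 - p}, 1: {0: 1 - p, 1: p}}
```

A deterministic mechanism such as `{0: {0: 1.0}, 1: {1: 1.0}}` fails the check for any finite ε, since a zero-probability output on one input cannot bound a certain output on the other.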
3. The method as claimed in claim 1, wherein in step 2), (1) c, the output of said randomizer is an array of length d; if i is drawn as a sample, the i-th element of the array is 1, and otherwise it is 0.
4. The method as claimed in claim 1, wherein in step 4), (1) c, the output of said randomizer is an array of length d; if i is drawn as a sample, the i-th element of the array is 1, and otherwise it is 0.
5. The method as claimed in claim 1, wherein in step 4), (2), the estimated frequency c(v) of each character string is corrected as follows to obtain the final frequency estimate c'(v):
Figure FDA0003397591330000031
where λ is the truncation ratio for the user.
6. The mining method of frequent character strings for user multiple data scenarios according to claim 1, wherein an encoding strategy for repeated data is constructed:
(1) a user side:
for a group of user data, encoding is performed according to the maximum number of repetitions of a single user's data, and multiple occurrences of the same data are treated as distinct data; let the maximum repetition count of a single datum be μ, the size of the original data domain C be d, and the new data domain be C'; in C', with i and j counted from 0, the (i·μ+j)-th datum represents the j-th occurrence of the i-th datum of C; the size of C' is therefore d·μ; for each datum of the user, if it is the j-th occurrence of the i-th datum, it is recoded from i to (i·μ+j); the encoded user data is processed by the randomizer of the wheel mechanism, and the result is submitted to the server;
(2) the server side:
the frequency estimates obtained by the decoder of the wheel mechanism are restored to frequency estimates for the original data domain; let the frequency estimate derived from the decoder be c(v'), and let c(vi) denote the frequency estimate of the i-th datum in the original data domain; then
Figure FDA0003397591330000041
In this way the frequency estimates for the original data domain are obtained, and the subsequent operations proceed.
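Claim 6's recoding can be sketched as follows, under the stated assumptions that values are integers 0..d-1 and that the omitted restoration formula sums, for each original value, the estimates of its μ occurrence slots (the natural reading of the surrounding text, not a confirmed reconstruction).

```python
def encode_repeats(user_data, mu):
    """User side (claim 6): the j-th occurrence (j counted from 0) of
    value i is recoded as i*mu + j, so repeats of one value become
    distinct points in the enlarged domain C' of size d*mu."""
    seen = {}
    encoded = []
    for i in user_data:
        j = seen.get(i, 0)        # how many times value i was seen so far
        encoded.append(i * mu + j)
        seen[i] = j + 1
    return encoded

def restore_frequencies(c_prime, d, mu):
    """Server side (claim 6): fold the d*mu decoder estimates back onto
    the original domain by summing, for each value i, the estimates of
    its mu occurrence slots (assumed reading of the omitted formula)."""
    return [sum(c_prime[i * mu + j] for j in range(mu)) for i in range(d)]
```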
CN202111488643.1A 2021-12-07 2021-12-07 Method for mining frequent character strings facing to user multi-data scene Active CN114153319B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111488643.1A CN114153319B (en) 2021-12-07 2021-12-07 Method for mining frequent character strings facing to user multi-data scene


Publications (2)

Publication Number Publication Date
CN114153319A true CN114153319A (en) 2022-03-08
CN114153319B CN114153319B (en) 2024-06-21

Family

ID=80453234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111488643.1A Active CN114153319B (en) 2021-12-07 2021-12-07 Method for mining frequent character strings facing to user multi-data scene

Country Status (1)

Country Link
CN (1) CN114153319B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100328115A1 (en) * 2009-06-28 2010-12-30 Carsten Binnig Dictionary-based order-preserving string compression for main memory column stores
CN107273526A (en) * 2017-06-26 2017-10-20 云南大学 The greatly sub- frequently co location mode excavation methods in space
CN108475292A (en) * 2018-03-20 2018-08-31 深圳大学 Mining Frequent Itemsets, device, equipment and the medium of large-scale dataset
CN110471957A (en) * 2019-08-16 2019-11-19 安徽大学 Localization difference secret protection Mining Frequent Itemsets based on frequent pattern tree (fp tree)
CN113569286A (en) * 2021-03-26 2021-10-29 东南大学 Frequent item set mining method based on localized differential privacy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lu Guoqing; Zhang Xiaojian; Ding Liping; Li Yanfeng; Liao Xin: "A Frequent Sequential Pattern Mining Method under Differential Privacy", Journal of Computer Research and Development, no. 12, 15 December 2015 (2015-12-15) *
Gong Yi; Hu Yong; Fang Yong; Liu Liang; Pu Wei: "Feature String Mining Technology for Application Software", Information Security and Communications Privacy, no. 12, 10 December 2012 (2012-12-10) *

Also Published As

Publication number Publication date
CN114153319B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN111783875B (en) Abnormal user detection method, device, equipment and medium based on cluster analysis
CN106096066B (en) Text Clustering Method based on random neighbor insertion
CN110362997B (en) Malicious URL (Uniform resource locator) oversampling method based on generation countermeasure network
CN101561813B (en) Method for analyzing similarity of character string under Web environment
Heckman et al. Comparing the shapes of regression functions
CN104750798B (en) Recommendation method and device for application program
US20130167216A1 (en) Cloud identification processing and verification
KR101850993B1 (en) Method and apparatus for extracting keyword based on cluster
CN111866196B (en) Domain name traffic characteristic extraction method, device and equipment and readable storage medium
WO2017211150A1 (en) Processing method and device for storing fingerprint data in library
CN113569286B (en) Frequent item set mining method based on localized differential privacy
CN108280366B (en) Batch linear query method based on differential privacy
CN111814189B (en) Distributed learning privacy protection method based on differential privacy
CN113422695B (en) Optimization method for improving robustness of topological structure of Internet of things
CN110807547A (en) Method and system for predicting family population structure
CN114662157A (en) Block compressed sensing indistinguishable protection method and device for social text data stream
CN113407986A (en) Singular value decomposition-based frequent item set mining method for local differential privacy protection
CN109801073A (en) Risk subscribers recognition methods, device, computer equipment and storage medium
CN113609763A (en) Uncertainty-based satellite component layout temperature field prediction method
CN106339293B (en) A kind of log event extracting method based on signature
Wang et al. Locally Private Set-valued Data Analyses: Distribution and Heavy Hitters Estimation
CN103164533B (en) Complex network community detection method based on information theory
CN114153319B (en) Method for mining frequent character strings facing to user multi-data scene
CN117119535A (en) Data distribution method and system for mobile terminal cluster hot spot sharing
CN112348041A (en) Log classification and log classification training method and device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant