WO2022166676A1 - Method and apparatus for estimating word segmentation frequency in differential privacy-protected data - Google Patents
Method and apparatus for estimating word segmentation frequency in differential privacy-protected data
- Publication number
- WO2022166676A1 (PCT/CN2022/073677)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- grams
- word
- node
- candidate
- frequency
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Definitions
- One or more embodiments of this specification relate to the technical field of data mining, and in particular, to a method and apparatus for estimating the frequency of word segmentation in differential privacy-preserving data.
- The text information (such as messages, chat records, and search records) entered or viewed by users through terminal devices can directly or indirectly reflect users' characteristics and preferences. Such text information is extremely valuable for data mining and analysis, but it also involves users' personal privacy. Therefore, local differential privacy processing is typically performed on the terminal device on the text information the user inputs or views, producing differential privacy-protected data that is reported to a server; the server then estimates the word segmentation frequency (the number of occurrences of each word segment in the text) from the protected data. In this process, how to estimate word segmentation frequency efficiently and reasonably with a small amount of calculation becomes particularly important in the field of data mining.
- One or more embodiments of this specification provide a method and apparatus for estimating the frequency of word segmentation in differential privacy-protected data.
- A method for estimating word segmentation frequency in differential privacy-protected data is provided, applied to a server, including: acquiring pieces of word segmentation information reported by terminal devices and processed by local differential privacy, where any piece of word segmentation information corresponds to a word segment and includes a target number representing the number of word units contained in the word segment, the target number being less than or equal to a preset value N; dividing the word segmentation information into N groups, so that the pieces of word segmentation information in the same group correspond to the same target number; determining, for each group of word information, a group of estimated data representing an unbiased estimate of word segmentation frequency; and, based on the groups of estimated data, generating layer by layer the nodes of each layer of a prefix tree for recording word segmentation frequency. The nth layer of nodes is generated by: obtaining each (n-1)-gram segment represented by each node in the (n-1)th layer, where the (n-1)-gram segment represented by any node in the (n-1)th layer is formed by sequentially arranging the word units corresponding to the nodes on the path from the root node to that node; determining, based on the (n-1)-gram segments, multiple candidate n-gram segments for the nth-layer nodes; calculating frequency significance distribution information of the candidate n-gram segments based on the nth group of estimated data, which corresponds to target number n; and, based on the frequency significance distribution information, selecting several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes, and using the nodes in the nth layer to record the frequencies of the n-gram segments they represent; 1 ≤ n ≤ N.
- the root node of the prefix tree is a level 0 node, and the level 0 node represents a null character.
- Determining the multiple candidate n-gram segments for the nth-layer nodes based on the respective (n-1)-gram segments includes: using each (n-1)-gram segment as a prefix and combining it with each preset word unit in a preset dictionary, and determining the multiple n-gram segments thus formed as the multiple candidate n-gram segments.
- Calculating the frequency significance distribution information of the candidate n-gram segments based on the nth group of estimated data, which corresponds to target number n, includes: calculating the frequency of each candidate n-gram segment based on the nth group of estimated data; calculating, based on the respective frequencies, the variance corresponding to each candidate n-gram segment; and calculating, based on the respective variances, the frequency significance distribution information of the candidate n-gram segments.
- Calculating the frequency significance distribution information of the candidate n-gram segments based on the respective variances includes: calculating, based on the respective variances, the z value corresponding to each candidate n-gram segment; and calculating, based on the respective z values, the p value corresponding to each candidate n-gram segment as the frequency significance distribution information of the candidate n-gram segments.
- Selecting several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes then includes: selecting, based on the respective p values, several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes.
- Selecting several candidate n-gram segments based on the respective p values as the n-gram segments represented by the nth-layer nodes includes: arranging the respective p values in ascending order; taking the largest p value satisfying a preset condition as a target p value, where a p value satisfies the preset condition if it is less than or equal to its corresponding target result, the target result being the product of the p value's sequence number in the arrangement and the preset threshold set for the nth layer, divided by the number of candidate n-gram segments; and selecting the candidate n-gram segments corresponding to the p values smaller than the target p value as the n-gram segments represented by the nth-layer nodes.
- The method further includes: using each node in the nth layer to record the variance and p value of the n-gram segment it represents.
- Any piece of word segmentation information further includes a target vector representing the word segment, and the target vector has been processed by local differential privacy.
- The target vector representing the word segment is obtained in the following manner: a hash function is selected from a plurality of preset hash functions as the target hash function; the target hash function is used to calculate the target hash value of the word segment;
- and the target vector is determined based on the target hash value in a manner that satisfies differential privacy.
- An apparatus for estimating word segmentation frequency in differential privacy-protected data is provided, applied to a server, including: an acquisition module for acquiring pieces of word segmentation information reported by terminal devices and processed by local differential privacy, where any piece of word segmentation information corresponds to a word segment and includes a target number representing the number of word units contained in the word segment, the target number being less than or equal to a preset value N; a grouping module for dividing the word segmentation information into N groups, so that the pieces of word segmentation information in the same group correspond to the same target number; a determining module for determining, for each group of word information, a group of estimated data representing an unbiased estimate of word segmentation frequency; and a generating module for generating, layer by layer based on the groups of estimated data, the nodes of each layer of a prefix tree for recording word segmentation frequency.
- The generating module generates the nth layer of nodes in the following way: obtaining each (n-1)-gram segment represented by each node in the (n-1)th layer,
- where the (n-1)-gram segment represented by any node in the (n-1)th layer is formed by sequentially arranging the word units corresponding to the nodes on the path from the root node to that node; determining, based on the (n-1)-gram segments, multiple candidate n-gram segments for the nth-layer nodes; calculating frequency significance distribution information of the candidate n-gram segments based on the nth group of estimated data, which corresponds to target number n; and, based on the frequency significance distribution information, selecting several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes, and using the nodes in the nth layer to record the frequencies of the n-gram segments they represent; 1 ≤ n ≤ N.
- the root node of the prefix tree is a level 0 node, and the level 0 node represents a null character.
- The generating module determines the multiple candidate n-gram segments for the nth-layer nodes based on the respective (n-1)-gram segments in the following manner: using each (n-1)-gram segment as a prefix and combining it with each preset word unit in the preset dictionary, and determining the multiple n-gram segments thus formed as the multiple candidate n-gram segments.
- The generating module calculates the frequency significance distribution information of the candidate n-gram segments based on the nth group of estimated data, which corresponds to target number n, in the following manner: calculating the frequency of each candidate n-gram segment based on the nth group of estimated data; calculating, based on the respective frequencies, the variance corresponding to each candidate n-gram segment; and calculating, based on the respective variances, the frequency significance distribution information of the candidate n-gram segments.
- The generating module calculates the frequency significance distribution information of the candidate n-gram segments based on the respective variances in the following manner: calculating, based on the respective variances, the z value corresponding to each candidate n-gram segment, and calculating, based on the respective z values, the p value corresponding to each candidate n-gram segment as the frequency significance distribution information.
- The generating module selects, based on the respective p values, several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes in the following manner: arranging the respective p values in ascending order; selecting the largest p value satisfying a preset condition as a target p value, where a p value satisfies the preset condition if it is less than or equal to its corresponding target result, the target result being the product of the p value's sequence number in the arrangement and the preset threshold set for the nth layer, divided by the number of candidate n-gram segments; and selecting the candidate n-gram segments corresponding to the p values smaller than the target p value as the n-gram segments represented by the nth-layer nodes.
- the generating module is further configured to: use each node in the nth layer to record the variance and p value of each n-gram segment represented by each node.
- any word segmentation information further includes a target vector representing the word segmentation, and the target vector is processed by local differential privacy.
- the target vector representing the word segmentation is obtained in the following manner: a hash function is selected from a plurality of preset hash functions as the target hash function; the target hash function is used to calculate the target of the word segmentation. Hash value; determine the target vector based on the target hash value in a manner that satisfies differential privacy.
- A computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of the above first aspect.
- An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the method according to any one of the first aspect.
- The technical solutions provided by the embodiments of this specification may have the following beneficial effects: the method and apparatus for estimating word segmentation frequency in differential privacy-protected data acquire the pieces of word segmentation information reported by terminal devices and processed by local differential privacy, divide the word segmentation information into N groups so that the pieces in the same group correspond to the same target number, determine for each group a group of estimated data representing an unbiased estimate of word segmentation frequency, and, based on the groups of estimated data, generate layer by layer the nodes of each layer of a prefix tree for recording word segmentation frequency.
- In the process of generating the nth layer of the prefix tree, the embodiments can select some candidate n-gram segments as the n-gram segments represented by the nth-layer nodes based on the frequency significance distribution information of the candidate n-gram segments, without traversing all n-gram segments composed of preset word units. This not only greatly reduces the amount of calculation and improves computational efficiency, but also makes the n-gram segments selected for the nth-layer nodes more reasonable.
- FIG. 1 is a schematic diagram of a scenario for estimating the frequency of word segmentation in differential privacy protection data according to an exemplary embodiment of the present specification
- FIG. 2 is a flowchart of a method for estimating word segmentation frequency in differential privacy protection data according to an exemplary embodiment of the present specification
- FIG. 3 is a flowchart of another method for estimating the frequency of word segmentation in differential privacy protection data shown in this specification according to an exemplary embodiment
- FIG. 4 is a block diagram of an apparatus for estimating word segmentation frequency in differential privacy protection data according to an exemplary embodiment of the present specification
- FIG. 5 is a schematic diagram of a prefix tree shown in this specification according to an exemplary embodiment
- FIG. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present description.
- The terms first, second, third, etc. may be used in this application to describe various information, but such information should not be limited by these terms; these terms are only used to distinguish information of the same type from each other.
- For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application.
- The word "if" as used herein can be interpreted as "at the time of," "when," or "in response to determining."
- Each word segment is composed of one or more word units (for example, each word segment in this scenario is composed of at most 4 word units).
- the terminal device performs local differential privacy processing on each obtained word segment, obtains each target vector corresponding to each word segment, and generates each corresponding word segment information for each word segment.
- the word segmentation information corresponding to any word segmentation may include the target vector corresponding to the word segmentation and the target number of word units constituting the word segmentation.
- the terminal device reports the obtained word segmentation information to the server.
- The server receives the word segmentation information reported by multiple terminal devices, aggregates the received information, and groups it so that the pieces in the same group correspond to the same target number. For example, the word segmentation information corresponding to word segments consisting of 1 word unit is grouped as the first group of word information, the word segmentation information corresponding to word segments consisting of 2 word units is grouped as the second group of word information, and so on. In this scenario, four groups of word information in total can be obtained, including the third and fourth groups.
- Each group of estimated data, corresponding to each group of word information and representing the unbiased estimate of word segmentation frequency, is determined.
- the first set of estimated data may be determined based on the first set of word information, and the first set of estimated data may represent an unbiased estimate of the frequency of a word segment composed of one word unit.
- the second group of estimated data can be determined based on the second group of word information.
- The second group of estimated data can represent an unbiased estimate of the frequency of word segments composed of 2 word units, and so on.
- The third and fourth groups can be obtained similarly.
- a prefix tree for recording the frequency of word segmentation can be generated and output as a result of word segmentation frequency estimation.
- the root node of the prefix tree may be generated first as a layer 0 node.
- Each node of the nth layer is obtained based on the nth group of estimated data.
- Each node of the nth layer corresponds to an n-gram segment consisting of n word units, and the frequency of the corresponding n-gram segment is recorded in that node.
- Each node of the first layer can be obtained based on the first group of estimated data.
- Each node of the first layer corresponds to a 1-gram segment consisting of 1 word unit, and the frequency of the corresponding 1-gram segment is recorded in that node.
- Each node of the second layer can be obtained based on the second group of estimated data.
- Each node of the second layer corresponds to a 2-gram segment composed of 2 word units, and the frequency of the corresponding 2-gram segment is recorded in that node, and so on.
- In this scenario, the prefix tree also includes the third-layer and fourth-layer nodes.
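As a minimal illustration, the layered prefix tree described in this scenario can be sketched as follows (a hypothetical Python structure; the class and field names are chosen here for clarity and do not appear in the specification):

```python
class TrieNode:
    """One node of the frequency prefix tree described above."""
    def __init__(self, word_unit="", frequency=0.0):
        self.word_unit = word_unit   # the word unit this node appends to the prefix
        self.frequency = frequency   # estimated frequency of the n-gram it represents
        self.children = {}           # word_unit -> TrieNode in the next layer

    def add_child(self, word_unit, frequency):
        child = TrieNode(word_unit, frequency)
        self.children[word_unit] = child
        return child

    def ngram(self, prefix=""):
        """The n-gram segment this node represents: path from root to here."""
        return prefix + self.word_unit

# Layer 0: the root represents the null character.
root = TrieNode()
# Layer 1: a 1-gram segment with its estimated frequency (values illustrative).
a = root.add_child("A", 0.12)
# Layer 2: a 2-gram segment extending "A".
ab = a.add_child("B", 0.05)
```

Each layer-n node thus stores the frequency of the n-gram formed by the word units on its root-to-node path, matching the description above.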
- the method shown in FIG. 2 can be applied to a server, and the server can be implemented as any device, platform, server or device cluster with computing and processing capabilities.
- the method includes the following steps: in step 201, each word segmentation information reported by a terminal device and processed by local differential privacy is obtained.
- the involved terminal device may be any terminal device capable of inputting or viewing text information.
- the terminal device may include, but is not limited to, a mobile terminal device such as a smart phone, a smart wearable device, a tablet computer, a personal digital assistant, a laptop computer, a desktop computer, and the like.
- the server may acquire multiple pieces of word segmentation information reported by multiple terminal devices.
- Any piece of word segmentation information corresponds to a word segment consisting of one or more word units, and may include a target vector and a target number, where the target vector represents the corresponding word segment after local differential privacy processing, and the target number represents
- the number of word units composing the word segment; the target number is less than or equal to the preset value N.
- A preset word segmentation operation may be performed on the text information input or viewed by the user based on a preset dictionary, so as to split the text information into word units from the preset dictionary. Then, multiple word segments are combined from the split word units, each composed of one or more word units; the number of word units constituting each word segment is less than or equal to the preset value N.
- The word unit may be a single character or a word.
- the terminal device performs local differential privacy processing on each word segment to obtain a target vector corresponding to each word segment.
- The target vector corresponding to any word segment can be obtained in the following manner: first, a hash function is randomly selected from a plurality of preset hash functions as the target hash function; then, the target hash function is used to perform hash calculation on the word segment to obtain the target hash value of the word segment; finally, the target vector is determined based on the target hash value in a manner that satisfies differential privacy.
- the character string corresponding to the segmented word is S.
- A hash function H_j may be randomly selected from the k preset hash functions H_1, H_2, ..., H_k as the target hash function, where j is the number corresponding to H_j.
- Then, a random vector v is generated; v takes the value 1 or -1 in each dimension, and the probability that each dimension takes the value -1 is determined by the privacy budget:
- ε is the preset privacy budget, which is used to represent the privacy protection level.
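A client-side sketch of this privatization follows. Since the exact flip-probability formula is not reproduced above, the standard randomized-response probability 1/(e^ε + 1) is assumed here; all names are hypothetical illustrations of the steps (random hash selection, hashing, signed encoding, sign flipping):

```python
import math
import random

def privatize(word, hash_funcs, m, epsilon, rng=random):
    """Client-side sketch: pick a random hash function, encode the word's
    hash bucket as a signed one-hot vector in {+1, -1}^m, then flip each
    coordinate's sign with probability 1 / (e^epsilon + 1) — an assumption
    standing in for the formula referenced above."""
    j = rng.randrange(len(hash_funcs))      # number of the target hash function
    h = hash_funcs[j](word) % m             # target hash value
    v = [-1.0] * m
    v[h] = 1.0                              # signed one-hot encoding
    p_flip = 1.0 / (math.exp(epsilon) + 1.0)
    target_vector = [(-x if rng.random() < p_flip else x) for x in v]
    return j, target_vector

# k = 16 toy hash functions (illustrative only).
hash_funcs = [lambda s, a=a: hash((a, s)) for a in range(16)]
j, vec = privatize("word", hash_funcs, m=64, epsilon=2.0)
```

The terminal device would then report `(j, vec)` together with the target number of the word segment.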
- In step 202, the word segmentation information is divided into N groups, so that the pieces of word segmentation information in the same group correspond to the same target number.
- multiple pieces of word segmentation information from multiple terminal devices can be aggregated and grouped to divide N groups of word information, so that each word segmentation information in the same group corresponds to the same target number.
- For example, if the preset value N is 3, three groups of word information G1, G2, and G3 can be separated, where the target number corresponding to each piece of word segmentation information in group G1 is 1, the target number in group G2 is 2, and the target number in group G3 is 3.
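The grouping of step 202 can be sketched as follows (a hypothetical illustration; field and function names are chosen here):

```python
from collections import defaultdict

def group_by_target_number(reports, N=3):
    """Sketch of step 202: partition reported word segmentation information
    (each piece carrying a target number, i.e. its word-unit count) into
    N groups G1..GN so that each group shares one target number."""
    groups = defaultdict(list)
    for info in reports:
        assert 1 <= info["target_number"] <= N
        groups[info["target_number"]].append(info)
    return [groups[n] for n in range(1, N + 1)]

reports = [
    {"target_number": 1}, {"target_number": 2},
    {"target_number": 1}, {"target_number": 3},
]
G1, G2, G3 = group_by_target_number(reports)
```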
- In step 203, each group of estimated data, corresponding to each group of word information and representing the unbiased estimate of word segmentation frequency, is determined.
- For each group of word information, the estimated data representing the unbiased estimate of word segmentation frequency corresponding to that group can be determined. Since N groups of word information are divided, N groups of estimated data are obtained, and the estimated data of the same group also correspond to the same target number.
- any piece of word segmentation information may include a target vector and the number of targets, wherein the word segmentation information may also include the number j of the target hash function used to obtain the target vector.
- The reference vector corresponding to each piece of word segmentation information in the group can be calculated by a formula involving the following quantities:
- the reference vector corresponding to the i-th piece of word segmentation information, together with a unit vector of the same dimension as the target vector;
- k, the number of preset hash functions;
- c, a constant;
- ε, a preset privacy budget, which is equal to the privacy budget ε involved in the specific implementation provided in step 201.
- The reference vectors in the group are then grouped according to the number of the corresponding target hash function, and the reference vectors corresponding to word segmentation information with the same number are added together to obtain k vectors. Taking these k vectors as row vectors or column vectors and arranging them in ascending order of number yields the target matrix.
- The target matrix is the group of estimated data, representing the unbiased estimate of word segmentation frequency, that corresponds to this group of word information.
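A server-side aggregation along these lines can be sketched as follows. This is a hypothetical illustration: the exact reference-vector formula is not reproduced above, so a generic per-coordinate debiasing `c * x + 0.5` is assumed, and plain nested lists stand in for the k-row target matrix:

```python
def build_target_matrix(group, k, m, c):
    """Sketch of step 203 for one group of word segmentation information:
    compute a reference vector per report (assumed affine debiasing of the
    target vector), and sum reference vectors sharing the same
    hash-function number j into row j of a k x m target matrix."""
    matrix = [[0.0] * m for _ in range(k)]
    for j, target_vector in group:          # each report: (hash number, vector)
        for d, x in enumerate(target_vector):
            matrix[j][d] += c * x + 0.5     # assumed reference-vector form
    return matrix

# Toy group: (hash number j, reported target vector), with k=3, m=4, c=1.5.
group = [(0, [1, -1, -1, -1]), (0, [-1, 1, -1, -1]), (2, [1, 1, -1, -1])]
M = build_target_matrix(group, k=3, m=4, c=1.5)
```

Rows are indexed by hash number, which realizes the "ascending order" arrangement described above.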
- each set of estimated data can be determined in different ways. This embodiment does not limit the specific manner of determining each group of estimated data.
- each layer of nodes of the prefix tree for recording the frequency of word segmentation is generated layer by layer.
- each layer of nodes of the prefix tree for recording the frequency of word segmentation may be generated layer by layer based on each set of estimated data, thereby obtaining the prefix tree. Specifically, first, the root node of the prefix tree is generated, and the root node of the prefix tree is taken as the 0th level node, and the 0th level node represents a null character. Next, starting from the first layer, the nodes of each layer of the prefix tree are generated layer by layer.
- Step a: Obtain each (n-1)-gram segment represented by each node in the (n-1)th layer.
- That is, each (n-1)-gram segment represented by each node in the (n-1)th layer is obtained.
- The (n-1)-gram segment represented by any node in the (n-1)th layer is formed by sequentially arranging the word units corresponding to the nodes on the path from the root node to that node.
- When n = 1, the 0-gram segment represented by the node at the 0th layer, that is, the null character, is obtained.
- Step b: Based on each of the (n-1)-gram segments, determine multiple candidate n-gram segments for the nth-layer nodes.
- a plurality of candidate n-grams for the n-th layer node may be determined based on each n-1-gram. Specifically, each n-1-gram segment may be used as a prefix to be combined with each preset word unit in the preset dictionary, and a plurality of n-gram word segments thus formed may be determined as a plurality of candidate n-gram word segments.
- For example, suppose the (n-1)-gram segments represented by 2 nodes in the (n-1)th layer are x and w respectively (both composed of n-1 word units), and the preset dictionary contains the preset word units A, B, and C.
- With x as the prefix, combining with A, B, and C respectively, the formed n-gram segments are xA, xB, and xC; with w as the prefix,
- the formed n-gram segments are wA, wB, and wC.
- Thus xA, xB, xC, wA, wB, and wC can be determined as the candidate n-gram segments.
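The candidate construction in the example above can be sketched directly (the function name is chosen here for illustration):

```python
def candidate_ngrams(prev_layer_segments, dictionary):
    """Step b sketch: extend every (n-1)-gram segment represented in
    layer n-1 with every preset word unit in the preset dictionary."""
    return [prefix + unit
            for prefix in prev_layer_segments
            for unit in dictionary]

# The example from the text: prefixes x and w, dictionary units A, B, C.
cands = candidate_ngrams(["x", "w"], ["A", "B", "C"])
# cands == ["xA", "xB", "xC", "wA", "wB", "wC"]
```

Because candidates are only grown from surviving (n-1)-grams, the search space never expands to all possible n-grams over the dictionary.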
- Step c: Calculate the frequency significance distribution information of the candidate n-gram segments based on the nth group of estimated data, which corresponds to target number n.
- the frequency saliency distribution information of the candidate n-grams may be calculated based on the nth group of estimated data.
- the number of targets corresponding to the nth group of estimated data is n.
- The frequency significance distribution information of the candidate n-gram segments may be calculated in the following manner. First, the frequency of each candidate n-gram segment may be calculated based on the nth group of estimated data. Referring to the specific implementation provided in step 203, a target matrix is obtained for each group as the estimated data representing the unbiased estimate of word segmentation frequency, and the nth target matrix is the nth group of estimated data. For any candidate n-gram segment D, the above k preset hash functions H_1, H_2, ..., H_k can be used to perform hash calculation on D, and the frequency of D is estimated from the corresponding entries of the nth target matrix.
- Then, the variance corresponding to each candidate n-gram segment may be calculated based on the respective frequencies of the candidate n-gram segments, and the standard deviation corresponding to each variance is obtained.
- Next, the z value corresponding to each candidate n-gram segment is calculated: for any candidate n-gram segment, its frequency is divided by its corresponding standard deviation to obtain its z value.
- Finally, the p value corresponding to each candidate n-gram segment can be calculated based on its z value, as the frequency significance distribution information corresponding to the candidate n-gram segments (the p value corresponding to a sample is the probability of drawing that sample or a sample more extreme than it).
- Here, the p value corresponding to a candidate n-gram segment is the probability that a random variable following the standard normal distribution is greater than the z value corresponding to that segment.
- Alternatively, other frequency-based indicators may be selected as the significance distribution information; for example, the aforementioned z value, or other statistics determined based on the z value, may be used as the significance distribution information.
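The statistics of step c can be sketched in a few lines, assuming (as stated above) a one-sided p value under the standard normal distribution:

```python
import math

def z_and_p(frequency, variance):
    """Step c sketch: z = frequency / standard deviation, and
    p = P(Z > z) for Z ~ N(0, 1), computed via the complementary
    error function: P(Z > z) = 0.5 * erfc(z / sqrt(2))."""
    z = frequency / math.sqrt(variance)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

z, p = z_and_p(frequency=0.0, variance=4.0)
# A zero estimated frequency gives z = 0 and p = 0.5 (not significant).
```

A large positive z (a frequency many standard deviations above zero) yields a small p value, marking the candidate segment as significantly frequent.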
- Step d: Based on the above frequency significance distribution information, select several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes, and use each node in the nth layer to record the frequency of the n-gram segment it represents.
- Specifically, at least one candidate n-gram segment satisfying a certain condition may be selected from the multiple candidate n-gram segments as the n-gram segments represented by the nth-layer nodes, and each node in the nth layer records the frequency of the n-gram segment it represents.
- each node in the nth layer can also be used to record the variance and p-value of each n-gram segment represented by each node.
- For example, suppose the candidate n-gram segments include A, B, C, D, E, and F.
- A, B, and C, which satisfy the condition, can be selected from the candidate n-gram segments as the n-gram segments represented by the nth-layer nodes.
- Accordingly, three nodes a, b, and c of the nth layer are generated, representing A, B, and C respectively.
- node a records the frequency, variance, and p value of A
- node b records the frequency, variance, and p value of B
- node c records the frequency, variance, and p value of C.
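Taken together, steps a through d can be sketched as a layer-by-layer generation loop. This is a hypothetical illustration: `estimate_freq_var` is an assumed callback standing in for the target-matrix estimation of step 203, and selection is simplified to a per-layer p-value cutoff rather than the full target-p-value procedure:

```python
import math

def build_prefix_tree(N, dictionary, estimate_freq_var, gamma):
    """Layer-by-layer sketch of steps a-d. estimate_freq_var(n, seg)
    returns (frequency, variance) from the nth group of estimated data;
    gamma maps layer n -> its preset threshold."""
    layers = {0: [""]}                 # layer 0: the null character
    tree = {"": 0.0}                   # n-gram segment -> recorded frequency
    for n in range(1, N + 1):
        # Step b: candidates = previous-layer segments x dictionary units.
        candidates = [p + u for p in layers[n - 1] for u in dictionary]
        kept = []
        for seg in candidates:
            # Step c: frequency, variance, z value, p value.
            f, var = estimate_freq_var(n, seg)
            z = f / math.sqrt(var)
            p_val = 0.5 * math.erfc(z / math.sqrt(2.0))
            # Step d (simplified): keep significantly frequent segments.
            if p_val < gamma[n]:
                kept.append((seg, f))
        layers[n] = [s for s, _ in kept]
        tree.update(kept)
    return tree

def toy_estimate(n, seg):
    # Hypothetical estimator: only runs of "A" are genuinely frequent.
    return (5.0, 1.0) if set(seg) == {"A"} else (0.0, 1.0)

tree = build_prefix_tree(N=2, dictionary=["A", "B"],
                         estimate_freq_var=toy_estimate,
                         gamma={1: 0.01, 2: 0.01})
```

Only children of surviving nodes are ever scored, which is what avoids traversing all n-grams over the dictionary.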
- In summary, the method for estimating word segmentation frequency in differential privacy-protected data acquires the pieces of word segmentation information reported by terminal devices and processed by local differential privacy, divides the word segmentation information into N groups so that the pieces in the same group correspond to the same target number, determines for each group a group of estimated data representing an unbiased estimate of word segmentation frequency, and, based on the groups of estimated data, generates layer by layer the nodes of each layer of the prefix tree for recording word segmentation frequency.
- In the process of generating the nth layer of the prefix tree, this embodiment can select some candidate n-gram segments as the n-gram segments represented by the nth-layer nodes based on the frequency significance distribution information of the candidate n-gram segments, without traversing all n-gram segments composed of preset word units. This not only greatly reduces the amount of calculation and improves computational efficiency, but also makes the n-gram segments selected for the nth-layer nodes more reasonable.
- the embodiment of FIG. 3 describes the process of selecting several candidate n-grams as the n-grams represented by the nth layer nodes.
- The method can be applied to the server and includes the following steps. In step 301, the p-values corresponding to the candidate n-grams are arranged in ascending order.
- In step 302, the maximum p-value satisfying the preset condition is selected as the target p-value.
- The target result for a p-value is obtained by multiplying the p-value's sequence number in the arrangement by the preset threshold set for the nth layer, and dividing the product by the number of candidate n-grams.
- The p-values corresponding to the candidate n-grams are arranged in ascending order, with p_i denoting the i-th p-value in the arrangement. If p_i ≤ (i/N)·γ_n, where N is the number of candidate n-grams, then p_i satisfies the preset condition. The largest p-value satisfying the preset condition is taken as the target p-value.
- γ_n is the preset threshold set for the nth-layer nodes; generally, the larger n is, the larger γ_n is.
- Each candidate n-gram whose p-value is smaller than the target p-value is selected as an n-gram represented by an nth-layer node.
- In the method for estimating word-segment frequency in differential privacy protection data, the p-values corresponding to the candidate n-grams are arranged in ascending order, the maximum p-value satisfying the preset condition is selected as the target p-value, and each candidate n-gram whose p-value is smaller than the target p-value is selected as an n-gram represented by an nth-layer node.
- Selecting the nth-layer n-grams based on the p-values of the candidate n-grams in this way makes the selection even more reasonable.
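The selection procedure above resembles a Benjamini–Hochberg style step-up test. A sketch under that reading (the function and variable names are assumptions, and the nth-layer threshold is passed in as `gamma_n`):

```python
# Sketch of the p-value selection step: sort the candidates' p-values in
# ascending order, find the largest p_i with p_i <= (i / N) * gamma_n
# (i is the 1-based rank, N the number of candidates), then keep every
# candidate whose p-value is strictly smaller than that target p-value.
def select_candidates(p_values, gamma_n):
    N = len(p_values)
    ordered = sorted(range(N), key=lambda j: p_values[j])
    target = None
    for rank, j in enumerate(ordered, start=1):
        if p_values[j] <= (rank / N) * gamma_n:
            target = p_values[j]        # last hit is the largest qualifying p
    if target is None:
        return []                       # no candidate passes the threshold
    return [j for j in range(N) if p_values[j] < target]

# Indices of the candidates that survive the test.
kept = select_candidates([0.001, 0.40, 0.02, 0.90], gamma_n=0.1)
```

Note that, following the passage above, only candidates with p-values strictly smaller than the target are selected, so the candidate achieving the target p-value itself is excluded.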
- the above method may further include: using each node in the nth layer to record the variance and p value of each n-gram segment represented by each node.
- The application scenario may be as follows: the preset dictionary includes the word units A, B, C, and D, and the server needs to estimate, based on the preset dictionary, the frequencies of 1-grams, 2-grams, and 3-grams in the differential privacy protection data.
- The server obtains the word-segment information reported by terminal devices and processed with local differential privacy, and divides it into a first group, a second group, and a third group of word-segment information.
- The target numbers corresponding to the first group of word-segment information are all 1.
- The target numbers corresponding to the second group of word-segment information are all 2.
- The target numbers corresponding to the third group of word-segment information are all 3.
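The grouping step in this scenario can be sketched as follows; the record layout (a dict with a `target_number` field) is an assumption for illustration:

```python
from collections import defaultdict

# Collect reported records into N groups keyed by their target number,
# so that all records in a group describe n-grams of the same length n.
def group_by_target_number(reports, N):
    groups = defaultdict(list)
    for rec in reports:
        n = rec["target_number"]
        assert 1 <= n <= N, "target number must not exceed the preset value N"
        groups[n].append(rec)
    return groups

reports = [{"target_number": 1, "vector": [0, 1]},
           {"target_number": 2, "vector": [1, 0]},
           {"target_number": 1, "vector": [1, 1]}]
groups = group_by_target_number(reports, N=3)  # groups[1] holds two records
```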
- The first, second, and third groups of estimated data, corresponding respectively to the three groups of word-segment information, each represent an unbiased estimate of word-segment frequency.
- each layer node of the prefix tree used to record the word segmentation frequency is generated layer by layer.
- the root node of the prefix tree can be generated first as the 0th layer node, and the 0th layer node represents a null character.
- The word units A, B, C, and D in the preset dictionary are determined as four candidate 1-grams. The frequency-saliency distribution information corresponding to each candidate 1-gram is calculated, and based on this information A, B, C, and D are selected as the 1-grams represented by the first-layer nodes.
- The child node a of the root node is constructed; node a represents the 1-gram A, and node a is used to record the frequency of the 1-gram A.
- the child node b of the root node is constructed, the node b represents the 1-gram segment B, and the frequency of the 1-gram segment B is recorded by using the node b.
- the child node c of the root node is constructed, the node c represents the 1-gram C, and the frequency of the 1-gram C is recorded by using the node c.
- the child node d of the root node is constructed, the node d represents the 1-gram segment D, and the frequency of the 1-gram segment D is recorded by the node d.
- node a, node b, node c, and node d are the nodes of the first layer of the prefix tree.
- the node bc represents the 2-gram BC, and the frequency of the 2-gram BC is recorded using the node bc.
- the node bd represents the 2-gram BD, and the frequency of the 2-gram BD is recorded by the node bd.
- the node ab, the node ac, the node bc, and the node bd are the nodes of the second layer of the prefix tree.
- the node abb represents the 3-gram ABB, and the frequency of the 3-gram ABB is recorded by the node abb.
- the node acc represents the 3-gram ACC, and the frequency of the 3-gram ACC is recorded by the node acc.
- the node acd represents the 3-gram ACD, and the frequency of the 3-gram ACD is recorded by the node acd.
- the node bdc represents the 3-gram BDC, and the frequency of the 3-gram BDC is recorded by the node bdc.
- the node bdd represents the 3-gram BDD, and the frequency of the 3-gram BDD is recorded by the node bdd.
- the node aba, the node abb, the node acc, the node acd, the node bdc, and the node bdd are the nodes of the third layer of the prefix tree.
- the nodes of each layer of the prefix tree used to record the frequency of word segmentation are generated layer by layer.
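The layer-by-layer construction in this example can be sketched as follows. The `selected` set is a hypothetical stand-in for the outcome of the p-value test (the real method derives it from the estimated data), and the function names are assumptions:

```python
dictionary = ["A", "B", "C", "D"]  # preset word units

# Each layer's candidates extend the previous layer's surviving prefixes by
# every unit of the preset dictionary; only the selected n-grams get nodes.
def next_layer(prev_ngrams, is_selected):
    candidates = [p + u for p in prev_ngrams for u in dictionary]
    return [g for g in candidates if is_selected(g)]

# Hypothetical selection outcome matching the example's tree.
selected = {"A", "B", "C", "D", "AB", "AC", "BC", "BD",
            "ABA", "ABB", "ACC", "ACD", "BDC", "BDD"}
layer1 = next_layer([""], selected.__contains__)    # nodes a, b, c, d
layer2 = next_layer(layer1, selected.__contains__)  # nodes ab, ac, bc, bd
layer3 = next_layer(layer2, selected.__contains__)  # nodes aba ... bdd
```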
- Based on the frequency-saliency distribution information of the candidate n-grams, several candidate n-grams can be selected as the n-grams represented by the nth-layer nodes, without traversing all n-grams composed of the preset word units. This not only greatly reduces the amount of calculation and improves computational efficiency, but also makes the n-grams selected for the nth-layer nodes based on the frequency-saliency distribution information more reasonable.
- the present specification also provides an embodiment of an apparatus for estimating word segmentation frequency in differential privacy-preserving data.
- the apparatus shown in FIG. 5 is applied to a server, and the apparatus may include: an acquiring module 501 , a grouping module 502 , a determining module 503 and a generating module 504 .
- the obtaining module 501 is configured to obtain each word segmentation information reported by the terminal device and processed by local differential privacy. Any word segmentation information corresponds to a word segmentation, and includes a target number representing the number of word units included in the word segmentation, where the target number is less than or equal to a preset value N.
- The grouping module 502 is configured to divide the word-segment information into N groups, so that the word-segment information in the same group corresponds to the same target number.
- the determining module 503 is configured to determine each group of estimated data representing the unbiased estimation of the word segmentation frequency corresponding to each component word information.
- The generation module 504 is used to generate, layer by layer based on each group of estimated data, the nodes of each layer of the prefix tree used to record word-segment frequencies, wherein the generation module 504 generates the nth-layer nodes in the following manner: obtain the n-1-grams represented by the nodes of the (n-1)th layer, where the n-1-gram represented by any node in the (n-1)th layer is formed by sequentially arranging the word units from the root node to that node; based on the n-1-grams, determine multiple candidate n-grams for the nth-layer nodes; based on the nth group of estimated data corresponding to the target number n, calculate the frequency-saliency distribution information of the candidate n-grams; based on the frequency-saliency distribution information, select several candidate n-grams as the n-grams represented by the nth-layer nodes; and use each node in the nth layer to record the frequency of the n-gram it represents, where 1 ≤ n ≤ N.
- the root node of the prefix tree is a node at level 0, and the node at level 0 represents a null character.
- The generation module 504 determines the multiple candidate n-grams for the nth-layer nodes based on the n-1-grams in the following manner: the multiple n-grams formed by using each n-1-gram as a prefix and appending each preset word unit in the preset dictionary are determined as the multiple candidate n-grams.
- The generation module 504 calculates the frequency-saliency distribution information of the candidate n-grams based on the nth group of estimated data corresponding to the target number n in the following manner: based on the nth group of estimated data, the frequency of each candidate n-gram is calculated; based on the frequencies, the variance corresponding to each candidate n-gram is calculated; and based on the variances, the frequency-saliency distribution information of the candidate n-grams is calculated.
- The generation module 504 calculates the frequency-saliency distribution information of the candidate n-grams based on the variances in the following manner: based on the variances, the z-value corresponding to each candidate n-gram is calculated; based on the z-values, the p-value corresponding to each candidate n-gram is calculated as the frequency-saliency distribution information of the candidate n-grams.
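The passage does not give the exact formulas, but a natural reading is a one-sided z-test of each estimated frequency against zero; the sketch below makes that assumption explicit (all names and formulas here are assumptions):

```python
from math import erf, sqrt

# Assumed saliency computation: the z-value scales the estimated frequency by
# the standard deviation of the estimate, and the p-value is the one-sided
# standard-normal tail probability P(Z > z), so a small p-value means the
# frequency is significantly above zero.
def z_value(freq, var):
    return freq / sqrt(var)

def p_value(z):
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))

z = z_value(0.05, 0.0001)  # frequency is 5 standard deviations above zero
p = p_value(z)             # very small p-value -> highly salient candidate
```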
- The generation module 504 selects several candidate n-grams as the n-grams represented by the nth-layer nodes based on the frequency-saliency distribution information in the following manner: based on the p-values, several candidate n-grams are selected as the n-grams represented by the nth-layer nodes.
- The generation module 504 selects several candidate n-grams as the n-grams represented by the nth-layer nodes based on the p-values in the following manner: the p-values are arranged in ascending order, and the maximum p-value satisfying the preset condition is selected as the target p-value. Any p-value satisfying the preset condition is less than or equal to its target result, where the target result is the product of the p-value's sequence number in the arrangement and the preset threshold set for the nth layer, divided by the number of candidate n-grams. Each candidate n-gram whose p-value is smaller than the target p-value is selected as an n-gram represented by an nth-layer node.
- the generating module 504 is further configured to: record the variance and p value of each n-gram segment represented by each node in the nth layer.
- any word segmentation information may further include a target vector representing the word segmentation, and the target vector is processed by local differential privacy.
- The target vector representing the word segment is obtained in the following manner: one hash function is selected from a plurality of preset hash functions as the target hash function, the target hash value of the word segment is calculated using the target hash function, and the target vector is determined based on the target hash value in a manner that satisfies differential privacy.
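One common way to realize such an encoding, sketched below, hashes the word into one of m buckets with a randomly chosen hash function and then randomizes the one-hot bucket vector with a randomized-response style bit flip. The parameters `m`, `num_hashes`, and `epsilon`, the SHA-256-based hash family, and the flipping probability are all assumptions for illustration, not details specified in this passage:

```python
import hashlib
import math
import random

def target_vector(word, m=16, num_hashes=4, epsilon=2.0, rng=None):
    rng = rng or random.Random()
    k = rng.randrange(num_hashes)                    # select the target hash function
    digest = hashlib.sha256(f"{k}:{word}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % m   # target hash value (bucket index)
    one_hot = [1 if i == bucket else 0 for i in range(m)]
    # Randomized response: keep each bit with probability e^(eps/2)/(e^(eps/2)+1).
    p_keep = math.exp(epsilon / 2) / (math.exp(epsilon / 2) + 1)
    noisy = [b if rng.random() < p_keep else 1 - b for b in one_hot]
    return k, noisy                                  # hash index and noisy vector

k, vec = target_vector("ABC", rng=random.Random(0))
```

Knowing which hash function each report used and the flipping probability, the server can debias the aggregated counts to recover unbiased frequency estimates.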
- the above-mentioned apparatus may be preset in the server, or may be loaded into the server by means of downloading or the like.
- Corresponding modules in the above-mentioned apparatus can cooperate with modules in the server to realize the solution of estimating the frequency of word segmentation in the differential privacy protection data.
- One or more embodiments of the present specification further provide a computer-readable storage medium storing a computer program, and the computer program can be used to execute the method for estimating word-segment frequency in differential privacy protection data provided by any of the foregoing embodiments of FIG. 2 to FIG. 3.
- the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and of course, may also include hardware required by other services.
- the processor reads the corresponding computer program from the non-volatile memory into the memory and runs it, forming a device for estimating the frequency of word segmentation in the differential privacy protection data at the logical level.
- One or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware. That is to say, the execution subject of the processing procedures is not limited to logic units, and may also be hardware or logic devices.
- The software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Claims (20)
- 1. A method for estimating word-segment frequency in differential privacy protection data, applied to a server, the method comprising: obtaining word-segment information reported by terminal devices and processed with local differential privacy, wherein any piece of word-segment information corresponds to one word segment and includes a target number representing the number of word units contained in the word segment, the target number being less than or equal to a preset value N; dividing the word-segment information into N groups so that the word-segment information in the same group corresponds to the same target number; determining, for each group of word-segment information, a corresponding group of estimated data representing an unbiased estimate of word-segment frequency; and generating, layer by layer based on the groups of estimated data, the nodes of each layer of a prefix tree used to record word-segment frequencies, wherein generating the nth-layer nodes comprises: obtaining the n-1-grams represented by the nodes of the (n-1)th layer, the n-1-gram represented by any node in the (n-1)th layer being formed by sequentially arranging the word units from the root node to that node; determining, based on the n-1-grams, multiple candidate n-grams for the nth-layer nodes; calculating frequency-saliency distribution information of the candidate n-grams based on the nth group of estimated data corresponding to the target number n; selecting, based on the frequency-saliency distribution information, several candidate n-grams as the n-grams represented by the nth-layer nodes; and using each node in the nth layer to record the frequency of the n-gram it represents, where 1 ≤ n ≤ N.
- 2. The method according to claim 1, wherein the root node of the prefix tree is the node of layer 0, and the layer-0 node represents a null character.
- 3. The method according to claim 1, wherein determining, based on the n-1-grams, the multiple candidate n-grams for the nth-layer nodes comprises: determining, as the multiple candidate n-grams, the multiple n-grams formed by using each n-1-gram as a prefix and appending each preset word unit in a preset dictionary.
- 4. The method according to claim 1, wherein calculating the frequency-saliency distribution information of the candidate n-grams based on the nth group of estimated data corresponding to the target number n comprises: calculating the frequency of each candidate n-gram based on the nth group of estimated data; calculating the variance corresponding to each candidate n-gram based on the frequencies; and calculating the frequency-saliency distribution information of the candidate n-grams based on the variances.
- 5. The method according to claim 4, wherein calculating the frequency-saliency distribution information of the candidate n-grams based on the variances comprises: calculating the z-value corresponding to each candidate n-gram based on the variances; and calculating, based on the z-values, the p-value corresponding to each candidate n-gram as the frequency-saliency distribution information of the candidate n-grams; and wherein selecting, based on the frequency-saliency distribution information, several candidate n-grams as the n-grams represented by the nth-layer nodes comprises: selecting, based on the p-values, several candidate n-grams as the n-grams represented by the nth-layer nodes.
- 6. The method according to claim 5, wherein selecting, based on the p-values, several candidate n-grams as the n-grams represented by the nth-layer nodes comprises: arranging the p-values in ascending order; selecting the maximum p-value satisfying a preset condition as the target p-value, wherein any p-value satisfying the preset condition is less than or equal to its target result, the target result being the product of the p-value's sequence number in the arrangement and a preset threshold set for the nth layer, divided by the number of candidate n-grams; and selecting each candidate n-gram whose p-value is smaller than the target p-value as an n-gram represented by an nth-layer node.
- 7. The method according to any one of claims 4 to 6, further comprising: using each node in the nth layer to record the variance and p-value of the n-gram it represents.
- 8. The method according to claim 1, wherein any piece of word-segment information further includes a target vector representing the word segment, the target vector having been processed with local differential privacy.
- 9. The method according to claim 8, wherein the target vector representing the word segment is obtained by: selecting one hash function from multiple preset hash functions as the target hash function; calculating the target hash value of the word segment using the target hash function; and determining the target vector based on the target hash value in a manner that satisfies differential privacy.
- 10. An apparatus for estimating word-segment frequency in differential privacy protection data, applied to a server, the apparatus comprising: an obtaining module configured to obtain word-segment information reported by terminal devices and processed with local differential privacy, wherein any piece of word-segment information corresponds to one word segment and includes a target number representing the number of word units contained in the word segment, the target number being less than or equal to a preset value N; a grouping module configured to divide the word-segment information into N groups so that the word-segment information in the same group corresponds to the same target number; a determining module configured to determine, for each group of word-segment information, a corresponding group of estimated data representing an unbiased estimate of word-segment frequency; and a generation module configured to generate, layer by layer based on the groups of estimated data, the nodes of each layer of a prefix tree used to record word-segment frequencies, wherein the generation module generates the nth-layer nodes by: obtaining the n-1-grams represented by the nodes of the (n-1)th layer, the n-1-gram represented by any node in the (n-1)th layer being formed by sequentially arranging the word units from the root node to that node; determining, based on the n-1-grams, multiple candidate n-grams for the nth-layer nodes; calculating frequency-saliency distribution information of the candidate n-grams based on the nth group of estimated data corresponding to the target number n; selecting, based on the frequency-saliency distribution information, several candidate n-grams as the n-grams represented by the nth-layer nodes; and using each node in the nth layer to record the frequency of the n-gram it represents, where 1 ≤ n ≤ N.
- 11. The apparatus according to claim 10, wherein the root node of the prefix tree is the node of layer 0, and the layer-0 node represents a null character.
- 12. The apparatus according to claim 10, wherein the generation module determines, based on the n-1-grams, the multiple candidate n-grams for the nth-layer nodes by: determining, as the multiple candidate n-grams, the multiple n-grams formed by using each n-1-gram as a prefix and appending each preset word unit in a preset dictionary.
- 13. The apparatus according to claim 10, wherein the generation module calculates the frequency-saliency distribution information of the candidate n-grams based on the nth group of estimated data corresponding to the target number n by: calculating the frequency of each candidate n-gram based on the nth group of estimated data; calculating the variance corresponding to each candidate n-gram based on the frequencies; and calculating the frequency-saliency distribution information of the candidate n-grams based on the variances.
- 14. The apparatus according to claim 13, wherein the generation module calculates the frequency-saliency distribution information of the candidate n-grams based on the variances by: calculating the z-value corresponding to each candidate n-gram based on the variances; and calculating, based on the z-values, the p-value corresponding to each candidate n-gram as the frequency-saliency distribution information of the candidate n-grams; and wherein the generation module selects, based on the frequency-saliency distribution information, several candidate n-grams as the n-grams represented by the nth-layer nodes by: selecting, based on the p-values, several candidate n-grams as the n-grams represented by the nth-layer nodes.
- 15. The apparatus according to claim 14, wherein the generation module selects, based on the p-values, several candidate n-grams as the n-grams represented by the nth-layer nodes by: arranging the p-values in ascending order; selecting the maximum p-value satisfying a preset condition as the target p-value, wherein any p-value satisfying the preset condition is less than or equal to its target result, the target result being the product of the p-value's sequence number in the arrangement and a preset threshold set for the nth layer, divided by the number of candidate n-grams; and selecting each candidate n-gram whose p-value is smaller than the target p-value as an n-gram represented by an nth-layer node.
- 16. The apparatus according to any one of claims 13 to 15, wherein the generation module is further configured to: use each node in the nth layer to record the variance and p-value of the n-gram it represents.
- 17. The apparatus according to claim 10, wherein any piece of word-segment information further includes a target vector representing the word segment, the target vector having been processed with local differential privacy.
- 18. The apparatus according to claim 17, wherein the target vector representing the word segment is obtained by: selecting one hash function from multiple preset hash functions as the target hash function; calculating the target hash value of the word segment using the target hash function; and determining the target vector based on the target hash value in a manner that satisfies differential privacy.
- 19. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
- 20. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1 to 9 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/275,995 US20240104304A1 (en) | 2021-02-05 | 2022-01-25 | Methods and apparatuses for estimating word segment frequency in differential privacy protection data |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110161186.9A CN112507710B (zh) | 2021-02-05 | 2021-02-05 | 估计差分隐私保护数据中分词频度的方法及装置 |
CN202110161186.9 | 2021-02-05 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022166676A1 true WO2022166676A1 (zh) | 2022-08-11 |
Family
ID=74952724
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/073677 WO2022166676A1 (zh) | 2021-02-05 | 2022-01-25 | 估计差分隐私保护数据中分词频度的方法及装置 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240104304A1 (zh) |
CN (1) | CN112507710B (zh) |
WO (1) | WO2022166676A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507710B (zh) * | 2021-02-05 | 2021-05-25 | 支付宝(杭州)信息技术有限公司 | 估计差分隐私保护数据中分词频度的方法及装置 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280366A (zh) * | 2018-01-17 | 2018-07-13 | 上海理工大学 | 一种基于差分隐私的批量线性查询方法 |
CN109829320A (zh) * | 2019-01-14 | 2019-05-31 | 珠海天燕科技有限公司 | 一种信息的处理方法和装置 |
US10878174B1 (en) * | 2020-06-24 | 2020-12-29 | Starmind Ag | Advanced text tagging using key phrase extraction and key phrase generation |
CN112507710A (zh) * | 2021-02-05 | 2021-03-16 | 支付宝(杭州)信息技术有限公司 | 估计差分隐私保护数据中分词频度的方法及装置 |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110727958B (zh) * | 2019-10-15 | 2023-04-28 | 南京航空航天大学 | 一种基于前缀树的差分隐私轨迹数据保护方法 |
- 2021-02-05: CN — application CN202110161186.9A, patent CN112507710B (zh), status Active
- 2022-01-25: US — application US18/275,995, publication US20240104304A1 (en), status Pending
- 2022-01-25: WO — application PCT/CN2022/073677, publication WO2022166676A1 (zh), status Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280366A (zh) * | 2018-01-17 | 2018-07-13 | 上海理工大学 | 一种基于差分隐私的批量线性查询方法 |
CN109829320A (zh) * | 2019-01-14 | 2019-05-31 | 珠海天燕科技有限公司 | 一种信息的处理方法和装置 |
US10878174B1 (en) * | 2020-06-24 | 2020-12-29 | Starmind Ag | Advanced text tagging using key phrase extraction and key phrase generation |
CN112507710A (zh) * | 2021-02-05 | 2021-03-16 | 支付宝(杭州)信息技术有限公司 | 估计差分隐私保护数据中分词频度的方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN112507710A (zh) | 2021-03-16 |
US20240104304A1 (en) | 2024-03-28 |
CN112507710B (zh) | 2021-05-25 |
Legal Events
- 121 — EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number 22748955; country of ref document: EP; kind code of ref document: A1)
- WWE — WIPO information: entry into national phase (ref document number 18275995; country of ref document: US)
- NENP — Non-entry into the national phase (ref country code: DE)
- WWE — WIPO information: entry into national phase (ref document number 11202305915S; country of ref document: SG)
- 122 — EP: PCT application non-entry in European phase (ref document number 22748955; country of ref document: EP; kind code of ref document: A1)