WO2022166676A1 - Method and apparatus for estimating word segmentation frequency in differential-privacy-protected data (估计差分隐私保护数据中分词频度的方法及装置) - Google Patents

Method and apparatus for estimating word segmentation frequency in differential-privacy-protected data (估计差分隐私保护数据中分词频度的方法及装置)

Info

Publication number
WO2022166676A1
Authority
WO
WIPO (PCT)
Prior art keywords
grams
word
node
candidate
frequency
Prior art date
Application number
PCT/CN2022/073677
Other languages
English (en)
French (fr)
Inventor
吴若凡
石磊磊
陈永环
朱耀伟
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司 (Alipay (Hangzhou) Information Technology Co., Ltd.)
Priority to US18/275,995 (published as US20240104304A1)
Publication of WO2022166676A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Definitions

  • One or more embodiments of this specification relate to the technical field of data mining, and in particular, to a method and apparatus for estimating the frequency of word segmentation in differential privacy-preserving data.
  • text information entered or viewed by users through terminal devices (such as messages, chat records, and search records) can directly or indirectly reflect users' characteristics and preferences, and is therefore extremely valuable for data mining and analysis. However, such text information also involves the user's personal privacy. Therefore, local differential privacy processing is usually performed on the text information on the terminal device to obtain differential privacy protection data, which is reported to a server; the server then estimates the word segmentation frequency (the number of occurrences of each word segment in the text) from the protected data. In this process, how to estimate word segmentation frequency more efficiently and reasonably, with a smaller amount of calculation, becomes particularly important in the field of data mining.
  • one or more embodiments of this specification provide a method and apparatus for estimating the frequency of word segmentation in differential privacy protection data.
  • a method for estimating word segmentation frequency in differential privacy protection data is provided, applied to a server and including: acquiring pieces of word segmentation information that are reported by terminal devices and processed by local differential privacy, where any piece of word segmentation information corresponds to a word segment and includes a target number representing the number of word units contained in the word segment, the target number being less than or equal to a preset value N; dividing the word segmentation information into N groups so that each piece of word segmentation information in the same group corresponds to the same target number; determining, for each group of word information, a corresponding group of estimated data representing an unbiased estimate of word segmentation frequency; and generating, layer by layer based on the groups of estimated data, the nodes of each layer of a prefix tree for recording word segmentation frequency. Generating the nth-layer nodes includes: obtaining each (n-1)-gram segment represented by each node in the (n-1)th layer, where the (n-1)-gram segment represented by any node in the (n-1)th layer is formed by arranging in sequence the word units corresponding to the nodes on the path from the root node to that node; determining, based on the (n-1)-gram segments, multiple candidate n-gram segments for the nth-layer nodes; calculating frequency saliency distribution information of the candidate n-gram segments based on the nth group of estimated data, whose corresponding target number is n; and selecting, based on the frequency saliency distribution information, several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes, and using each node in the nth layer to record the frequency of the n-gram segment it represents; 1 ≤ n ≤ N.
  • the root node of the prefix tree is a level 0 node, and the level 0 node represents a null character.
  • the determining of multiple candidate n-gram segments for the nth-layer nodes based on the respective (n-1)-gram segments includes: using each (n-1)-gram segment as a prefix and combining it with each preset word unit in a preset dictionary, and determining the multiple n-gram segments thus formed as the multiple candidate n-gram segments.
  • calculating the frequency saliency distribution information of the candidate n-gram segments based on the nth group of estimated data, whose corresponding target number is n, includes: calculating, based on the nth group of estimated data, the respective frequencies of the candidate n-gram segments; calculating, based on the respective frequencies, the respective variances corresponding to the candidate n-gram segments; and calculating, based on the respective variances, the frequency saliency distribution information of the candidate n-gram segments.
  • calculating the frequency saliency distribution information of the candidate n-gram segments based on the respective variances includes: calculating, based on the respective variances, the respective z values corresponding to the candidate n-gram segments; and calculating, based on the respective z values, the respective p values corresponding to the candidate n-gram segments as the frequency saliency distribution information of the candidate n-gram segments. Selecting, based on the frequency saliency distribution information, several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes includes: selecting the several candidate n-gram segments based on the respective p values.
  • selecting several candidate n-gram segments based on the respective p values as the n-gram segments represented by the nth-layer nodes includes: arranging the respective p values in ascending order; selecting the largest p value that satisfies a preset condition as the target p value, where a p value satisfies the preset condition if it is less than or equal to its corresponding target result, the target result being the product of the p value's sequence number in the arrangement and a preset threshold set for the nth layer, divided by the number of candidate n-gram segments; and selecting each candidate n-gram segment whose p value is smaller than the target p value as an n-gram segment represented by an nth-layer node.
  • the method further includes: using each node in the nth layer to record the variance and p value of each n-gram segment represented by each node.
  • any word segmentation information further includes a target vector representing the word segmentation, and the target vector is processed by local differential privacy.
  • the target vector representing the word segment is obtained in the following manner: selecting a hash function from multiple preset hash functions as a target hash function; using the target hash function to calculate a target hash value of the word segment; and determining the target vector based on the target hash value in a manner that satisfies differential privacy.
  • an apparatus for estimating the frequency of word segmentation in differential privacy protection data is provided, applied to a server and including: an acquisition module for acquiring pieces of word segmentation information reported by terminal devices and processed by local differential privacy, where any piece of word segmentation information corresponds to a word segment and includes a target number representing the number of word units contained in the word segment, the target number being less than or equal to a preset value N; a grouping module for dividing the word segmentation information into N groups so that each piece of word segmentation information in the same group corresponds to the same target number; a determining module for determining, for each group of word information, a corresponding group of estimated data representing an unbiased estimate of word segmentation frequency; and a generating module for generating, layer by layer based on the groups of estimated data, the nodes of each layer of a prefix tree for recording word segmentation frequency. The generating module generates the nth-layer nodes in the following way: obtaining each (n-1)-gram segment represented by each node in the (n-1)th layer, where the (n-1)-gram segment represented by any node in the (n-1)th layer is formed by arranging in sequence the word units corresponding to the nodes on the path from the root node to that node; determining, based on the (n-1)-gram segments, multiple candidate n-gram segments for the nth-layer nodes; calculating frequency saliency distribution information of the candidate n-gram segments based on the nth group of estimated data, whose corresponding target number is n; and selecting, based on the frequency saliency distribution information, several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes, and using each node in the nth layer to record the frequency of the n-gram segment it represents; 1 ≤ n ≤ N.
  • the root node of the prefix tree is a level 0 node, and the level 0 node represents a null character.
  • the generation module determines multiple candidate n-gram segments for the nth-layer nodes based on the respective (n-1)-gram segments in the following manner: using each (n-1)-gram segment as a prefix and combining it with each preset word unit in the preset dictionary, and determining the multiple n-gram segments thus formed as the multiple candidate n-gram segments.
  • the generation module calculates the frequency saliency distribution information of the candidate n-gram segments, based on the nth group of estimated data whose corresponding target number is n, in the following manner: calculating, based on the nth group of estimated data, the respective frequencies of the candidate n-gram segments; calculating, based on the respective frequencies, the respective variances corresponding to the candidate n-gram segments; and calculating, based on the respective variances, the frequency saliency distribution information.
  • the generation module calculates the frequency saliency distribution information of the candidate n-gram segments based on the respective variances in the following manner: calculating, based on the respective variances, the respective z values corresponding to the candidate n-gram segments, and calculating, based on the respective z values, the respective p values corresponding to the candidate n-gram segments as the frequency saliency distribution information.
  • the generation module selects, based on the respective p values, several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes in the following manner: arranging the respective p values in ascending order; selecting the largest p value that satisfies the preset condition as the target p value, where a p value satisfies the preset condition if it is less than or equal to its corresponding target result, the target result being the product of the p value's sequence number in the arrangement and the preset threshold set for the nth layer, divided by the number of candidate n-gram segments; and selecting each candidate n-gram segment whose p value is smaller than the target p value as an n-gram segment represented by an nth-layer node.
  • the generating module is further configured to: use each node in the nth layer to record the variance and p value of each n-gram segment represented by each node.
  • any word segmentation information further includes a target vector representing the word segmentation, and the target vector is processed by local differential privacy.
  • the target vector representing the word segment is obtained in the following manner: selecting a hash function from multiple preset hash functions as a target hash function; using the target hash function to calculate a target hash value of the word segment; and determining the target vector based on the target hash value in a manner that satisfies differential privacy.
  • a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the method according to any one of the above-mentioned first aspects.
  • an electronic device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the method described in any one of the first aspects.
  • the technical solutions provided by the embodiments of this specification may have the following beneficial effects: the method and apparatus for estimating word segmentation frequency in differential privacy protection data provided by the embodiments of this specification acquire the pieces of word segmentation information reported by terminal devices and processed by local differential privacy, divide them into N groups so that each piece of word segmentation information in the same group corresponds to the same target number, determine for each group of word information a corresponding group of estimated data representing an unbiased estimate of word segmentation frequency, and, based on the groups of estimated data, generate layer by layer the nodes of each layer of a prefix tree for recording word segmentation frequency.
  • in the process of generating the nth-layer nodes of the prefix tree, this embodiment can select some candidate n-gram segments, based on their frequency saliency distribution information, as the n-gram segments represented by the nth-layer nodes, without traversing all n-gram segments composed of preset word units. This not only greatly reduces the amount of calculation and improves computational efficiency, but also makes the n-gram segments selected for the nth-layer nodes based on the frequency saliency distribution information more reasonable.
  • FIG. 1 is a schematic diagram of a scenario for estimating the frequency of word segmentation in differential privacy protection data according to an exemplary embodiment of the present specification
  • FIG. 2 is a flowchart of a method for estimating word segmentation frequency in differential privacy protection data according to an exemplary embodiment of the present specification
  • FIG. 3 is a flowchart of another method for estimating the frequency of word segmentation in differential privacy protection data shown in this specification according to an exemplary embodiment
  • FIG. 4 is a block diagram of an apparatus for estimating word segmentation frequency in differential privacy protection data according to an exemplary embodiment of the present specification
  • FIG. 5 is a schematic diagram of a prefix tree shown in this specification according to an exemplary embodiment
  • FIG. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present description.
  • the terms first, second, third, etc. may be used in this application to describe various information, but such information should not be limited by these terms; these terms are only used to distinguish information of the same type from each other.
  • first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information without departing from the scope of the present application.
  • the word "if" as used herein can be interpreted as "at the time of", "when", or "in response to determining".
  • each word segment consists of one or more word units (for example, each word segment in this scenario is composed of at most 4 word units).
  • the terminal device performs local differential privacy processing on each obtained word segment, obtains each target vector corresponding to each word segment, and generates each corresponding word segment information for each word segment.
  • the word segmentation information corresponding to any word segmentation may include the target vector corresponding to the word segmentation and the target number of word units constituting the word segmentation.
  • the terminal device reports the obtained word segmentation information to the server.
  • the server receives the word segmentation information reported by multiple terminal devices, and aggregates and groups the received word segmentation information so that the target numbers corresponding to the word segmentation information in the same group are equal. For example, the word segmentation information corresponding to word segments consisting of one word unit is placed in one group as the first group of word information, the word segmentation information corresponding to word segments composed of 2 word units is placed in another group as the second group of word information, and so on. In this scenario, a total of 4 groups of word information can be obtained, including the third and fourth groups of word information.
  • each group of estimated data representing the unbiased estimation of the word segmentation frequency corresponding to each component word information is determined.
  • the first set of estimated data may be determined based on the first set of word information, and the first set of estimated data may represent an unbiased estimate of the frequency of a word segment composed of one word unit.
  • the second group of estimated data can be determined based on the second group of word information.
  • the second group of estimated data can represent the unbiased estimation of the frequency of the word segmentation composed of two word units, and so on.
  • similarly, the third and fourth groups of estimated data can be obtained.
  • a prefix tree for recording the frequency of word segmentation can be generated and output as a result of word segmentation frequency estimation.
  • the root node of the prefix tree may be generated first as a layer 0 node.
  • each node of the nth layer is obtained based on the nth group of estimated data.
  • Each node of the nth layer corresponds to an n-gram segment consisting of n word units, and the frequency of the corresponding n-gram segment is recorded in each node of the nth layer.
  • each node of the first layer can be obtained based on the first set of estimated data.
  • Each node of the first layer corresponds to a 1-gram segment consisting of one word unit, and the frequency of the corresponding 1-gram segment is recorded in each node of the first layer.
  • Each node of the second layer can be obtained based on the second set of estimated data.
  • Each node of the second layer corresponds to a 2-gram segment composed of 2 word units, and the frequency of the corresponding 2-gram segment is recorded in each node of the second layer, and so on. In this scenario, the prefix tree also includes the third-layer and fourth-layer nodes.
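The prefix tree just described can be modeled with a minimal sketch: each level-n node represents an n-gram segment (the word units on the path from the root) and records that segment's estimated frequency. All class and field names here are illustrative, not taken from this specification.

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    word_unit: str            # "" for the level-0 root (the null character)
    frequency: float = 0.0    # estimated frequency of the represented n-gram
    children: dict = field(default_factory=dict)

    def add_child(self, word_unit, frequency):
        # Create a child node one level deeper; its represented segment is the
        # parent's segment extended by one word unit.
        node = TrieNode(word_unit, frequency)
        self.children[word_unit] = node
        return node

root = TrieNode("")                 # level 0: null character
a = root.add_child("a", 0.31)       # level 1: 1-gram "a"
ab = a.add_child("b", 0.12)         # level 2: 2-gram "ab"
```

A node's n-gram is never stored whole; it is recovered by concatenating the word units along the root-to-node path, exactly as the text describes.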
  • the method shown in FIG. 2 can be applied to a server, and the server can be implemented as any device, platform, server or device cluster with computing and processing capabilities.
  • the method includes the following steps: in step 201, each word segmentation information reported by a terminal device and processed by local differential privacy is obtained.
  • the involved terminal device may be any terminal device capable of inputting or viewing text information.
  • the terminal device may include, but is not limited to, a mobile terminal device such as a smart phone, a smart wearable device, a tablet computer, a personal digital assistant, a laptop computer, a desktop computer, and the like.
  • the server may acquire multiple pieces of word segmentation information reported by multiple terminal devices.
  • Any piece of word segmentation information corresponds to a word segment consisting of one or more word units, and may include a target vector and a target number, where the target vector represents the corresponding word segment and has undergone local differential privacy processing, and the target number represents the number of word units constituting the word segment; the target number is less than or equal to the preset value N.
  • a preset word segmentation operation may be performed on the text information input or viewed by the user, based on a preset dictionary, so as to split the text information into word units from the preset dictionary. Then, multiple word segments are formed from the split word units, each word segment being composed of one or more word units, and the number of word units constituting each word segment being less than or equal to the preset value N.
  • a word unit may be a single character or a word.
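The segmentation step above can be sketched as follows, under two stated assumptions: a word unit is taken to be a single character, and every contiguous run of at most N word units is kept as a word segment. The function name and tokenization rule are illustrative, not the specification's exact procedure.

```python
def segments(word_units, n_max):
    """Return all contiguous segments of 1..n_max word units, in order."""
    out = []
    for start in range(len(word_units)):
        for length in range(1, n_max + 1):
            if start + length > len(word_units):
                break
            out.append(tuple(word_units[start:start + length]))
    return out

# With N = 2, the text "abc" yields five segments of at most 2 word units.
print(segments(list("abc"), 2))
# → [('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]
```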
  • the terminal device performs local differential privacy processing on each word segment to obtain a target vector corresponding to each word segment.
  • the target vector corresponding to any word segment can be obtained in the following manner: first, randomly select a hash function from multiple preset hash functions as the target hash function; then, use the target hash function to perform a hash calculation on the word segment to obtain its target hash value; finally, determine the target vector based on the target hash value in a way that satisfies differential privacy.
  • assume the character string corresponding to the word segment is S.
  • a hash function Hj may be randomly selected from the k preset hash functions H1, H2, ..., Hk as the target hash function, where j is the number corresponding to Hj.
  • a random vector v is generated, taking the value 1 or -1 in each dimension; the probability that each dimension takes the value -1 is a function of ε, where ε is the preset privacy budget used to represent the privacy protection level.
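A hedged sketch of this client-side step: pick one of k preset hash functions at random, hash the segment into a bucket, and build a signed vector whose signs are randomly flipped to satisfy local differential privacy. The source does not give the exact flip probability, so `1 / (exp(eps) + 1)` below is a standard randomized-response style choice used purely as an illustrative assumption; the hash construction, k, the dimension M, and ε are likewise assumed values.

```python
import hashlib
import math
import random

K = 16          # number of preset hash functions (assumed)
M = 32          # target vector dimension (assumed)
EPS = 2.0       # preset privacy budget (assumed)

def target_hash(segment, j, m=M):
    """j-th preset hash function: map the segment to a bucket in [0, m)."""
    digest = hashlib.sha256(f"{j}:{segment}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def encode(segment, eps=EPS, k=K, m=M):
    j = random.randrange(k)                  # randomly select the target hash function
    bucket = target_hash(segment, j, m)
    v = [-1] * m
    v[bucket] = 1                            # signed one-hot at the target hash value
    p_flip = 1.0 / (math.exp(eps) + 1.0)     # assumed flip probability, not the patent's formula
    v = [-x if random.random() < p_flip else x for x in v]
    return j, v                              # (hash number, target vector) to report

j, v = encode("ab")
```

The server later needs j to know which hash function produced the report, which is why the word segmentation information also carries the hash-function number.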
  • step 202 N groups of word information are divided, so that each word segmentation information in the same group corresponds to the same target number.
  • multiple pieces of word segmentation information from multiple terminal devices can be aggregated and grouped to divide N groups of word information, so that each word segmentation information in the same group corresponds to the same target number.
  • assuming the preset value N is 3, three groups of word information G1, G2, and G3 can be obtained, where the target number corresponding to each piece of word segmentation information in group G1 is 1, in group G2 is 2, and in group G3 is 3.
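The grouping step can be sketched directly: bucket reported items by their target number (the segment's word-unit count). The record layout is illustrative.

```python
from collections import defaultdict

def group_by_target_number(reports):
    """Place each report into the group matching its target number."""
    groups = defaultdict(list)
    for report in reports:
        groups[report["target_number"]].append(report)
    return groups

reports = [
    {"target_number": 1, "vector": "..."},
    {"target_number": 2, "vector": "..."},
    {"target_number": 1, "vector": "..."},
]
groups = group_by_target_number(reports)
print(len(groups[1]), len(groups[2]))   # → 2 1
```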
  • step 203 each group of estimated data corresponding to each component word information and representing the unbiased estimation of the word segmentation frequency is determined.
  • for each group of word information, the estimated data representing the unbiased estimate of word segmentation frequency corresponding to that group can be determined. Because N groups of word information are divided, N groups of estimated data can be obtained, and the estimated data of the same group also correspond to the same target number.
  • any piece of word segmentation information may include a target vector and the target number, and may also include the number j of the target hash function used to obtain the target vector.
  • the reference vector corresponding to each piece of word segmentation information in the group of word information can be calculated by a formula in which: one term represents the reference vector corresponding to the i-th piece of word segmentation information; another is a unit vector with the same dimension as the target vector; k is the number of preset hash functions; and c is a constant expressed by a further formula of the preset privacy budget ε, which is equal to the privacy budget ε involved in the specific implementation provided in step 201.
  • the pieces of word segmentation information in the group are grouped according to the numbers of their corresponding target hash functions, and the reference vectors corresponding to word segmentation information with the same number are added together to obtain k vectors. These k vectors are taken as row vectors or column vectors and arranged in ascending order of the number to obtain the target matrix.
  • the target matrix is a group of estimated data representing the unbiased estimate of word segmentation frequency corresponding to the group of word information.
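A hedged sketch of this server-side aggregation: sum the reference vectors that share the same hash-function number, then stack the k sums as rows in ascending order of that number to form the target matrix. The per-report de-biasing formula for the reference vectors is not reproduced in the source, so they are taken here as given inputs.

```python
def build_target_matrix(reports, k, m):
    """reports: iterable of (hash number j, reference vector of length m)."""
    rows = [[0.0] * m for _ in range(k)]     # one row per hash-function number
    for j, ref_vec in reports:
        for d in range(m):
            rows[j][d] += ref_vec[d]         # sum vectors sharing the same number
    return rows                              # the k x m target matrix

matrix = build_target_matrix([(0, [1, -1]), (0, [1, 1]), (1, [-1, 1])], k=2, m=2)
print(matrix)   # → [[2.0, 0.0], [-1.0, 1.0]]
```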
  • each set of estimated data can be determined in different ways. This embodiment does not limit the specific manner of determining each group of estimated data.
  • each layer of nodes of the prefix tree for recording the frequency of word segmentation is generated layer by layer.
  • each layer of nodes of the prefix tree for recording the frequency of word segmentation may be generated layer by layer based on each set of estimated data, thereby obtaining the prefix tree. Specifically, first, the root node of the prefix tree is generated, and the root node of the prefix tree is taken as the 0th level node, and the 0th level node represents a null character. Next, starting from the first layer, the nodes of each layer of the prefix tree are generated layer by layer.
  • Step a: obtain each (n-1)-gram segment represented by each node in the (n-1)th layer. The (n-1)-gram segment represented by any node in the (n-1)th layer is formed by arranging in sequence the word units corresponding to the nodes on the path from the root node to that node. When n = 1, the 0-gram segment represented by the node at the 0th layer, that is, the null character, is obtained.
  • Step b Based on each of the n-1-grams, determine multiple candidate n-grams for the nth layer node.
  • a plurality of candidate n-grams for the n-th layer node may be determined based on each n-1-gram. Specifically, each n-1-gram segment may be used as a prefix to be combined with each preset word unit in the preset dictionary, and a plurality of n-gram word segments thus formed may be determined as a plurality of candidate n-gram word segments.
  • assume the preset dictionary contains word units A, B, and C, and that the (n-1)-gram segments represented by 2 nodes in the (n-1)th layer are x and w respectively (x and w are both (n-1)-gram segments composed of n-1 word units). Prefixed with x and combined with A, B, and C respectively, the formed n-gram segments are xA, xB, and xC; prefixed with w and combined with A, B, and C respectively, the formed n-gram segments are wA, wB, and wC. Then xA, xB, xC, wA, wB, and wC can be determined as the candidate n-gram segments.
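The candidate-generation step above in code: every (n-1)-gram prefix from the previous layer is extended with every word unit in the preset dictionary. Names are illustrative.

```python
def candidates(prefixes, dictionary):
    """Extend each (n-1)-gram prefix with each dictionary word unit."""
    return [prefix + (unit,) for prefix in prefixes for unit in dictionary]

prev = [("x",), ("w",)]                 # (n-1)-grams from the previous layer
print(candidates(prev, ["A", "B", "C"]))
# → [('x', 'A'), ('x', 'B'), ('x', 'C'), ('w', 'A'), ('w', 'B'), ('w', 'C')]
```

The candidate count is |previous layer| x |dictionary|, which is why pruning each layer by saliency (rather than keeping every candidate) keeps the tree from growing exponentially.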
  • Step c Calculate the frequency saliency distribution information of the candidate n-gram word segmentation based on the nth group of estimated data corresponding to the target number n.
  • the frequency saliency distribution information of the candidate n-grams may be calculated based on the nth group of estimated data.
  • the number of targets corresponding to the nth group of estimated data is n.
  • the frequency saliency distribution information of the candidate n-gram segments may be calculated in the following manner. First, each frequency of each candidate n-gram segment may be calculated based on the nth group of estimated data: referring to a specific implementation provided in step 203, a target matrix is obtained as a group of estimated data representing the unbiased estimate of word segmentation frequency, and the nth target matrix is the nth group of estimated data; for any candidate n-gram segment D, its frequency can be estimated from the nth target matrix using the k preset hash functions H1, H2, ..., Hk.
  • each variance corresponding to each candidate n-gram may be calculated based on each frequency of each candidate n-gram, so as to obtain a standard deviation corresponding to each variance.
  • each z value corresponding to each candidate n-gram is calculated. Wherein, for any candidate n-gram segment, the frequency of the candidate n-gram segment is divided by the standard deviation corresponding to the candidate n-gram segment to obtain the z value corresponding to the candidate n-gram segment.
  • each p value corresponding to each candidate n-gram segment can be calculated based on the respective z values (the p value corresponding to a sample is the probability of drawing that sample, or a sample more extreme than it), as the frequency saliency distribution information corresponding to the candidate n-gram segments. Specifically, the p value corresponding to a candidate n-gram segment is the probability that a random variable following the standard normal distribution is greater than the z value corresponding to that candidate n-gram segment.
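The z-to-p step just described can be written out directly: z is the estimated frequency divided by its standard deviation, and p is the upper-tail probability of a standard normal at z, computed here with the complementary error function. The numeric inputs are illustrative.

```python
import math

def z_value(frequency, variance):
    """z value: estimated frequency divided by its standard deviation."""
    return frequency / math.sqrt(variance)

def p_value(z):
    """P(Z > z) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

z = z_value(frequency=0.2, variance=0.01)   # std dev 0.1 → z = 2.0
p = p_value(z)                              # ≈ 0.0228
```

A small p thus means the estimated frequency is many standard deviations above zero, i.e. the candidate's frequency is significantly non-zero despite the noise added for privacy.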
  • other frequency-based indicators may also be selected as the saliency distribution information; for example, the aforementioned z values, or other statistical distribution quantities determined based on the z values, may be used as the saliency distribution information.
  • Step d: based on the above frequency saliency distribution information, select several candidate n-gram segments as the n-gram segments represented by the nth-layer nodes, and use each node in the nth layer to record the frequency of the n-gram segment it represents.
  • At least one candidate n-gram segment that satisfies a certain condition may be selected from a plurality of candidate n-gram segments, as the n-gram segment represented by the nth layer node, and Use each node in the nth layer to record the frequency of each n-gram segment represented by each node.
  • each node in the nth layer can also be used to record the variance and p-value of each n-gram segment represented by each node.
  • assume the candidate n-gram segments include A, B, C, D, E, and F, and that A, B, and C, which satisfy a certain condition, are selected from the candidate n-gram segments as the n-gram segments represented by the nth-layer nodes. Then three nodes a, b, and c of the nth layer are generated, representing A, B, and C respectively.
  • node a records the frequency, variance, and p value of A
  • node b records the frequency, variance, and p value of B
  • node c records the frequency, variance, and p value of C.
  • the method for estimating word segmentation frequency in differential privacy protection data acquires the pieces of word segmentation information reported by terminal devices and processed by local differential privacy, divides them into N groups so that each piece of word segmentation information in the same group corresponds to the same target number, determines for each group of word information a corresponding group of estimated data representing an unbiased estimate of word segmentation frequency, and, based on the groups of estimated data, generates layer by layer the nodes of each layer of the prefix tree used to record word segmentation frequency.
  • in the process of generating the nth-layer nodes of the prefix tree, this embodiment can select some candidate n-gram segments, based on their frequency saliency distribution information, as the n-gram segments represented by the nth-layer nodes, without traversing all n-gram segments composed of preset word units. This not only greatly reduces the amount of calculation and improves computational efficiency, but also makes the n-gram segments selected for the nth-layer nodes based on the frequency saliency distribution information more reasonable.
  • The embodiment of FIG. 3 describes the process of selecting several candidate n-grams as the n-grams represented by the layer-n nodes.
  • The method can be applied to a server and includes the following steps. In step 301, the p-values corresponding to the candidate n-grams are arranged in ascending order.
  • In step 302, the largest p-value that satisfies a preset condition is selected as the target p-value.
  • A p-value satisfies the preset condition if it is less than or equal to its target result, where the target result is the product of the p-value's rank in the arrangement and the preset threshold set for layer n, divided by the number of candidate n-grams.
  • For example, with the p-values arranged in ascending order and p_i denoting the i-th p-value, p_i satisfies the preset condition if p_i ≤ (i/N)*α_n. The largest p-value satisfying the preset condition is taken as the target p-value.
  • Here α_n is a preset threshold set for the layer-n nodes; generally, the larger n is, the larger α_n is.
  • In step 303, the candidate n-grams whose p-values are smaller than the target p-value are selected as the n-grams represented by the layer-n nodes.
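The selection rule above (sort p-values ascending, take the largest p_i with p_i ≤ (i/N)·α_n, then keep the candidates below that target p-value) mirrors a Benjamini–Hochberg style procedure. A minimal sketch follows; the function name and inputs are my own, and I take the denominator in the condition to be the number of candidates, as the definition of "target result" states:

```python
def select_candidates(p_values, alpha_n):
    """Select candidate n-grams whose p-values pass the layer-n threshold.

    p_values: dict mapping candidate n-gram -> p-value.
    alpha_n:  preset threshold for layer n (larger n, larger alpha_n).
    Returns the candidates whose p-value is strictly below the target p-value.
    """
    items = sorted(p_values.items(), key=lambda kv: kv[1])  # ascending by p
    n_cand = len(items)
    target = None
    # Largest p_i (1-based rank i) satisfying p_i <= (i / n_cand) * alpha_n.
    for i, (_, p) in enumerate(items, start=1):
        if p <= (i / n_cand) * alpha_n:
            target = p
    if target is None:
        return []
    # Keep every candidate whose p-value is smaller than the target p-value,
    # as step 303 specifies.
    return [gram for gram, p in items if p < target]
```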
  • In this method for estimating word-segment frequency in differential-privacy-protected data, the p-values corresponding to the candidate n-grams are arranged in ascending order, the largest p-value satisfying the preset condition is selected as the target p-value, and the candidate n-grams whose p-values are smaller than the target p-value are selected as the n-grams represented by the layer-n nodes.
  • Selecting the layer-n n-grams based on the candidates' p-values in this way makes the selection further reasonable.
  • Optionally, the method may further include: using each node in layer n to record the variance and p-value of the n-gram it represents.
  • An application scenario may be as follows: the preset dictionary includes the word units A, B, C, and D, and the server needs to estimate, based on this dictionary, the frequencies of 1-grams, 2-grams, and 3-grams in the differential-privacy-protected data.
  • The server obtains the pieces of word-segment information reported by terminal devices after local differential privacy processing, and divides out a first, second, and third group of word-segment information.
  • Every piece in the first group corresponds to a target number of 1.
  • Every piece in the second group corresponds to a target number of 2.
  • Every piece in the third group corresponds to a target number of 3.
  • The first, second, and third groups of estimation data, each representing an unbiased estimate of word-segment frequency, are then determined from the corresponding groups of word-segment information.
  • The nodes of each layer of the prefix tree used to record word-segment frequency are then generated layer by layer.
  • The root node of the prefix tree can be generated first as the layer-0 node, which represents the empty string.
  • The word units A, B, C, and D in the preset dictionary are determined as four candidate 1-grams. The frequency significance distribution information of each candidate 1-gram is computed, and based on it, A, B, C, and D are all selected as the 1-grams represented by the layer-1 nodes.
  • A child node a of the root node is constructed; node a represents the 1-gram A and records its frequency.
  • A child node b of the root node is constructed; node b represents the 1-gram B and records its frequency.
  • A child node c of the root node is constructed; node c represents the 1-gram C and records its frequency.
  • A child node d of the root node is constructed; node d represents the 1-gram D and records its frequency.
  • Nodes a, b, c, and d are the layer-1 nodes of the prefix tree.
  • Node ab represents the 2-gram AB and records its frequency; node ac represents the 2-gram AC and records its frequency.
  • Node bc represents the 2-gram BC and records its frequency.
  • Node bd represents the 2-gram BD and records its frequency.
  • Nodes ab, ac, bc, and bd are the layer-2 nodes of the prefix tree.
  • Node aba represents the 3-gram ABA and records its frequency.
  • Node abb represents the 3-gram ABB and records its frequency.
  • Node acc represents the 3-gram ACC and records its frequency.
  • Node acd represents the 3-gram ACD and records its frequency.
  • Node bdc represents the 3-gram BDC and records its frequency.
  • Node bdd represents the 3-gram BDD and records its frequency.
  • Nodes aba, abb, acc, acd, bdc, and bdd are the layer-3 nodes of the prefix tree.
  • In this way, the nodes of each layer of the prefix tree used to record word-segment frequency are generated layer by layer.
  • In the process of generating the layer-n nodes, some candidate n-grams are selected, based on their frequency significance distribution information, as the n-grams represented by the layer-n nodes, so there is no need to traverse all n-grams composed of the preset word units. This not only greatly reduces the amount of computation and improves computational efficiency, but also makes the n-grams selected on the basis of frequency significance distribution information more reasonable.
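The layer-by-layer candidate generation in the example above can be sketched as follows: each selected (n-1)-gram is extended with every dictionary word unit, so the layer-3 candidates come only from the four selected 2-grams (16 candidates) rather than from all 4³ = 64 possible 3-grams. The function name is my own illustration:

```python
def candidate_ngrams(prev_grams, word_units):
    """Extend each selected (n-1)-gram with every preset word unit to form
    the candidate n-grams for the next layer; no traversal of all possible
    n-grams is needed."""
    return [g + u for g in prev_grams for u in word_units]

units = ["A", "B", "C", "D"]
layer2_selected = ["AB", "AC", "BC", "BD"]        # after significance filtering
layer3_candidates = candidate_ngrams(layer2_selected, units)  # 16 candidates
```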
  • This specification also provides embodiments of an apparatus for estimating word-segment frequency in differential-privacy-protected data.
  • The apparatus shown in FIG. 5 is applied to a server and may include an obtaining module 501, a grouping module 502, a determining module 503, and a generation module 504.
  • The obtaining module 501 is configured to obtain the pieces of word-segment information reported by terminal devices after local differential privacy processing. Each piece corresponds to one word segment and includes a target number indicating how many word units the segment contains, the target number being less than or equal to a preset value N.
  • The grouping module 502 is configured to divide out N groups of word-segment information so that all pieces in a group correspond to the same target number.
  • The determining module 503 is configured to determine, for each group, the estimation data representing an unbiased estimate of word-segment frequency.
  • The generation module 504 is configured to generate, layer by layer and based on the groups of estimation data, the nodes of each layer of the prefix tree used to record word-segment frequency. The generation module 504 generates the layer-n nodes as follows: obtain the (n-1)-grams represented by the nodes of layer n-1, where the (n-1)-gram represented by any layer-(n-1) node is formed by arranging in order the word units on the path from the root node to that node; based on these (n-1)-grams, determine multiple candidate n-grams for the layer-n nodes; based on the n-th group of estimation data, whose target number is n, compute the frequency significance distribution information of the candidate n-grams; based on this information, select several candidate n-grams as the n-grams represented by the layer-n nodes, and use each layer-n node to record the frequency of the n-gram it represents; 1 ≤ n ≤ N.
  • The root node of the prefix tree is the layer-0 node, which represents the empty string.
  • The generation module 504 determines the multiple candidate n-grams for the layer-n nodes from the (n-1)-grams as follows: the n-grams formed by using each (n-1)-gram as a prefix and appending each preset word unit in the preset dictionary are determined as the multiple candidate n-grams.
  • The generation module 504 computes the frequency significance distribution information of the candidate n-grams from the n-th group of estimation data as follows: based on the n-th group of estimation data, compute the frequency of each candidate n-gram; based on the frequencies, compute the variance corresponding to each candidate n-gram; based on the variances, compute the frequency significance distribution information of the candidate n-grams.
  • The generation module 504 computes the frequency significance distribution information from the variances as follows: based on the variances, compute the z-value corresponding to each candidate n-gram; based on the z-values, compute the p-value corresponding to each candidate n-gram as the frequency significance distribution information.
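The z-to-p computation described here (z = frequency divided by the standard deviation, p = probability that a standard-normal variable exceeds z) can be sketched with the standard-normal tail expressed via the complementary error function. Function and variable names are my own:

```python
import math

def p_values_from_stats(freqs, variances):
    """Compute z = frequency / std-dev per candidate n-gram, then the
    one-sided p-value P(Z > z) under the standard normal distribution."""
    p_vals = {}
    for gram, f in freqs.items():
        std = math.sqrt(variances[gram])
        z = f / std
        # Standard-normal upper-tail probability via erfc.
        p_vals[gram] = 0.5 * math.erfc(z / math.sqrt(2))
    return p_vals
```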
  • The generation module 504 selects several candidate n-grams as the layer-n n-grams based on the frequency significance distribution information as follows: based on the p-values, select several candidate n-grams as the n-grams represented by the layer-n nodes.
  • The generation module 504 selects several candidate n-grams from the p-values as follows: arrange the p-values in ascending order and select the largest p-value that satisfies the preset condition as the target p-value. A p-value satisfies the preset condition if it is less than or equal to its target result, where the target result is the product of the p-value's rank in the arrangement and the preset threshold set for layer n, divided by the number of candidate n-grams. The candidate n-grams whose p-values are smaller than the target p-value are selected as the n-grams represented by the layer-n nodes.
  • The generation module 504 is further configured to: use each node in layer n to record the variance and p-value of the n-gram it represents.
  • Any piece of word-segment information may further include a target vector representing the word segment, the target vector having undergone local differential privacy processing.
  • The target vector representing a word segment is obtained as follows: select one hash function from multiple preset hash functions as the target hash function, use the target hash function to compute the target hash value of the word segment, and determine the target vector from the target hash value in a way that satisfies differential privacy.
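The client-side encoding just described (pick a random hash function, hash the segment, build a random ±1 vector, flip the entry at the hashed position) can be sketched as below. Several values are my assumptions, not the patent's: the sketch width m=16, k=4 hash functions derived from SHA-256, and a flip probability of 1/(e^ε + 1) in the spirit of count-mean-sketch encodings — the patent gives its exact probability formula only as an image:

```python
import hashlib
import math
import random

def encode_word(word, epsilon, m=16, k=4, seed=None):
    """Client-side encoding sketch: pick one of k hash functions at random,
    hash the word into [0, m), build a random +/-1 vector of length m (each
    entry is -1 with probability 1/(e^eps + 1) -- an assumed value), then
    flip the entry at the hashed position.  Returns (hash index j, vector)."""
    rng = random.Random(seed)
    j = rng.randrange(k)                              # which hash function
    digest = hashlib.sha256(f"{j}:{word}".encode()).digest()
    h = int.from_bytes(digest[:8], "big") % m         # target hash value
    p_neg = 1.0 / (math.exp(epsilon) + 1.0)           # assumed flip probability
    v = [-1 if rng.random() < p_neg else 1 for _ in range(m)]
    v[h] = -v[h]                                      # flip the h-th entry
    return j, v
```

The pair (j, v), together with the target number of word units, corresponds to one piece of reported word-segment information.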
  • The above apparatus may be preset in the server or loaded into the server by downloading or similar means.
  • The corresponding modules of the apparatus can cooperate with modules in the server to implement the scheme for estimating word-segment frequency in differential-privacy-protected data.
  • One or more embodiments of this specification further provide a computer-readable storage medium storing a computer program that can be used to execute the method for estimating word-segment frequency in differential-privacy-protected data provided by any of the foregoing embodiments of FIG. 2 to FIG. 3.
  • The electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may of course also include hardware required by other services.
  • The processor reads the corresponding computer program from the non-volatile memory into the memory and runs it, forming, at the logical level, an apparatus for estimating word-segment frequency in differential-privacy-protected data.
  • One or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
  • A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.


Abstract

This specification provides a method, apparatus, and electronic device for estimating word-segment frequency in differential-privacy-protected data. According to the method, the pieces of word-segment information reported by terminal devices after local differential privacy processing are obtained; N groups of word-segment information are divided out so that all pieces in a group correspond to the same target number; for each group, estimation data representing an unbiased estimate of word-segment frequency is determined; and based on these groups of estimation data, the nodes of each layer of a prefix tree used to record word-segment frequency are generated layer by layer. When generating the layer-n nodes of the prefix tree, some candidate n-grams are selected as the n-grams represented by the layer-n nodes, without traversing all n-grams composed of the preset word units. This not only greatly reduces the amount of computation and improves computational efficiency, but also makes the n-grams selected on the basis of the frequency significance distribution information more reasonable.

Description

Method and apparatus for estimating word-segment frequency in differential-privacy-protected data. Technical Field
One or more embodiments of this specification relate to the technical field of data mining, and in particular to a method and apparatus for estimating word-segment frequency in differential-privacy-protected data.
Background
Text information entered or viewed by users through terminal devices (such as messages, chat records, and search records) can directly or indirectly reflect users' characteristics and preferences, and is extremely important for data mining and analysis. However, such text information also involves users' personal privacy. Therefore, the terminal device can usually apply local differential privacy processing to the text information the user enters or views to obtain differential-privacy-protected data and report it to a server, and the server estimates the word-segment frequency (the number of times a word segment occurs in the text) in the differential-privacy-protected data. In this process, how to estimate word-segment frequency efficiently and reasonably with a small amount of computation becomes particularly important in the field of data mining.
Summary
To solve one of the above technical problems, one or more embodiments of this specification provide a method and apparatus for estimating word-segment frequency in differential-privacy-protected data.
According to a first aspect, a method for estimating word-segment frequency in differential-privacy-protected data is provided, applied to a server and including: obtaining the pieces of word-segment information reported by terminal devices after local differential privacy processing, where each piece corresponds to one word segment and includes a target number indicating how many word units the segment contains, the target number being less than or equal to a preset value N; dividing out N groups of word-segment information so that all pieces in a group correspond to the same target number; determining, for each group, estimation data representing an unbiased estimate of word-segment frequency; and generating, layer by layer and based on the groups of estimation data, the nodes of each layer of a prefix tree used to record word-segment frequency, where generating the layer-n nodes includes: obtaining the (n-1)-grams represented by the nodes of layer n-1, where the (n-1)-gram represented by any layer-(n-1) node is formed by arranging in order the word units on the path from the root node to that node; determining, based on these (n-1)-grams, multiple candidate n-grams for the layer-n nodes; computing the frequency significance distribution information of the candidate n-grams based on the n-th group of estimation data, whose target number is n; and selecting, based on the frequency significance distribution information, several candidate n-grams as the n-grams represented by the layer-n nodes, and using each layer-n node to record the frequency of the n-gram it represents; 1 ≤ n ≤ N.
Optionally, the root node of the prefix tree is the layer-0 node, and the layer-0 node represents an empty string.
Optionally, determining the multiple candidate n-grams for the layer-n nodes based on the (n-1)-grams includes: determining, as the multiple candidate n-grams, the n-grams formed by using each (n-1)-gram as a prefix and appending each preset word unit in a preset dictionary.
Optionally, computing the frequency significance distribution information of the candidate n-grams based on the n-th group of estimation data includes: computing the frequency of each candidate n-gram based on the n-th group of estimation data; computing, based on the frequencies, the variance corresponding to each candidate n-gram; and computing, based on the variances, the frequency significance distribution information of the candidate n-grams.
Optionally, computing the frequency significance distribution information based on the variances includes: computing, based on the variances, the z-value corresponding to each candidate n-gram; and computing, based on the z-values, the p-value corresponding to each candidate n-gram as the frequency significance distribution information; where selecting several candidate n-grams based on the frequency significance distribution information includes: selecting, based on the p-values, several candidate n-grams as the n-grams represented by the layer-n nodes.
Optionally, selecting several candidate n-grams based on the p-values includes: arranging the p-values in ascending order; selecting the largest p-value that satisfies a preset condition as the target p-value, where a p-value satisfies the preset condition if it is less than or equal to its target result, the target result being the product of the p-value's rank in the arrangement and the preset threshold set for layer n, divided by the number of candidate n-grams; and selecting the candidate n-grams whose p-values are smaller than the target p-value as the n-grams represented by the layer-n nodes.
Optionally, the method further includes: using each node in layer n to record the variance and p-value of the n-gram it represents.
Optionally, each piece of word-segment information further includes a target vector representing the word segment, the target vector having undergone local differential privacy processing.
Optionally, the target vector representing the word segment is obtained as follows: selecting one hash function from multiple preset hash functions as the target hash function; computing the target hash value of the word segment with the target hash function; and determining the target vector from the target hash value in a way that satisfies differential privacy.
According to a second aspect, an apparatus for estimating word-segment frequency in differential-privacy-protected data is provided, applied to a server and including: an obtaining module, configured to obtain the pieces of word-segment information reported by terminal devices after local differential privacy processing, where each piece corresponds to one word segment and includes a target number indicating how many word units the segment contains, the target number being less than or equal to a preset value N; a grouping module, configured to divide out N groups of word-segment information so that all pieces in a group correspond to the same target number; a determining module, configured to determine, for each group, estimation data representing an unbiased estimate of word-segment frequency; and a generation module, configured to generate, layer by layer and based on the groups of estimation data, the nodes of each layer of a prefix tree used to record word-segment frequency, where the generation module generates the layer-n nodes as follows: obtaining the (n-1)-grams represented by the nodes of layer n-1, where the (n-1)-gram represented by any layer-(n-1) node is formed by arranging in order the word units on the path from the root node to that node; determining, based on these (n-1)-grams, multiple candidate n-grams for the layer-n nodes; computing the frequency significance distribution information of the candidate n-grams based on the n-th group of estimation data, whose target number is n; and selecting, based on the frequency significance distribution information, several candidate n-grams as the n-grams represented by the layer-n nodes, and using each layer-n node to record the frequency of the n-gram it represents; 1 ≤ n ≤ N.
Optionally, the root node of the prefix tree is the layer-0 node, and the layer-0 node represents an empty string.
Optionally, the generation module determines the multiple candidate n-grams for the layer-n nodes based on the (n-1)-grams as follows: determining, as the multiple candidate n-grams, the n-grams formed by using each (n-1)-gram as a prefix and appending each preset word unit in a preset dictionary.
Optionally, the generation module computes the frequency significance distribution information of the candidate n-grams based on the n-th group of estimation data as follows: computing the frequency of each candidate n-gram based on the n-th group of estimation data; computing, based on the frequencies, the variance corresponding to each candidate n-gram; and computing, based on the variances, the frequency significance distribution information of the candidate n-grams.
Optionally, the generation module computes the frequency significance distribution information based on the variances as follows: computing, based on the variances, the z-value corresponding to each candidate n-gram; and computing, based on the z-values, the p-value corresponding to each candidate n-gram as the frequency significance distribution information; where the generation module selects several candidate n-grams based on the frequency significance distribution information as follows: selecting, based on the p-values, several candidate n-grams as the n-grams represented by the layer-n nodes.
Optionally, the generation module selects several candidate n-grams based on the p-values as follows: arranging the p-values in ascending order; selecting the largest p-value that satisfies a preset condition as the target p-value, where a p-value satisfies the preset condition if it is less than or equal to its target result, the target result being the product of the p-value's rank in the arrangement and the preset threshold set for layer n, divided by the number of candidate n-grams; and selecting the candidate n-grams whose p-values are smaller than the target p-value as the n-grams represented by the layer-n nodes.
Optionally, the generation module is further configured to: use each node in layer n to record the variance and p-value of the n-gram it represents.
Optionally, each piece of word-segment information further includes a target vector representing the word segment, the target vector having undergone local differential privacy processing.
Optionally, the target vector representing the word segment is obtained as follows: selecting one hash function from multiple preset hash functions as the target hash function; computing the target hash value of the word segment with the target hash function; and determining the target vector from the target hash value in a way that satisfies differential privacy.
According to a third aspect, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program, when executed by a processor, implements the method of any one of the implementations of the first aspect.
According to a fourth aspect, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the program, implements the method of any one of the implementations of the first aspect.
The technical solutions provided by the embodiments of this specification may include the following beneficial effects. In the method and apparatus for estimating word-segment frequency in differential-privacy-protected data provided by the embodiments of this specification, the pieces of word-segment information reported by terminal devices after local differential privacy processing are obtained, N groups of word-segment information are divided out so that all pieces in a group correspond to the same target number, estimation data representing an unbiased estimate of word-segment frequency is determined for each group, and the nodes of each layer of a prefix tree used to record word-segment frequency are generated layer by layer based on the groups of estimation data. Because, when generating the layer-n nodes of the prefix tree, some candidate n-grams are selected as the n-grams represented by those nodes based on their frequency significance distribution information, there is no need to traverse all n-grams composed of the preset word units; this not only greatly reduces the amount of computation and improves computational efficiency, but also makes the n-grams selected on the basis of frequency significance distribution information more reasonable.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a scenario of estimating word-segment frequency in differential-privacy-protected data according to an exemplary embodiment of this specification;
FIG. 2 is a flowchart of a method for estimating word-segment frequency in differential-privacy-protected data according to an exemplary embodiment of this specification;
FIG. 3 is a flowchart of another method for estimating word-segment frequency in differential-privacy-protected data according to an exemplary embodiment of this specification;
FIG. 4 is a block diagram of an apparatus for estimating word-segment frequency in differential-privacy-protected data according to an exemplary embodiment of this specification;
FIG. 5 is a schematic diagram of a prefix tree according to an exemplary embodiment of this specification;
FIG. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment of this specification.
Detailed Description
Exemplary embodiments are described in detail here, and examples thereof are shown in the drawings. When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of apparatuses and methods consistent with some aspects of this specification as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this application. The singular forms "a", "the", and "said" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this application to describe various pieces of information, the information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be called second information, and similarly second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
As shown in FIG. 1, in the scenario of FIG. 1, a user enters text information on a terminal device, and the terminal device performs word segmentation on the entered text to obtain multiple word segments, each composed of one or more word units (for example, in this scenario each segment consists of at most 4 word units). The terminal device applies local differential privacy processing to the obtained segments to get a target vector for each segment, and generates a piece of word-segment information for each segment. The word-segment information for any segment may include the segment's target vector and the target number of word units composing the segment. The terminal device reports the pieces of word-segment information to the server.
The server receives the word-segment information reported by multiple terminal devices, aggregates it, and groups it so that pieces in the same group correspond to the same target number. For example, the information of segments composed of 1 word unit forms the first group, the information of segments composed of 2 word units forms the second group, and so on; in this scenario a total of 4 groups are obtained, including a third and a fourth group of word-segment information.
Then, for each group of word-segment information, estimation data representing an unbiased estimate of word-segment frequency is determined. For example, the first group of estimation data, determined from the first group of word-segment information, can represent an unbiased frequency estimate for segments composed of 1 word unit; the second group of estimation data, determined from the second group, can represent an unbiased frequency estimate for segments composed of 2 word units; and so on — in this scenario a total of 4 groups of estimation data are obtained, including a third and a fourth group.
Finally, based on the groups of estimation data, a prefix tree used to record word-segment frequency can be generated and output as the result of the frequency estimation. Specifically, the root node of the prefix tree is generated first as the layer-0 node. Then the nodes of layer n are obtained from the n-th group of estimation data; each layer-n node corresponds to one n-gram composed of n word units and records that n-gram's frequency. For example, the layer-1 nodes are obtained from the first group of estimation data, each corresponding to a 1-gram composed of 1 word unit and recording its frequency; the layer-2 nodes are obtained from the second group of estimation data, each corresponding to a 2-gram composed of 2 word units and recording its frequency; and so on — in this scenario the prefix tree also includes layer-3 and layer-4 nodes.
The solution provided by this specification is described in detail below with reference to specific embodiments.
As shown in FIG. 2, the method of FIG. 2 can be applied to a server, which may be implemented as any device, platform, server, or device cluster with computing and processing capability. The method includes the following steps. In step 201, the pieces of word-segment information reported by terminal devices after local differential privacy processing are obtained.
In this embodiment, the terminal device involved may be any terminal device capable of entering or viewing text information. Those skilled in the art will understand that the terminal device may include, but is not limited to, mobile terminal devices such as smartphones, smart wearable devices, tablet computers, personal digital assistants, laptop computers, desktop computers, and the like.
In this embodiment, the server may obtain multiple pieces of word-segment information reported by multiple terminal devices. Any piece corresponds to a word segment composed of one or more word units and may include a target vector and a target number, where the target vector represents the corresponding segment and has undergone local differential privacy processing, and the target number indicates the number of word units composing the segment and is less than or equal to a preset value N.
Specifically, in this embodiment, for any terminal device, a preset word segmentation operation may first be performed, based on a preset dictionary, on the text information entered or viewed by the user, to decompose the text into the word units of the dictionary. Multiple word segments are then composed from the decomposed word units, each segment consisting of one or more word units, with the number of word units in each segment less than or equal to the preset value N. A word unit may be a character or a word.
For example, the user enters the text 以太坊中有两种不同的数据类型 ("There are two different data types in Ethereum") and the preset value N = 3. The text can be decomposed into the word units 以太坊, 中, 有, 不同的, 数据, and 类型, and the following segments can then be composed: 以太坊, 中, 有, 不同的, 数据, 类型, 以太坊|中, 中|有, 有|不同的, 不同的|数据, 数据|类型, 以太坊|中|有, 中|有|不同的, 有|不同的|数据, 不同的|数据|类型, where the symbol "|" separates word units.
Next, after obtaining the above segments, the terminal device applies local differential privacy processing to each segment to obtain each segment's target vector. The target vector for any segment can be obtained as follows: first, randomly select one hash function from multiple preset hash functions as the target hash function; compute the segment's target hash value with the target hash function; finally, determine the target vector from the target hash value in a way that satisfies differential privacy.
For example, in one specific implementation, the string corresponding to the segment is S. One hash function H_j can be randomly selected from k preset hash functions H_1, H_2, ..., H_k as the target hash function, where j is the index of H_j. H_j is applied to the segment's string S to obtain the target hash value h = H_j(S). In addition, a random vector v is generated whose value in each dimension is 1 or -1, where the probability of taking -1 in each dimension is given by the formula in the original document (image PCTCN2022073677-appb-000001), in which ε is the preset privacy budget representing the privacy protection level. The h-th bit of v is then flipped to obtain the target vector: if the h-th bit of v is 1, it is flipped to -1; if it is -1, it is flipped to 1.
It can be understood that the above specific implementation is merely an illustrative example; the segment can also be processed with local differential privacy in any other reasonable way to obtain the target vector, and this embodiment does not limit the specific way in which the target vector is obtained.
In step 202, N groups of word-segment information are divided out so that all pieces in a group correspond to the same target number.
In this embodiment, the pieces of word-segment information from multiple terminal devices can be aggregated and grouped into N groups so that all pieces in a group correspond to the same target number. For example, with the preset value N = 3, three groups G1, G2, and G3 can be divided out, where every piece in G1 corresponds to a target number of 1, every piece in G2 to a target number of 2, and every piece in G3 to a target number of 3.
In step 203, for each group of word-segment information, the estimation data representing an unbiased estimate of word-segment frequency is determined.
In this embodiment, for any group of word-segment information, the corresponding estimation data representing an unbiased estimate of word-segment frequency can be determined. Since N groups of word-segment information are divided out, N groups of estimation data can be obtained, and the estimation data of the same group also corresponds to the same target number.
For example, with reference to the specific implementation provided in step 201, any piece of word-segment information may include a target vector and a target number, and may further include the index j of the target hash function used to obtain the target vector. The reference vector corresponding to each piece of word-segment information in the group can be computed with the formula given in the original document (image PCTCN2022073677-appb-000002), where the symbol in image PCTCN2022073677-appb-000003 denotes the target vector of the i-th piece of word-segment information in the group, the symbol in image PCTCN2022073677-appb-000004 denotes the reference vector of the i-th piece, the symbol in image PCTCN2022073677-appb-000005 is a unit vector of the same dimension as the target vector, k is the number of preset hash functions, and c is a constant expressed by the formula in image PCTCN2022073677-appb-000006, in which ε is the preset privacy budget, equal to the privacy budget ε involved in the specific implementation of step 201.
Next, within the group, the reference vectors of the pieces of word-segment information sharing the same target-hash-function index are summed, yielding k vectors. These k vectors, taken as row or column vectors and arranged in ascending order of index, form the target matrix. This target matrix is the group of estimation data representing an unbiased estimate of word-segment frequency for that group of word-segment information.
It can be understood that the above example is merely one way of determining the groups of estimation data, given for the specific implementation involved in step 201. In fact, for different differential privacy processing methods, the groups of estimation data can be determined in different ways, and this embodiment does not limit the specific way of determining them.
In step 204, based on the groups of estimation data, the nodes of each layer of the prefix tree used to record word-segment frequency are generated layer by layer.
In this embodiment, the nodes of each layer of the prefix tree can be generated layer by layer based on the groups of estimation data, thereby obtaining the prefix tree. Specifically, the root node of the prefix tree is generated first as the layer-0 node, which represents an empty string; then, starting from layer 1, the nodes of each layer are generated layer by layer.
Specifically, the layer-n nodes are generated through the following steps a to d, where n is an integer with 1 ≤ n ≤ N. Step a: obtain the (n-1)-grams represented by the nodes of layer n-1.
In this embodiment, the (n-1)-grams represented by the nodes of layer n-1 are obtained, where the (n-1)-gram represented by any layer-(n-1) node is formed by arranging in order the word units on the path from the root node to that node. If n = 1, the 0-gram represented by the layer-0 node, i.e., the empty string, is obtained.
Step b: based on the (n-1)-grams, determine multiple candidate n-grams for the layer-n nodes.
In this embodiment, the multiple candidate n-grams for the layer-n nodes can be determined from the (n-1)-grams. Specifically, each (n-1)-gram can be used as a prefix and combined with each preset word unit in the preset dictionary, and the n-grams thus formed are determined as the multiple candidate n-grams. For example, if two layer-(n-1) nodes represent the (n-1)-grams x and w (both composed of n-1 word units) and the preset dictionary contains the word units A, B, and C, then combining the prefix x with A, B, and C yields the n-grams xA, xB, and xC, and combining the prefix w with A, B, and C yields wA, wB, and wC; xA, xB, xC, wA, wB, and wC can be determined as the candidate n-grams.
Step c: based on the n-th group of estimation data, whose target number is n, compute the frequency significance distribution information of the candidate n-grams.
In this embodiment, the frequency significance distribution information of the candidate n-grams can be computed from the n-th group of estimation data, whose corresponding target number is n. Specifically, it can be computed as follows. First, the frequency of each candidate n-gram is computed from the n-th group of estimation data. With reference to the specific implementation provided in step 203, a target matrix is obtained as a group of estimation data representing an unbiased frequency estimate, the n-th matrix being the n-th group of estimation data. For any candidate n-gram D, the k preset hash functions H_1, H_2, ..., H_k can be applied to D to obtain the target hash values H_1(D), H_2(D), ..., H_k(D). The target elements in the columns of the n-th matrix indexed by these target hash values are then looked up, their average is computed, and the frequency of the candidate n-gram D is obtained from this average.
Next, based on the frequencies of the candidate n-grams, the variance corresponding to each candidate n-gram can be computed, giving the standard deviation corresponding to each variance. Based on the variances, the z-value corresponding to each candidate n-gram is computed: for any candidate n-gram, its frequency divided by its standard deviation gives its z-value.
Finally, based on the z-values, the p-value corresponding to each candidate n-gram can be computed (the p-value of a sample is the probability of drawing that sample or a more extreme one) as the frequency significance distribution information of the candidate n-grams. For any candidate n-gram, its p-value is the probability that a standard-normal random variable exceeds its z-value.
In other embodiments, other frequency-based indicators can also be selected as the significance distribution information, for example the aforementioned z-values, or other statistical distribution quantities determined from the z-values.
Step d: based on the frequency significance distribution information, select several candidate n-grams as the n-grams represented by the layer-n nodes, and use each layer-n node to record the frequency of the n-gram it represents.
In this embodiment, based on the frequency significance distribution information, at least one candidate n-gram that satisfies a certain condition can be selected from the multiple candidate n-grams as the n-grams represented by the layer-n nodes, and each layer-n node records the frequency of the n-gram it represents. In embodiments that use p-values as the significance distribution information, each layer-n node can also record the variance and p-value of the n-gram it represents.
For example, if the candidate n-grams are A, B, C, D, E, and F, then A, B, and C, which satisfy the condition, can be selected from them based on the frequency significance distribution information as the n-grams represented by the layer-n nodes. Three layer-n nodes a, b, and c are then generated, representing A, B, and C respectively, with node a recording the frequency, variance, and p-value of A, node b those of B, and node c those of C.
In the method for estimating word-segment frequency in differential-privacy-protected data provided by the above embodiment of this specification, the pieces of word-segment information reported by terminal devices after local differential privacy processing are obtained, N groups of word-segment information are divided out so that all pieces in a group correspond to the same target number, estimation data representing an unbiased estimate of word-segment frequency is determined for each group, and the nodes of each layer of a prefix tree used to record word-segment frequency are generated layer by layer based on the groups of estimation data. Because, when generating the layer-n nodes of the prefix tree, this embodiment can select some candidate n-grams as the n-grams represented by those nodes based on their frequency significance distribution information, there is no need to traverse all n-grams composed of the preset word units; this not only greatly reduces the amount of computation and improves computational efficiency, but also makes the n-grams selected on the basis of frequency significance distribution information more reasonable.
As shown in FIG. 3, the embodiment of FIG. 3 describes the process of selecting several candidate n-grams as the n-grams represented by the layer-n nodes. The method can be applied to a server and includes the following steps. In step 301, the p-values corresponding to the candidate n-grams are arranged in ascending order.
In step 302, the largest p-value that satisfies a preset condition is selected as the target p-value.
In this embodiment, for any p-value, if it is less than or equal to its target result, it satisfies the preset condition, where the target result is the product of the p-value's rank in the arrangement and the preset threshold set for layer n, divided by the number of candidate n-grams.
For example, with the p-values of the candidate n-grams arranged in ascending order and p_i denoting the i-th p-value in the arrangement, p_i satisfies the preset condition if p_i ≤ (i/N)*α_n, and the largest p-value satisfying the preset condition is taken as the target p-value, where α_n is the preset threshold set for the layer-n nodes; generally, the larger n is, the larger α_n is.
In step 303, the candidate n-grams whose p-values are smaller than the target p-value are selected as the n-grams represented by the layer-n nodes.
In the method for estimating word-segment frequency in differential-privacy-protected data provided by the above embodiment of this specification, the p-values of the candidate n-grams are arranged in ascending order, the largest p-value satisfying the preset condition is selected as the target p-value, and the candidate n-grams whose p-values are smaller than the target p-value are selected as the n-grams represented by the layer-n nodes, which further makes the selection of the layer-n n-grams based on the p-values reasonable.
In some optional implementations, the above method may further include: using each node in layer n to record the variance and p-value of the n-gram it represents.
It should be noted that although the operations of the methods of the embodiments of this specification are described above in a specific order, this does not require or imply that the operations must be performed in that order, or that all illustrated operations must be performed, to achieve the desired result. On the contrary, the steps depicted in the flowcharts may be executed in a different order; additionally or alternatively, some steps may be omitted, multiple steps may be combined into one, and/or one step may be decomposed into multiple steps.
The solution of one or more embodiments of this specification is illustrated below with a complete application example.
The application scenario may be: the preset dictionary contains the word units A, B, C, and D, and the server needs to estimate, based on the preset dictionary, the frequencies of 1-grams, 2-grams, and 3-grams in the differential-privacy-protected data.
Specifically, the server first obtains the pieces of word-segment information reported by terminal devices after local differential privacy processing, and divides out a first, second, and third group of word-segment information, where every piece in the first group corresponds to a target number of 1, every piece in the second group to 2, and every piece in the third group to 3.
Next, the first group of estimation data representing an unbiased frequency estimate is determined from the first group of word-segment information, the second group of estimation data from the second group, and the third group of estimation data from the third group.
Then, the nodes of each layer of the prefix tree used to record word-segment frequency are generated layer by layer. As shown in FIG. 4, the root node of the prefix tree can be generated first as the layer-0 node, which represents an empty string. Next, the word units A, B, C, and D in the preset dictionary are determined as four candidate 1-grams. The frequency significance distribution information corresponding to each candidate 1-gram is computed, and based on it, A, B, C, and D are selected as the 1-grams represented by the layer-1 nodes. Child node a of the root node is constructed, representing the 1-gram A and recording its frequency; child node b represents the 1-gram B and records its frequency; child node c represents the 1-gram C and records its frequency; child node d represents the 1-gram D and records its frequency. Nodes a, b, c, and d are the layer-1 nodes of the prefix tree.
Next, the 1-grams A, B, C, and D represented by the layer-1 nodes a, b, c, and d are obtained and used as prefixes with the dictionary's word units A, B, C, and D to form the 2-grams AA, AB, AC, AD, BA, BB, BC, BD, CA, CB, CC, CD, DA, DB, DC, and DD as candidate 2-grams. The frequency significance distribution information corresponding to each candidate 2-gram is computed, and based on it, AB, AC, BC, and BD are selected as the 2-grams represented by the layer-2 nodes. Child nodes ab and ac of node a, and child nodes bc and bd of node b, are constructed: node ab represents the 2-gram AB and records its frequency; node ac represents the 2-gram AC and records its frequency; node bc represents the 2-gram BC and records its frequency; node bd represents the 2-gram BD and records its frequency. Nodes ab, ac, bc, and bd are the layer-2 nodes of the prefix tree.
Finally, the 2-grams AB, AC, BC, and BD represented by the layer-2 nodes ab, ac, bc, and bd are obtained and used as prefixes with the dictionary's word units A, B, C, and D to form the 3-grams ABA, ABB, ABC, ABD, ACA, ACB, ACC, ACD, BCA, BCB, BCC, BCD, BDA, BDB, BDC, and BDD as candidate 3-grams. The frequency significance distribution information corresponding to each candidate 3-gram is computed, and based on it, ABA, ABB, ACC, ACD, BDC, and BDD are selected as the 3-grams represented by the layer-3 nodes. Child nodes aba and abb of node ab, child nodes acc and acd of node ac, and child nodes bdc and bdd of node bd are constructed: node aba represents the 3-gram ABA and records its frequency; node abb represents the 3-gram ABB and records its frequency; node acc represents the 3-gram ACC and records its frequency; node acd represents the 3-gram ACD and records its frequency; node bdc represents the 3-gram BDC and records its frequency; node bdd represents the 3-gram BDD and records its frequency. Nodes aba, abb, acc, acd, bdc, and bdd are the layer-3 nodes of the prefix tree.
It can be seen that applying the above solution generates the nodes of each layer of the prefix tree used to record word-segment frequency layer by layer. In the process of generating the layer-n nodes, some candidate n-grams can be selected based on their frequency significance distribution information as the n-grams represented by the layer-n nodes, without traversing all n-grams composed of the preset word units; this not only greatly reduces the amount of computation and improves computational efficiency, but also makes the selected n-grams more reasonable.
Corresponding to the foregoing embodiments of the method for estimating word-segment frequency in differential-privacy-protected data, this specification also provides embodiments of an apparatus for estimating word-segment frequency in differential-privacy-protected data.
As shown in FIG. 5, the apparatus of FIG. 5 is applied to a server and may include: an obtaining module 501, a grouping module 502, a determining module 503, and a generation module 504.
The obtaining module 501 is configured to obtain the pieces of word-segment information reported by terminal devices after local differential privacy processing. Each piece corresponds to one word segment and includes a target number indicating how many word units the segment contains, the target number being less than or equal to a preset value N.
The grouping module 502 is configured to divide out N groups of word-segment information so that all pieces in a group correspond to the same target number.
The determining module 503 is configured to determine, for each group, the estimation data representing an unbiased estimate of word-segment frequency.
The generation module 504 is configured to generate, layer by layer and based on the groups of estimation data, the nodes of each layer of the prefix tree used to record word-segment frequency. The generation module 504 generates the layer-n nodes as follows: obtain the (n-1)-grams represented by the nodes of layer n-1, where the (n-1)-gram represented by any layer-(n-1) node is formed by arranging in order the word units on the path from the root node to that node; based on these (n-1)-grams, determine multiple candidate n-grams for the layer-n nodes; based on the n-th group of estimation data, whose target number is n, compute the frequency significance distribution information of the candidate n-grams; based on this information, select several candidate n-grams as the n-grams represented by the layer-n nodes, and use each layer-n node to record the frequency of the n-gram it represents; 1 ≤ n ≤ N.
In one implementation, the root node of the prefix tree is the layer-0 node, which represents an empty string.
In another implementation, the generation module 504 determines the multiple candidate n-grams for the layer-n nodes from the (n-1)-grams as follows: the n-grams formed by using each (n-1)-gram as a prefix and appending each preset word unit in the preset dictionary are determined as the multiple candidate n-grams.
In another implementation, the generation module 504 computes the frequency significance distribution information of the candidate n-grams from the n-th group of estimation data as follows: based on the n-th group of estimation data, compute the frequency of each candidate n-gram; based on the frequencies, compute the variance corresponding to each candidate n-gram; based on the variances, compute the frequency significance distribution information of the candidate n-grams.
In another implementation, the generation module 504 computes the frequency significance distribution information from the variances as follows: based on the variances, compute the z-value corresponding to each candidate n-gram; based on the z-values, compute the p-value corresponding to each candidate n-gram as the frequency significance distribution information. The generation module 504 then selects several candidate n-grams based on the frequency significance distribution information as follows: based on the p-values, select several candidate n-grams as the n-grams represented by the layer-n nodes.
In another implementation, the generation module 504 selects several candidate n-grams from the p-values as follows: arrange the p-values in ascending order and select the largest p-value satisfying the preset condition as the target p-value. A p-value satisfies the preset condition if it is less than or equal to its target result, the target result being the product of the p-value's rank in the arrangement and the preset threshold set for layer n, divided by the number of candidate n-grams. The candidate n-grams whose p-values are smaller than the target p-value are selected as the n-grams represented by the layer-n nodes.
In another implementation, the generation module 504 is further configured to: use each node in layer n to record the variance and p-value of the n-gram it represents.
In another implementation, any piece of word-segment information may further include a target vector representing the word segment, the target vector having undergone local differential privacy processing.
In another implementation, the target vector representing the word segment is obtained as follows: select one hash function from multiple preset hash functions as the target hash function, compute the segment's target hash value with the target hash function, and determine the target vector from the target hash value in a way that satisfies differential privacy.
It should be understood that the above apparatus may be preset in the server or loaded into the server by downloading or similar means. The corresponding modules of the apparatus can cooperate with modules in the server to implement the scheme for estimating word-segment frequency in differential-privacy-protected data.
For the apparatus embodiments, since they basically correspond to the method embodiments, reference may be made to the partial descriptions of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of one or more embodiments of this specification, and a person of ordinary skill in the art can understand and implement them without creative effort.
One or more embodiments of this specification also provide a computer-readable storage medium storing a computer program that can be used to execute the method for estimating word-segment frequency in differential-privacy-protected data provided by any of the embodiments of FIG. 2 to FIG. 3 above.
Corresponding to the above method for estimating word-segment frequency in differential-privacy-protected data, one or more embodiments of this specification also propose the schematic structural diagram of an electronic device, shown in FIG. 6, according to an exemplary embodiment of this specification. Referring to FIG. 6, at the hardware level the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and runs it, forming, at the logical level, an apparatus for estimating word-segment frequency in differential-privacy-protected data. Of course, besides software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units and may also be hardware or a logic device.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are basically similar to the method embodiments, they are described relatively simply, and reference may be made to the partial descriptions of the method embodiments for relevant details.
Specific embodiments of this specification are described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired result. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired result. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
A person of ordinary skill in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are executed in hardware or in software depends on the particular application and design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The specific implementations described above further explain the purpose, technical solutions, and beneficial effects of this application in detail. It should be understood that the above are merely specific implementations of this application and are not intended to limit its scope of protection; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall be included within its scope of protection.

Claims (20)

  1. A method for estimating word-segment frequency in differential-privacy-protected data, applied to a server, the method comprising:
    obtaining pieces of word-segment information reported by terminal devices after local differential privacy processing, wherein each piece corresponds to one word segment and includes a target number indicating the number of word units contained in the segment, the target number being less than or equal to a preset value N;
    dividing out N groups of word-segment information so that all pieces in a group correspond to the same target number;
    determining, for each group, estimation data representing an unbiased estimate of word-segment frequency;
    generating, layer by layer and based on the groups of estimation data, nodes of each layer of a prefix tree used to record word-segment frequency, wherein generating layer-n nodes comprises: obtaining (n-1)-grams represented by nodes of layer n-1, the (n-1)-gram represented by any layer-(n-1) node being formed by arranging in order the word units on the path from the root node to that node; determining, based on the (n-1)-grams, multiple candidate n-grams for the layer-n nodes; computing frequency significance distribution information of the candidate n-grams based on the n-th group of estimation data, whose target number is n; and selecting, based on the frequency significance distribution information, several candidate n-grams as the n-grams represented by the layer-n nodes, and using each layer-n node to record the frequency of the n-gram it represents; 1 ≤ n ≤ N.
  2. The method according to claim 1, wherein the root node of the prefix tree is the layer-0 node, and the layer-0 node represents an empty string.
  3. The method according to claim 1, wherein determining, based on the (n-1)-grams, multiple candidate n-grams for the layer-n nodes comprises:
    determining, as the multiple candidate n-grams, the n-grams formed by using each (n-1)-gram as a prefix and appending each preset word unit in a preset dictionary.
  4. The method according to claim 1, wherein computing frequency significance distribution information of the candidate n-grams based on the n-th group of estimation data comprises:
    computing the frequency of each candidate n-gram based on the n-th group of estimation data;
    computing, based on the frequencies, the variance corresponding to each candidate n-gram;
    computing, based on the variances, the frequency significance distribution information of the candidate n-grams.
  5. The method according to claim 4, wherein computing the frequency significance distribution information based on the variances comprises:
    computing, based on the variances, the z-value corresponding to each candidate n-gram;
    computing, based on the z-values, the p-value corresponding to each candidate n-gram as the frequency significance distribution information of the candidate n-grams;
    wherein selecting, based on the frequency significance distribution information, several candidate n-grams as the n-grams represented by the layer-n nodes comprises:
    selecting, based on the p-values, several candidate n-grams as the n-grams represented by the layer-n nodes.
  6. The method according to claim 5, wherein selecting, based on the p-values, several candidate n-grams as the n-grams represented by the layer-n nodes comprises:
    arranging the p-values in ascending order;
    selecting the largest p-value that satisfies a preset condition as the target p-value, wherein a p-value satisfies the preset condition if it is less than or equal to its target result, the target result being the product of the p-value's rank in the arrangement and the preset threshold set for layer n, divided by the number of candidate n-grams;
    selecting the candidate n-grams corresponding to the p-values smaller than the target p-value as the n-grams represented by the layer-n nodes.
  7. The method according to any one of claims 4-6, wherein the method further comprises:
    using each node in layer n to record the variance and p-value of the n-gram it represents.
  8. The method according to claim 1, wherein each piece of word-segment information further includes a target vector representing the word segment, the target vector having undergone local differential privacy processing.
  9. The method according to claim 8, wherein the target vector representing the word segment is obtained by:
    selecting one hash function from multiple preset hash functions as a target hash function;
    computing a target hash value of the word segment with the target hash function;
    determining the target vector from the target hash value in a way that satisfies differential privacy.
  10. 一种估计差分隐私保护数据中分词频度的装置,应用于服务器,所述装置包括:
    获取模块,用于获取终端设备上报的、经本地差分隐私处理的各个分词信息;任一分词信息对应于一个分词,并包括表示该分词中包含的词语单元数量的目标个数,该目标个数小于等于预设值N;
    分组模块,用于划分出N组分词信息,使同组的各个分词信息对应于相同的目标个数;
    确定模块,用于确定各组分词信息各自对应的表示分词频度无偏估计的各组估计数据;
    生成模块,用于基于所述各组估计数据,逐层生成用于记录分词频度的前缀树的各层节点,其中,所述生成模块通过如下方式生成第n层节点:获取第n-1层中的各个节点各自表示的各个n-1元分词,第n-1层中的任一节点表示的n-1元分词通过将从根节 点到该节点所对应的词语单元顺序排列而形成;基于所述各个n-1元分词,确定用于第n层节点的多个备选n元分词;基于对应于目标个数为n的第n组估计数据,计算所述备选n元分词的频度显著性分布信息;基于所述频度显著性分布信息,选择若干备选n元分词作为第n层节点表示的n元分词,并利用第n层中的各个节点记录各自表示的各个n元分词的频度;1≤n≤N。
  11. The apparatus according to claim 10, wherein the root node of the prefix tree is a 0th-layer node, and the 0th-layer node represents an empty character.
  12. The apparatus according to claim 10, wherein the generating module determines, based on the respective (n-1)-grams, the plurality of candidate n-grams for the nth-layer nodes by:
    taking the respective (n-1)-grams as prefixes, and determining, as the plurality of candidate n-grams, the n-grams formed by combining each prefix with each preset word unit in a preset dictionary.
  13. The apparatus according to claim 10, wherein the generating module computes the frequency significance distribution information of the candidate n-grams, based on the nth group of estimation data whose corresponding target number is n, by:
    computing frequencies of the respective candidate n-grams based on the nth group of estimation data;
    computing, based on the frequencies, variances corresponding to the respective candidate n-grams; and
    computing the frequency significance distribution information of the candidate n-grams based on the variances.
  14. The apparatus according to claim 13, wherein the generating module computes the frequency significance distribution information of the candidate n-grams, based on the variances, by:
    computing, based on the variances, z-values corresponding to the respective candidate n-grams; and
    computing, based on the z-values, p-values corresponding to the respective candidate n-grams, as the frequency significance distribution information of the candidate n-grams;
    wherein the generating module selects, based on the frequency significance distribution information, several candidate n-grams as the n-grams represented by the nth-layer nodes by:
    selecting, based on the p-values, several candidate n-grams as the n-grams represented by the nth-layer nodes.
  15. The apparatus according to claim 14, wherein the generating module selects, based on the p-values, several candidate n-grams as the n-grams represented by the nth-layer nodes by:
    arranging the p-values in ascending order;
    selecting, as a target p-value, the largest p-value satisfying a preset condition, wherein any p-value satisfying the preset condition is less than or equal to a target result corresponding to that p-value, the target result being the product of the rank of that p-value in the arrangement and a preset threshold set for the nth layer, divided by the number of candidate n-grams; and
    selecting the candidate n-grams corresponding to the p-values smaller than the target p-value, as the n-grams represented by the nth-layer nodes.
  16. The apparatus according to any one of claims 13-15, wherein the generating module is further configured to:
    use the nodes in the nth layer to record the variances and p-values of the respective n-grams they represent.
  17. The apparatus according to claim 10, wherein said any piece of word segment information further comprises a target vector representing the word segment, the target vector having undergone local differential privacy processing.
  18. The apparatus according to claim 17, wherein the target vector representing the word segment is obtained by:
    selecting one hash function from a plurality of preset hash functions as a target hash function;
    computing a target hash value of the word segment using the target hash function; and
    determining the target vector based on the target hash value in a manner satisfying differential privacy.
  19. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
  20. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1-9.
PCT/CN2022/073677 2021-02-05 2022-01-25 Method and apparatus for estimating word segment frequency in differential privacy protection data WO2022166676A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/275,995 US20240104304A1 (en) 2021-02-05 2022-01-25 Methods and apparatuses for estimating word segment frequency in differential privacy protection data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110161186.9A CN112507710B (zh) 2021-02-05 2021-02-05 Method and apparatus for estimating word segment frequency in differential privacy protection data
CN202110161186.9 2021-02-05

Publications (1)

Publication Number Publication Date
WO2022166676A1 (zh)

Family ID: 74952724

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/073677 WO2022166676A1 (zh) 2021-02-05 2022-01-25 Method and apparatus for estimating word segment frequency in differential privacy protection data

Country Status (3)

Country Link
US (1) US20240104304A1 (zh)
CN (1) CN112507710B (zh)
WO (1) WO2022166676A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507710B (zh) * 2021-02-05 2021-05-25 Alipay (Hangzhou) Information Technology Co., Ltd. Method and apparatus for estimating word segment frequency in differential privacy protection data

Citations (4)

Publication number Priority date Publication date Assignee Title
CN108280366A (zh) * 2018-01-17 2018-07-13 University of Shanghai for Science and Technology A batch linear query method based on differential privacy
CN109829320A (zh) * 2019-01-14 2019-05-31 Zhuhai Tianyan Technology Co., Ltd. Information processing method and apparatus
US10878174B1 (en) * 2020-06-24 2020-12-29 Starmind Ag Advanced text tagging using key phrase extraction and key phrase generation
CN112507710A (zh) 2021-02-05 2021-03-16 Alipay (Hangzhou) Information Technology Co., Ltd. Method and apparatus for estimating word segment frequency in differential privacy protection data

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN110727958B (zh) * 2019-10-15 2023-04-28 Nanjing University of Aeronautics and Astronautics A prefix-tree-based differential privacy trajectory data protection method


Also Published As

Publication number Publication date
CN112507710A (zh) 2021-03-16
US20240104304A1 (en) 2024-03-28
CN112507710B (zh) 2021-05-25

Similar Documents

Publication Publication Date Title
JP7241862B2 (ja) Rejection of biased data using a machine learning model
Wang et al. Penalized generalized estimating equations for high-dimensional longitudinal data analysis
TWI658420B (zh) Collaborative filtering method, apparatus, server, and computer-readable storage medium incorporating the time factor
Fan et al. Asymptotic equivalence of regularization methods in thresholded parameter space
WO2018059302A1 (zh) Text recognition method and apparatus, and storage medium
WO2019238125A1 (zh) Information processing method, related device, and computer storage medium
WO2022166676A1 (zh) Method and apparatus for estimating word segment frequency in differential privacy protection data
US11782991B2 (en) Accelerated large-scale similarity calculation
US20220270299A1 (en) Enabling secure video sharing by exploiting data sparsity
Tian et al. Variable selection in the high-dimensional continuous generalized linear model with current status data
CN113409827B (zh) Voice endpoint detection method and system based on a local convolutional block attention network
KR102339723B1 (ko) Soft-information-based decoding method, program, and apparatus for DNA storage devices
US20210357955A1 (en) User search category predictor
Zhou et al. A novel locality-sensitive hashing algorithm for similarity searches on large-scale hyperspectral data
CN108470181B (zh) A Web service replacement method based on weighted sequence relations
Lai et al. Estimation and variable selection for generalised partially linear single-index models
US20160275169A1 (en) System and method of generating initial cluster centroids
US20210056586A1 (en) Optimizing large scale data analysis
JP2010108488A (ja) Aggregation system, aggregation processing apparatus, information provider terminal, aggregation method, and program
CN115578583B (zh) Image processing method and apparatus, electronic device, and storage medium
CN114239603A (zh) Service requirement matching method and apparatus, computer device, and storage medium
CN117319475A (zh) Communication resource recommendation method and apparatus, computer device, and storage medium
CN117312892A (zh) User clustering method and apparatus, computer device, and storage medium
Chen et al. Particle swarm stepwise (PaSS) algorithm for information criteria-based variable selections
CN115017982A (zh) Unsupervised text clustering method and apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22748955; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 18275995; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
WWE Wipo information: entry into national phase (Ref document number: 11202305915S; Country of ref document: SG)
122 Ep: pct application non-entry in european phase (Ref document number: 22748955; Country of ref document: EP; Kind code of ref document: A1)