CN105447525A

CN105447525A - Data prediction classification method and device

Info

Publication number: CN105447525A
Application number: CN201510932807.3A
Authority: CN
Inventors: 丁丽萍; 穆海蓉; 宋宇宁
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2015-12-15
Filing date: 2015-12-15
Publication date: 2016-03-30

Abstract

The invention discloses a data prediction classification method and device in the field of data processing, addressing the problem in the prior art that classification results and classification count values may leak users' private information. The method comprises: building a random forest, i.e., multiple decision trees, from a training data set; and using the decision trees in the random forest to perform prediction classification on a test data set, obtaining classification results that satisfy differential privacy. The invention enables high-accuracy prediction classification of high-dimensional, large-scale data.

Description

A data prediction classification method and device

Technical field

The present invention relates to the technical field of data processing, and in particular to a data prediction classification method and device.

Background technology

Classification is an important class of data mining methods. Its purpose is to find a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class label of an object. The typical classification model is the decision tree, a tree-structured model in which each internal node represents a test on an attribute and each leaf node represents a class. However, both the classification results themselves and the classification count values may reveal users' private information. Traditional privacy protection for decision tree classification mostly relies on data perturbation, such as adding random noise or k-anonymity, or on encrypting the raw data and intermediate computation results. When an attacker possesses certain background knowledge, however, these traditional methods are vulnerable: the attacker can use re-identification attacks, background knowledge attacks, and similar methods to determine users' private information. In addition, traditional privacy protection models cannot quantitatively analyze their level of privacy protection.

Differential privacy, a newer privacy protection model, addresses two major defects of traditional privacy protection models: (1) it defines a very strict attack model that does not depend on how much background knowledge the attacker has; even if the attacker has obtained the information of every record except one, the private information of that record still cannot be disclosed; (2) it gives a rigorous definition of, and a quantitative evaluation method for, the privacy protection level.

Because the decision tree is the simplest classification model, decision tree classification under differential privacy has been studied fairly extensively. Representative existing methods that combine differential privacy with decision trees include SuLQ-based ID3, DiffP-C4.5, and DiffGen. Although these have made gradual progress in classification accuracy and practical applicability, they must compute the entropy of every attribute each time a splitting attribute is selected. When the number of classification attributes is very large, selection based on the exponential mechanism becomes very inefficient and may exhaust the privacy budget, so existing methods cannot be applied effectively to classification involving large numbers of queries and high-dimensional attributes. Researchers and research institutions at home and abroad have also studied decision tree boosting algorithms under differential privacy, but these generally cannot handle high-dimensional continuous attributes efficiently during classification.

Summary of the invention

The invention provides a data prediction classification method and device that can solve the problem that classification results and classification count values may reveal users' private information.

In a first aspect, the invention provides a data prediction classification method, comprising:

building a random forest, i.e., multiple decision trees, from a training data set;

using the decision trees in the built random forest to perform prediction classification on a test data set, obtaining classification results that satisfy differential privacy.

In a second aspect, the invention provides a data prediction classification device, comprising:

a building unit, configured to build a random forest, i.e., multiple decision trees, from a training data set;

a prediction classification unit, configured to use the decision trees in the built random forest to perform prediction classification on a test data set, obtaining classification results that satisfy differential privacy.

The data prediction classification method and device provided by the invention build a random forest from a training data set and use the decision trees in the built random forest to perform prediction classification on a test data set, obtaining classification results that satisfy differential privacy. Compared with the prior art, because a random forest performs well on large data sets, can handle very high-dimensional data, and trains quickly, high-accuracy prediction classification of high-dimensional, large-scale data can be achieved.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the data prediction classification method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the data prediction classification device provided by an embodiment of the present invention.

Embodiment

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Some techniques used in the embodiments of the present invention are introduced first.

Differential privacy is a privacy protection technique based on data distortion. By adding noise to query or analysis results, it ensures that inserting or deleting a single record in the data set does not significantly affect the output of any query, thereby protecting privacy. Differential privacy is formally defined as follows:

ε-differential privacy: let D₁ and D₂ be any two adjacent data sets that differ in at most one record, let K be a given privacy algorithm, and let Range(K) denote the range of K. Algorithm K provides ε-differential privacy if, for all S ⊆ Range(K),

\Pr[K(D_1) \in S] \le \exp(\varepsilon) \cdot \Pr[K(D_2) \in S]

where the probability Pr[·] represents the risk of privacy disclosure and the privacy budget ε represents the privacy protection level: the smaller ε is, the higher the level of protection.

The Laplace mechanism is one of the main techniques for achieving differential privacy. The amount of noise it adds is closely related to the global sensitivity.

Global sensitivity: for any function f: D → R^d, the global sensitivity of f is defined as:

\Delta f = \max_{D_1, D_2} \| f(D_1) - f(D_2) \|_1

where D₁ and D₂ are adjacent data sets, d is the query dimension of f, and R is the real-number space into which f maps.
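As a small illustration of the sensitivity definition, the sketch below checks a class-count query on one toy data set. The helper names (`class_histogram`, `empirical_sensitivity`), the toy records, and the choice of "remove one record" as the adjacency relation are assumptions made for this example. The check only measures the change observed on a single data set; it does not prove the global bound.

```python
from collections import Counter


def class_histogram(records, classes):
    """Query f: return the number of records carrying each class label."""
    counts = Counter(label for _, label in records)
    return [counts.get(c, 0) for c in classes]


def empirical_sensitivity(records, classes):
    """Largest L1 change of f when any single record is removed.

    This inspects only neighbours of one data set, so it illustrates
    the definition rather than computing the true global maximum.
    """
    full = class_histogram(records, classes)
    worst = 0
    for i in range(len(records)):
        neighbour = records[:i] + records[i + 1:]
        hist = class_histogram(neighbour, classes)
        worst = max(worst, sum(abs(a - b) for a, b in zip(full, hist)))
    return worst


# Hypothetical toy data: (feature tuple, class label)
records = [((1.0,), "yes"), ((2.0,), "no"), ((3.0,), "yes")]
print(empirical_sensitivity(records, ["yes", "no"]))  # 1
```

Removing one record changes exactly one class count by one, which is why count queries are usually treated as having sensitivity 1 under this adjacency relation.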

Laplace mechanism: for any function f: D → R^d, if the output of algorithm K satisfies the following equation, then K satisfies ε-differential privacy.

K(D) = f(D) + \langle \mathrm{Lap}_1(\Delta f / \varepsilon), \ldots, \mathrm{Lap}_d(\Delta f / \varepsilon) \rangle

where Lap_i(Δf/ε) (1 ≤ i ≤ d) are mutually independent Laplace variables with scale b = Δf/ε, whose probability density function is

p(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)

The noise level is proportional to Δf and inversely proportional to ε: the larger the global sensitivity of f and the smaller the privacy budget ε, the more noise is added. The Laplace mechanism mainly applies to algorithms whose outputs are real-valued.
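The Laplace mechanism described above can be sketched in a few lines using only the Python standard library. The function names and the example counts are assumptions for illustration; the sampling relies on the fact that the difference of two independent exponential variables with mean b follows a Laplace distribution with scale b.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean
    `scale` is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Add independent Laplace(sensitivity / epsilon) noise to each coordinate."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


# Hypothetical usage: two class counts, sensitivity 1, privacy budget 0.5
rng = random.Random(42)
noisy_counts = laplace_mechanism([120.0, 80.0], sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy_counts)
```

Halving ε doubles the scale b, so the same query answered under a smaller budget comes back noisier.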

Exponential mechanism: let a randomized algorithm M take data set D as input and output an entity object r ∈ Range; let q(D, r) be the utility function and Δq the sensitivity of q(D, r). If algorithm M selects and outputs r from Range with probability proportional to exp(ε·q(D, r) / (2Δq)), then M provides ε-differential privacy.
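A minimal sketch of the exponential mechanism follows. It assumes the commonly used weighting exp(ε·q(D, r) / (2Δq)); the function name, the candidate list, and the toy scores are hypothetical. Subtracting the maximum score before exponentiating leaves the selection probabilities unchanged and avoids floating-point overflow.

```python
import math
import random


def exponential_mechanism(candidates, score, sensitivity, epsilon, rng):
    """Pick one candidate with probability proportional to
    exp(epsilon * score(candidate) / (2 * sensitivity))."""
    scores = [score(c) for c in candidates]
    top = max(scores)
    # Shifting by the maximum score only rescales all weights by a constant.
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical usage: choose among three attributes with made-up utility scores
utility = {"age": 0.0, "income": 1.0, "city": 5.0}
rng = random.Random(7)
chosen = exponential_mechanism(list(utility), utility.get, sensitivity=1.0, epsilon=1.0, rng=rng)
print(chosen)
```

With a small ε the choice is close to uniform; with a large ε the highest-scoring candidate is selected almost always.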

Random forest: a classifier that uses many trees to train on and predict samples. A random forest consists of many decision trees, and its output class is the mode of the classes output by the individual trees.

The training process can be summarized as follows:

(a) Given a training set S, a test set T, and an attribute set F.

Determine the parameters: the number t of decision trees to generate, the depth d of each tree, and the number f of attributes used at each node.

Termination condition: all records at a node have the same class attribute, or the maximum depth d is reached.

(b) Randomly draw from S, with replacement, a training set S(i) of size |S| as the sample for the root node, and start training from the root node.

(c) If the current node meets the termination condition, mark it as a leaf node and continue training the other nodes. Otherwise, randomly select f attributes, without replacement, from the F attributes. Using these f attributes, find the attribute k with the best classification performance and its set of classification results, and divide the samples at the current node into child nodes according to their values of attribute k. Continue training the other nodes.

(d) Repeat (b) and (c) until all nodes have been trained or marked as leaf nodes.

(e) Repeat (b), (c), and (d) until all decision trees have been trained.
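The two random draws in steps (b) and (c) can be sketched as follows. The function names, the toy record list, and the attribute names are assumptions for illustration only.

```python
import random


def bootstrap_sample(records, rng):
    """Step (b): draw |S| records from S with replacement."""
    return [rng.choice(records) for _ in range(len(records))]


def pick_candidate_attributes(attributes, f, rng):
    """Step (c): choose f distinct attributes (sampling without replacement)."""
    return rng.sample(list(attributes), f)


# Hypothetical usage with ten record ids and four attribute names
rng = random.Random(0)
S = list(range(10))
F = ["age", "income", "city", "job"]
sample = bootstrap_sample(S, rng)
candidates = pick_candidate_attributes(F, 2, rng)
print(sample, candidates)
```

Because the bootstrap draw is with replacement, some records usually appear more than once in a tree's sample while others are left out, whereas the attribute draw never repeats an attribute at a node.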

The prediction process using the random forest is as follows:

For the k-th tree:

(a) Starting from the root node of the current tree, decide which child node to enter according to the classification result set of the current node, until a leaf node is reached, and output the predicted value.

(b) Repeat (a) until all t trees have output their predicted values. For classification problems, the output is the class with the largest sum of predicted probabilities over all trees, i.e., the probabilities p of each class c(i) are accumulated.
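The prediction steps can be sketched as below. The nested-dictionary tree layout (`attribute`, `children`, `probs`) is a hypothetical representation chosen for this example, not a structure defined in the text.

```python
def predict_tree(node, record):
    """Step (a): walk from the root to a leaf and return its class probabilities."""
    while "children" in node:
        node = node["children"][record[node["attribute"]]]
    return node["probs"]


def predict_forest(trees, record):
    """Step (b): sum the class probabilities over all trees and return the top class."""
    totals = {}
    for tree in trees:
        for label, p in predict_tree(tree, record).items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=totals.get)


# Hypothetical forest: one tree that splits on "color" and one single-leaf tree
tree_1 = {
    "attribute": "color",
    "children": {
        "red": {"probs": {"A": 0.9, "B": 0.1}},
        "blue": {"probs": {"A": 0.2, "B": 0.8}},
    },
}
tree_2 = {"probs": {"A": 0.4, "B": 0.6}}

print(predict_forest([tree_1, tree_2], {"color": "red"}))   # A
print(predict_forest([tree_1, tree_2], {"color": "blue"}))  # B
```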

An embodiment of the present invention provides a data prediction classification method. As shown in Fig. 1, the method comprises:

S11. Build a random forest, i.e., multiple decision trees, from the training data set.

Input: training data set S, attribute set F, class attribute set C, privacy budget B, number t of decision trees generated in the random forest, depth d of each tree

Output: a random forest satisfying ε-differential privacy

Termination condition: all records at a node have the same class attribute, or the maximum depth d is reached

First, the privacy budget B is divided equally among the t trees, according to the number set in the parameters. Each decision tree is then generated recursively according to the same rule. The strategy for generating a decision tree is as follows:

Randomly draw from S a training set S(i) of size |S|. The privacy budget of each tree is divided equally among its layers (including the leaf layer), and the budget of each layer is split into two halves: one half is used to estimate the instance count, and the other half is used to estimate the class counts (at leaf nodes) or to evaluate attributes (at other nodes). The function that generates the decision tree is then called recursively. First, the Laplace mechanism is used to add noise to the instance count of the current node. Next, the termination condition is checked. If it is met, the leaf node is labeled with a class, with the Laplace mechanism applied to add noise to the class counts. If it is not met, f attributes are first selected at random from the attribute set F (f is generally taken as the square root of the total number of attributes). If the selected attributes include continuous attributes, part of the privacy budget must first be allocated to each continuous attribute in order to select its split point; the splitting attribute is then selected from all the attributes. Both split points and the splitting attribute are selected with the exponential mechanism. The scoring function q(S(i), F) of the exponential mechanism in this method may be either information gain or the maximum class frequency sum, whose sensitivities Δq are log₂|C| and 1 respectively, where |C| is the size of the class attribute set. A decision tree satisfying ε-differential privacy is finally generated by the above procedure.
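The budget split and the noisy leaf labeling described in this step can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Table 1: the function names are hypothetical, a tree of depth d is assumed to have d + 1 layers (the leaf layer included), and class counts are assumed to have sensitivity 1.

```python
import random


def allocate_budget(total_budget, num_trees, max_depth):
    """Split B equally over t trees, then over layers, then into two halves."""
    if total_budget <= 0 or num_trees < 1 or max_depth < 0:
        raise ValueError("invalid budget parameters")
    per_tree = total_budget / num_trees
    per_layer = per_tree / (max_depth + 1)  # assumption: layers 0..d inclusive
    per_query = per_layer / 2.0             # half for instance count, half for class counts or attribute choice
    return per_tree, per_layer, per_query


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_leaf_label(class_counts, epsilon, rng):
    """Label a leaf with the class whose Laplace-noised count is largest.

    Assumes each class count has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    noisy = {label: n + laplace_noise(1.0 / epsilon, rng) for label, n in class_counts.items()}
    return max(noisy, key=noisy.get)


# Hypothetical usage: B = 1.0, t = 10 trees, depth d = 4
per_tree, per_layer, per_query = allocate_budget(1.0, 10, 4)
rng = random.Random(5)
label = noisy_leaf_label({"yes": 1000, "no": 0}, per_query, rng)
print(per_tree, per_layer, per_query, label)
```

With these example values each query receives a budget of about 0.01, so the Laplace scale at a leaf is roughly 100. This shows why the number of trees and the depth must be chosen with the total budget in mind.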

The generated decision trees together form a random forest satisfying ε-differential privacy. Because the training sample of each tree is chosen randomly and the candidate attributes at each node are also chosen randomly, the random forest is not prone to overfitting, so pruning is not needed. The number of attributes considered at each node is generally the square root of the total number of attributes, which also mitigates, to some extent, the problems caused by high dimensionality.

The process of building a random forest under differential privacy is described in Table 1.

Table 1

S12. Use the decision trees in the built random forest to perform prediction classification on the test data set, obtaining classification results that satisfy differential privacy.

Input: test set T, class attribute set C, the set of trees

Output: the classification result of each record in the test set

For each record in the test set, every tree in the forest is applied to predict its class. At each node, the classification result set of the current node determines which child node the record enters, until a leaf node is reached; the current leaf node yields a predicted value C_b(x). From the predictions of all trees in the forest, the classification result with the highest probability among all predictions is obtained, and the classification results of all records are then output.

The process of classifying the test data set with the random forest built in Table 1 is described in Table 2.

Table 2

The data prediction classification method provided by the embodiment of the present invention builds a random forest from a training data set and uses the decision trees in the built random forest to perform prediction classification on a test data set, obtaining classification results that satisfy differential privacy. Compared with the prior art, because a random forest performs well on large data sets, can handle very high-dimensional data, and trains quickly, high-accuracy prediction classification of high-dimensional, large-scale data can be achieved.

An embodiment of the present invention also provides a data prediction classification device. As shown in Fig. 2, the device comprises:

A building unit 11, configured to build a random forest, i.e., multiple decision trees, from a training data set;

A prediction classification unit 12, configured to use the decision trees in the built random forest to perform prediction classification on a test data set, obtaining classification results that satisfy differential privacy.

Further, the building unit 11 is also configured to divide the privacy budget B equally among t trees, where t is the number of decision trees generated in the random forest, and to generate each decision tree recursively according to the same rule.

Further, the building unit 11 is also configured to: randomly draw from the training data set S a training set S(i) of size |S|; divide the privacy budget of each tree equally among its layers, and split the budget of each layer into two halves, one half used to estimate the instance count and the other half used to estimate the class counts or evaluate attributes; recursively call the function that generates the decision tree; first use the Laplace mechanism to add noise to the instance count of the current node; check whether the termination condition is met; if it is met, label the leaf node with a class, applying the Laplace mechanism to add noise to the class counts; if it is not met, first randomly select f attributes from the attribute set F, where f is taken as the square root of the total number of attributes; if the selected attributes include continuous attributes, first allocate part of the privacy budget to each continuous attribute in order to select its split point, and then select the splitting attribute from all the attributes; and generate a decision tree satisfying ε-differential privacy according to the above process.

Optionally, both split points and the splitting attribute are selected with the exponential mechanism. The scoring function of the exponential mechanism may be either information gain or the maximum class frequency sum, whose sensitivities are log₂|C| and 1 respectively, where |C| is the size of the class attribute set.
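The two scoring functions named above can be sketched as follows. The exact formulas are not spelled out in the text, so this example assumes the textbook definition of information gain and takes the maximum class frequency sum to be the sum, over the values of the candidate attribute, of the largest class count in each partition. The function names and toy records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def partition_counts(records, attribute):
    """Group class counts by the value each record takes on `attribute`."""
    groups = defaultdict(Counter)
    for features, label in records:
        groups[features[attribute]][label] += 1
    return groups


def entropy(counter):
    """Shannon entropy (base 2) of a class-count table."""
    total = sum(counter.values())
    return -sum((n / total) * math.log2(n / total) for n in counter.values() if n > 0)


def information_gain(records, attribute):
    """Entropy of the node minus the weighted entropy of its partitions."""
    overall = Counter(label for _, label in records)
    groups = partition_counts(records, attribute)
    total = len(records)
    remainder = sum(sum(g.values()) / total * entropy(g) for g in groups.values())
    return entropy(overall) - remainder


def max_class_frequency_sum(records, attribute):
    """Sum over partitions of the count of the most frequent class."""
    return sum(max(g.values()) for g in partition_counts(records, attribute).values())


# Hypothetical toy data: "outlook" separates the classes, "windy" does not
records = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rain", "windy": "no"}, "stay"),
    ({"outlook": "rain", "windy": "yes"}, "stay"),
]
print(information_gain(records, "outlook"), max_class_frequency_sum(records, "outlook"))
print(information_gain(records, "windy"), max_class_frequency_sum(records, "windy"))
```

Either score can be passed to an exponential-mechanism selector together with its stated sensitivity (log₂|C| or 1), so that attributes with higher scores are chosen with higher probability.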

Further, the prediction classification unit 12 is configured to: for each record in the test set, apply every tree in the forest to predict its class; at each node, decide which child node the record should enter according to the classification result set of the current node, until a leaf node is reached; obtain a predicted value from the current leaf node; obtain, from the predictions of all trees in the forest, the classification result with the highest probability among all predictions; and output the classification results of all records.

The data prediction classification device provided by the embodiment of the present invention builds a random forest from a training data set and uses the decision trees in the built random forest to perform prediction classification on a test data set, obtaining classification results that satisfy differential privacy. Compared with the prior art, because a random forest performs well on large data sets, can handle very high-dimensional data, and trains quickly, high-accuracy prediction classification of high-dimensional, large-scale data can be achieved.

Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data prediction classification method, characterized by comprising:

building a random forest, i.e., multiple decision trees, from a training data set;

2. The data prediction classification method according to claim 1, characterized in that building the random forest from the training data set comprises:

dividing the privacy budget B equally among t trees, where t is the number of decision trees generated in the random forest;

generating each decision tree recursively according to the same rule.

3. The data prediction classification method according to claim 2, characterized in that generating a decision tree comprises:

randomly drawing from the training data set S a training set S(i) of size |S|;

dividing the privacy budget of each tree equally among its layers, and splitting the budget of each layer into two halves, one half used to estimate the instance count and the other half used to estimate the class counts or evaluate attributes;

recursively calling the function that generates the decision tree;

first using the Laplace mechanism to add noise to the instance count of the current node;

checking whether the termination condition is met; if it is met, labeling the leaf node with a class, applying the Laplace mechanism to add noise to the class counts; if it is not met, first randomly selecting f attributes from the attribute set F, where f is taken as the square root of the total number of attributes; if the selected attributes include continuous attributes, first allocating part of the privacy budget to each continuous attribute in order to select its split point, and then selecting the splitting attribute from all the attributes;

generating a decision tree satisfying ε-differential privacy according to the above process.

4. The data prediction classification method according to claim 3, characterized in that both split points and the splitting attribute are selected with the exponential mechanism, the scoring function of the exponential mechanism is either information gain or the maximum class frequency sum, and the sensitivities of the scoring functions are log₂|C| and 1 respectively, where |C| is the size of the class attribute set.

5. The data prediction classification method according to claim 4, characterized in that using the decision trees in the built random forest to perform prediction classification on the test data set, obtaining classification results that satisfy differential privacy, comprises:

for each record in the test set, applying every tree in the forest to predict its class; at each node, deciding which child node the record should enter according to the classification result set of the current node, until a leaf node is reached; obtaining a predicted value from the current leaf node; obtaining, from the predictions of all trees in the forest, the classification result with the highest probability among all predictions; and outputting the classification results of all records.

6. A data prediction classification device, characterized by comprising:

7. The data prediction classification device according to claim 6, characterized in that the building unit is also configured to divide the privacy budget B equally among t trees, where t is the number of decision trees generated in the random forest, and to generate each decision tree recursively according to the same rule.

8. The data prediction classification device according to claim 7, characterized in that the building unit is also configured to: randomly draw from the training data set S a training set S(i) of size |S|; divide the privacy budget of each tree equally among its layers, and split the budget of each layer into two halves, one half used to estimate the instance count and the other half used to estimate the class counts or evaluate attributes; recursively call the function that generates the decision tree; first use the Laplace mechanism to add noise to the instance count of the current node; check whether the termination condition is met; if it is met, label the leaf node with a class, applying the Laplace mechanism to add noise to the class counts; if it is not met, first randomly select f attributes from the attribute set F, where f is taken as the square root of the total number of attributes; if the selected attributes include continuous attributes, first allocate part of the privacy budget to each continuous attribute in order to select its split point, and then select the splitting attribute from all the attributes; and generate a decision tree satisfying ε-differential privacy according to the above process.

9. The data prediction classification device according to claim 8, characterized in that both split points and the splitting attribute are selected with the exponential mechanism, the scoring function of the exponential mechanism is either information gain or the maximum class frequency sum, and the sensitivities of the scoring functions are log₂|C| and 1 respectively, where |C| is the size of the class attribute set.

10. The data prediction classification device according to claim 9, characterized in that the prediction classification unit is configured to: for each record in the test set, apply every tree in the forest to predict its class; at each node, decide which child node the record should enter according to the classification result set of the current node, until a leaf node is reached; obtain a predicted value from the current leaf node; obtain, from the predictions of all trees in the forest, the classification result with the highest probability among all predictions; and output the classification results of all records.