CN112560105B - Joint modeling method and device for protecting multi-party data privacy - Google Patents

Joint modeling method and device for protecting multi-party data privacy


Publication number
CN112560105B
CN112560105B (application CN202110188950.1A)
Authority
CN
China
Prior art keywords
feature
items
party
determining
training sample
Prior art date
Legal status
Active
Application number
CN202110188950.1A
Other languages
Chinese (zh)
Other versions
CN112560105A (en
Inventor
黄诤杰
谭潇
陈帅
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202111220972.8A (CN113821827B)
Priority to CN202110188950.1A (CN112560105B)
Publication of CN112560105A
Application granted
Publication of CN112560105B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218: Protecting access to data via a platform, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245: Protecting personal data, e.g. for financial or medical purposes
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of this specification provide a joint modeling method for protecting multi-party data privacy. Each of multiple parties stores a training sample set, and each training sample has feature values corresponding to a plurality of feature items and a label value corresponding to a label item. The method is applied to any first party and includes the following steps: determining a plurality of first association degrees between the plurality of feature items and the label item based on a first training sample set; obtaining a plurality of second association degrees determined by a second party; determining, for each feature item, the difference degree between its corresponding first and second association degrees, to obtain a plurality of difference degrees; determining a plurality of first importance weights of the plurality of feature items in a first tree model constructed using the first training sample set; weighting the plurality of difference degrees with the first importance weights to obtain a feature distribution difference score; and, when the feature distribution difference score meets a predetermined condition, classifying the second party as a participant in joint modeling with the first party.

Description

Joint modeling method and device for protecting multi-party data privacy
Technical Field
The embodiment of the specification relates to the technical field of machine learning, in particular to a joint modeling method and device for protecting multi-party data privacy.
Background
The development of computer technology has made machine learning increasingly widely applied in various business scenarios. Federated learning is a method of joint modeling that protects private data. For example, when enterprises need to perform collaborative security modeling, federated learning allows the data of all parties to be used to cooperatively train a data processing model while fully protecting enterprise data privacy, so that business data are processed more accurately and effectively. In a federated learning scenario, after the parties agree on a model structure (or a common model), each party trains locally on its private data, the model parameters are aggregated by a safe and reliable method, and finally each party improves its local model according to the aggregated parameters. On the basis of privacy protection, federated learning effectively breaks data islands and achieves multi-party joint modeling.
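The local-training-then-aggregation loop described above can be sketched as a FedAvg-style weighted parameter average. The function and data below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def fedavg_round(local_params, sample_counts):
    """One aggregation round: weighted average of each party's locally
    trained parameter vector, weighted by its local sample count.
    (Illustrative FedAvg-style sketch; names are hypothetical.)"""
    total = sum(sample_counts)
    return sum(w * (n / total) for w, n in zip(local_params, sample_counts))

# three parties train locally, then aggregate
params = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
counts = [10, 10, 20]
global_params = fedavg_round(params, counts)  # each party then updates locally
```

In practice the aggregation would run over a secure channel or secure-aggregation protocol; the sketch only shows the arithmetic of the parameter merge.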
However, current federated learning approaches perform poorly in training efficiency. Therefore, a solution is needed that improves the training efficiency of federated learning while preserving its training effect.
Disclosure of Invention
In the joint modeling method and device for protecting multi-party data privacy described in the embodiments of the present specification, the joint modeling effect is ensured and the training efficiency of joint modeling is improved by effectively screening the joint modeling participants.
According to a first aspect, a joint modeling method for protecting privacy of data of multiple parties is provided, wherein the multiple parties respectively store training sample sets, and each training sample has feature values corresponding to multiple feature items and label values corresponding to label items; the method is applied to any first party, and comprises the following steps:
determining the association degree between each feature item in the feature items and the label item based on a first training sample set to obtain a plurality of first association degrees; obtaining a plurality of second relevance degrees determined by the second party based on the second training sample set; determining the difference degree between the first relevance degree and the second relevance degree corresponding to each feature item to obtain a plurality of difference degrees corresponding to the plurality of feature items; constructing a first tree model by using the first training sample set; determining a plurality of first importance weights for the plurality of feature terms in the first tree model; weighting the plurality of difference degrees by using the plurality of first importance weights to obtain a feature distribution difference score; and in the case that the feature distribution difference score meets a preset condition, the second party is classified as a party performing joint modeling with the first party.
In one embodiment, determining the association degree between each feature item of the plurality of feature items and the label item based on the first training sample set to obtain a plurality of first association degrees includes: for each feature item, performing binning processing on the plurality of feature values corresponding to the feature item in the first training sample set to obtain a binning result, wherein the binning result includes mapping relationships between the plurality of feature values and a plurality of binning categories; for each of the plurality of binning categories, determining its sample distribution over the different label values in the first training sample set; and calculating the chi-square test value of the corresponding feature item from the plurality of sample distributions corresponding to the plurality of binning categories, as the first association degree.
In one embodiment, determining, for each feature item, a difference between the first relevance degree and the second relevance degree corresponding to the feature item, and obtaining a plurality of difference degrees corresponding to the plurality of feature items includes: and determining the absolute difference between the corresponding first relevance and second relevance of each feature item as the difference.
In one embodiment, determining a plurality of first importance weights for the plurality of feature items in the first tree model comprises: determining, for each feature item, the number of times it is used as a splitting feature in the first tree model; and normalizing the plurality of counts corresponding to the plurality of feature items to obtain the plurality of first importance weights.
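A minimal sketch of this split-count importance weighting, under the assumption that each tree is represented simply as the list of its splitting-feature names (the data is hypothetical):

```python
from collections import Counter

def split_count_importance(trees, feature_items):
    """Importance weight per feature item: how often it appears as a
    splitting feature across the trees, normalized to sum to 1,
    as in the embodiment above. `trees` is a list of lists of
    split-feature names (an assumed representation)."""
    counts = Counter(f for tree in trees for f in tree)
    total = sum(counts[f] for f in feature_items) or 1
    return {f: counts[f] / total for f in feature_items}

trees = [["age", "income", "age"], ["income", "city"]]
weights = split_count_importance(trees, ["age", "income", "city"])
# age and income each account for 2 of 5 splits, city for 1 of 5
```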
In one embodiment, in a case where the feature distribution difference score meets a predetermined condition, attributing the second party as a participant for joint modeling with the first party includes: in an instance in which the feature distribution difference score is greater than a predetermined threshold, attributing the second party as a participant for joint modeling with the first party.
In one embodiment, the method further comprises: acquiring importance weights of the plurality of feature items, which are determined by a plurality of third parties based on a local training sample set; determining a comprehensive importance weight of each feature item based on the acquired importance weight and the plurality of first importance weights; selecting a part of feature items from the plurality of feature items based on the comprehensive importance weight; sending the partial feature item to the participant to cause the participant to jointly model with the first party based on the partial feature item.
In a specific embodiment, selecting a partial feature item from the plurality of feature items based on the composite importance weight includes: ranking the plurality of feature items based on the composite importance weight; and taking the feature items with the ranking within a preset range as the partial feature items.
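The ranking-and-truncation selection described above can be sketched as follows; the weights and the cutoff are illustrative:

```python
def select_top_features(composite_weights, k):
    """Rank feature items by composite importance weight (descending) and
    keep those ranked within the top k, per the embodiment above."""
    ranked = sorted(composite_weights, key=composite_weights.get, reverse=True)
    return ranked[:k]

weights = {"age": 0.40, "income": 0.35, "city": 0.15, "job": 0.10}
top = select_top_features(weights, 2)  # the partial feature items to send
```

The first party would then send `top` to the screened participants so that joint modeling proceeds on this reduced feature set.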
According to a second aspect, there is provided a joint modeling apparatus for protecting privacy of data of multiple parties, the multiple parties each storing a training sample set, wherein each training sample has feature values corresponding to a plurality of feature items and tag values corresponding to tag items; the apparatus is integrated in any first party, comprising:
the association degree determining unit is configured to determine association degrees between each feature item in the plurality of feature items and the label item based on a first training sample set to obtain a plurality of first association degrees; the association degree obtaining unit is configured to obtain a plurality of second association degrees determined by the second party based on the second training sample set; the difference degree determining unit is configured to determine, for each feature item, a difference degree between a first relevance degree and a second relevance degree corresponding to the feature item, so as to obtain a plurality of difference degrees corresponding to the plurality of feature items; a tree model construction unit configured to construct a first tree model using the first training sample set; a weight determination unit configured to determine a plurality of first importance weights of the plurality of feature items in the first tree model; the score determining unit is configured to perform weighting processing on the plurality of difference degrees by using the plurality of first importance weights to obtain a feature distribution difference score; and the participant screening unit is configured to classify the second party as a participant for joint modeling with the first party when the feature distribution difference score meets a preset condition.
In one embodiment, the association degree determining unit is specifically configured to: for each feature item, perform binning processing on the plurality of feature values corresponding to the feature item in the first training sample set to obtain a binning result, wherein the binning result includes mapping relationships between the plurality of feature values and a plurality of binning categories; for each of the plurality of binning categories, determine its sample distribution over the different label values in the first training sample set; and calculate the chi-square test value of the corresponding feature item from the plurality of sample distributions corresponding to the plurality of binning categories, as the first association degree.
In one embodiment, the difference degree determining unit is specifically configured to: and determining the absolute difference between the corresponding first relevance and second relevance of each feature item as the difference.
In one embodiment, the weight determination unit is specifically configured to: determine, for each feature item, the number of times it is used as a splitting feature in the first tree model; and normalize the plurality of counts corresponding to the plurality of feature items to obtain the plurality of first importance weights.
In one embodiment, the participant screening unit is specifically configured to: in an instance in which the feature distribution difference score is greater than a predetermined threshold, attribute the second party as a participant for joint modeling with the first party.
In one embodiment, the apparatus further comprises a feature screening unit configured to: acquiring importance weights of the plurality of feature items, which are determined by a plurality of third parties based on a local training sample set; determining a comprehensive importance weight of each feature item based on the acquired importance weight and the plurality of first importance weights; selecting a part of feature items from the plurality of feature items based on the comprehensive importance weight; sending the partial feature item to the participant to cause the participant to jointly model with the first party based on the partial feature item.
According to a third aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and the processor, when executing the executable code, implements the method described in the first aspect.
In the method and the device disclosed in the embodiments of the present disclosure, some data parties with similar feature distributions may be discarded by screening the participants of the joint modeling, so that a better model effect may be obtained by using the screened participants and the first party to perform the joint modeling, compared to the case where a plurality of data parties all participate in the joint modeling.
Drawings
To more clearly illustrate the technical solutions of the embodiments disclosed in this specification, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are merely some embodiments disclosed in this specification; those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 illustrates a scenario framework diagram of federated modeling to protect privacy of multi-party data, according to one embodiment;
FIG. 2 illustrates a flow diagram of a federated modeling method that protects multi-party data privacy in accordance with one embodiment;
FIG. 3 illustrates a decision tree included in a tree model according to one embodiment;
FIG. 4 illustrates a block diagram of a device architecture for federated modeling to protect privacy of multi-party data, according to one embodiment.
Detailed Description
Embodiments disclosed in the present specification are described below with reference to the accompanying drawings.
As mentioned above, a solution is needed that improves the training efficiency of federated learning while preserving its training effect. Based on this, the embodiments of this specification disclose a joint modeling method for protecting multi-party data privacy. Specifically, fig. 1 shows a scenario framework diagram of joint modeling for protecting multi-party data privacy according to an embodiment. As shown in fig. 1, the framework includes a data preparation stage, a data screening stage and a modeling stage. In the data preparation stage, each of a plurality of data parties (or simply, parties) performs data preparation, including preprocessing local data, such as feature item alignment, and then calculating feature distribution and feature importance based on the preprocessed training data. In the data screening stage, an initiator (which can be any one of the multiple parties) determines the differences between the locally calculated feature distribution and the feature distributions calculated by the other data parties, and calculates a difference score based on these differences and the locally calculated feature importance, so as to determine the multiple participants of the joint modeling. In the modeling stage, the initiator and the multiple participants perform joint modeling to obtain a jointly trained model. Thus, by reasonably screening the participants of the joint modeling, the training efficiency of the joint modeling can be effectively improved while its training effect is ensured.
The following describes the steps of the above method disclosed in the embodiments of the present disclosure.
FIG. 2 illustrates a flow diagram of a joint modeling method to protect privacy of data of multiple parties, where the multiple parties each store a set of training samples, where each training sample has feature values corresponding to multiple feature items and tag values corresponding to tag items, according to one embodiment.
In one embodiment, the business objects targeted by the training sample set may include users, businesses, commodities, events, and the like, in other words, the samples included in the training sample set may be user samples, business samples, commodity samples, or event samples. Further, in a particular embodiment, the event can be a transaction event, a login event, a download event, a registration event, a complaint event, and the like.
In an embodiment, each training sample belongs to a user sample, the plurality of corresponding feature items may include gender, age, occupation, frequent residence, amount of consumption, frequency of consumption, liveness (e.g., frequency of logging in a certain e-commerce platform, duration of use), and the like, and the corresponding label item may be a user group label, a user risk label, or the like. In a specific embodiment, the tag values corresponding to the user group tags may include a high consumer group, a medium consumer group, a low consumer group, and the like. In a specific embodiment, the tag values corresponding to the user risk tags may include high risk groups, low risk groups, and the like.
In one embodiment, each training sample is a commodity sample; the corresponding plurality of feature items may include commodity origin, price, category, expiration date, packaging, sales volume, and the like, and the corresponding label item may be a commodity popularity label or a commodity target group label. In a specific embodiment, the tag values corresponding to the commodity popularity label may include explosive items, hot items, cold items, and the like. In a specific embodiment, the tag values corresponding to the commodity target group label may include students, office workers, parents, elderly people, and the like.
In one embodiment, each of the training samples belongs to an event sample, the plurality of feature items corresponding to each training sample may include time of occurrence of the event, a network address, a geographic location, a related amount, and the like, and the label item corresponding to each training sample may be an event risk label. In a particular embodiment, the tag values of the event risk tags may include high risk, medium risk, low risk, and the like.
The training sample set and the feature items, the label items and the label values included in the training samples are introduced above. In addition, the above method may be applied to any one of the data parties, and for brevity, is referred to herein as the first party (or first data party). Also, the first data party may be implemented as any platform, server, or cluster of devices having computing, processing, and storage capabilities. As shown in fig. 2, the method comprises the steps of:
step S210, determining the association degree between each feature item in the feature items and the label item based on a first training sample set to obtain a plurality of first association degrees; step S220, a plurality of second relevance degrees determined by the second party based on the second training sample set are obtained; step 230, determining a difference between the first relevance degree and the second relevance degree corresponding to each feature item to obtain a plurality of difference degrees corresponding to the plurality of feature items; step S240, constructing a first tree model by utilizing the first training sample set; step S250, determining a plurality of first importance weights of the plurality of feature items in the first tree model; step S260, weighting the plurality of difference degrees by using the plurality of first importance weights to obtain a feature distribution difference score; and step S270, when the feature distribution difference score meets a predetermined condition, classifying the second party as a party performing joint modeling with the first party.
In view of the above steps, it should be noted that the terms "first" and "second" in "first training sample set", "first association degree", "first tree model" and the like are merely used to distinguish things of the same kind, and impose no ordering or other limitation.
The steps are expanded as follows:
first, in step S210, based on a first training sample set, association degrees between each feature item in the plurality of feature items and the tag item are determined, so as to obtain a plurality of first association degrees. It should be noted that the training sample set stored locally in the first data party is referred to as a first training sample set. Furthermore, the plurality of first degrees of association obtained in this step may form a feature distribution of the first party for the plurality of feature items.
In one embodiment, this step may include: for each feature item, performing binning processing on the plurality of feature values corresponding to the feature item in the first training sample set to obtain a binning result, wherein the binning result includes mapping relationships between the plurality of feature values and a plurality of binning categories; then, for each of the plurality of binning categories, determining its sample distribution over the different label values in the first training sample set; and calculating the chi-square test value of the corresponding feature item from the plurality of sample distributions corresponding to the plurality of binning categories, as the first association degree.
It is to be understood that, for a first feature item of any of the plurality of feature items described above, each training sample includes a feature value corresponding to the first feature item, and thus, the first training set relates to a plurality of first training samples, which respectively include a plurality of feature values corresponding to the first feature item.
Simply put, binning discretizes continuous variables and merges multi-state discrete variables into fewer states. Binning can be performed in various ways, including equal-frequency binning, equidistant binning, clustering binning, Best-KS binning, chi-square binning, and so on. It should be noted that the binning ways used for any two feature items may be the same or different.
According to an example, assuming that the above-mentioned plurality of feature items include annual income, and the plurality of feature values corresponding to the annual income in the first training sample set include 12, 20, 32, 45, 55, 60 (unit: ten thousand), further assuming that a binning manner of equidistant binning is adopted, the binning results shown in table 1 below can be obtained.
TABLE 1
[Table 1 image not reproduced: mapping of each annual-income feature value to its binning category]
As described above, the binning class corresponding to each feature value can be obtained through the binning processing, and for example, the binning class corresponding to the feature value 12 in table 1 is low income.
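As an illustration, the equidistant binning of the example values above can be sketched in Python. The two bin labels beyond "low income" are assumptions, since the Table 1 image is not reproduced:

```python
import numpy as np

values = np.array([12, 20, 32, 45, 55, 60])  # annual income, unit: ten thousand

# equidistant binning into three bins over [min, max]; inner edges at 28 and 44
edges = np.linspace(values.min(), values.max(), 4)   # [12., 28., 44., 60.]
labels = ["low income", "medium income", "high income"]  # assumed labels
bin_idx = np.digitize(values, edges[1:-1])           # 0, 1 or 2 per value
binned = [labels[i] for i in bin_idx]
# 12 and 20 fall in "low income", consistent with the text's example
```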
Further, for each of the plurality of binning categories, the sample distribution of that category over the different label values in the first training sample set is determined. It should be noted that the sample distribution may include, for each binning category, the number of training samples in that category corresponding to each distinct label value. In one example, assuming the label item is the user group category, the corresponding label values include low consumption group and high consumption group, and the sample distribution determined based on the binning results in Table 1 is shown in Table 2 below.
TABLE 2
[Table 2 image not reproduced: per-binning-category sample counts for the label values low consumption group and high consumption group, over 100 samples in total]
In the above, a plurality of sample distributions under a plurality of binning categories for any feature item (such as annual income) can be counted. Then, based on the plurality of sample distributions, a chi-squared test value of the corresponding feature item may be calculated as a first degree of association. In one particular embodiment, the chi-square test value is calculated using a chi-square test, wherein the chi-square test value may be calculated using the following equation (1).
χ² = Σ (A − E)² / E        (1)
In formula (1), χ² represents the chi-square test value, or chi-square statistic; A represents the observed value, and E represents the expected value.
According to an example, expected values corresponding to the respective elements may be determined based on the observed values in table 2. Specifically, assuming that the consumption level of a person is not related to the income level thereof, it can be seen from the contents of the rightmost column in table 2 that the person has a probability of 40/100=0.4 belonging to the low consumption group and a probability of 60/100=0.6 belonging to the high consumption group, and based on these two probabilities, the expectation value shown in table 3 below can be calculated.
TABLE 3
[Table 3 image not reproduced: expected values computed from the marginal probabilities 0.4 and 0.6 under the independence assumption]
Further, from the observed values and the expected values shown in table 3, in combination with formula (1), the chi-square test value corresponding to the feature item "annual income" can be calculated to be 0.72, and this can be used as the first degree of association between the feature item "annual income" and the tag item "user group tag".
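The computation of formula (1) can be sketched as follows. The observed counts below are hypothetical stand-ins (the actual Table 2 counts, which yield 0.72, are not reproduced in the text):

```python
import numpy as np

# hypothetical observed counts: rows = binning categories,
# columns = label values (low consumption, high consumption)
observed = np.array([[10, 10],
                     [20, 20],
                     [10, 30]], dtype=float)

# expected counts under the independence assumption, from the marginals
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot * col_tot / observed.sum()

# formula (1): sum over all cells of (A - E)^2 / E
chi2 = ((observed - expected) ** 2 / expected).sum()
```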
From the above, through binning processing and the chi-square test, the first association degree between each feature item and the label item can be obtained. In another embodiment, the Spearman correlation coefficient between each feature item and the label item can also be calculated as the corresponding first association degree.
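The Spearman alternative can be sketched with SciPy's `spearmanr`; the feature and label values below are hypothetical:

```python
from scipy.stats import spearmanr

# hypothetical feature values and binary labels for one feature item
feature_values = [12, 20, 32, 45, 55, 60]
label_values = [0, 0, 0, 1, 1, 1]

# rank correlation between the feature item and the label item,
# usable as a first association degree in place of the chi-square value
rho, _pvalue = spearmanr(feature_values, label_values)
```

Unlike the chi-square value, Spearman's rho is signed and bounded in [-1, 1], so a party would typically take its absolute value or agree on one convention before comparing association degrees across parties.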
From the above, a plurality of first association degrees between the feature items and the label items can be obtained based on the first training sample set. On the other hand, in step S220, a plurality of second degrees of association determined by the second party based on the second training sample set is obtained. The second party may be any one of the data parties except the first party, a training sample set locally stored by the second party is referred to as a second training sample set, and a plurality of association degrees between the plurality of feature items and the label items determined by the second party based on the second training sample set are referred to as a plurality of second association degrees. The manner in which the second party determines the second degree of association may be the same as the manner in which the first party determines the first degree of association, for example, the manner in which the degree of association is determined is defined in advance by each of the parties, and then the degree of association is determined based on the defined manner.
In one embodiment, this step may include: the plurality of second degrees of association are received from the second party. In another embodiment, this step may include: and acquiring the plurality of second association degrees from the central server, and uploading the plurality of second association degrees to the central server by a second party.
In this way, a plurality of second degrees of association can be obtained.
Next, based on the determined plurality of first association degrees and the acquired plurality of second association degrees, in step S230, the difference degree between the corresponding first association degree and second association degree is determined for each feature item, obtaining a plurality of difference degrees corresponding to the plurality of feature items. It should be noted that the difference degree reflects the difference between the first association degree and the second association degree, and its specific calculation manner may be preset. In one embodiment, for each feature item, the absolute value of the difference between the corresponding first and second association degrees is determined as the difference degree. In another embodiment, for each feature item, the absolute value of the difference between the squares of the corresponding first and second association degrees is determined as the difference degree. In a further embodiment, the ratio of the difference between the first and second association degrees to the first association degree is determined as the difference degree.
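A minimal sketch of these three alternative difference measures; the exact reading of the squared variant, and the absolute value on the ratio, are interpretations of the text above:

```python
def difference_degree(r1, r2, mode="abs"):
    """Difference degree between a first (r1) and second (r2) association
    degree, per the three variants mentioned above (interpreted)."""
    if mode == "abs":        # |r1 - r2|
        return abs(r1 - r2)
    if mode == "abs_sq":     # |r1^2 - r2^2|
        return abs(r1 ** 2 - r2 ** 2)
    if mode == "ratio":      # |r1 - r2| / r1
        return abs(r1 - r2) / r1
    raise ValueError(mode)

d = difference_degree(0.72, 0.30)  # absolute-difference variant
```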
As such, a plurality of degrees of difference may be determined based on the plurality of first degrees of association and the plurality of second degrees of association for the plurality of feature items.
On the other hand, in step S240, a first tree model is constructed using the first training sample set. To facilitate understanding, the constructed tree model may include a plurality of decision trees. Fig. 3 illustrates a decision tree included in the tree model according to one embodiment; it includes a root node 31, a plurality of leaf nodes (e.g., leaf node 35), and a plurality of parent nodes (e.g., parent node 32) between the root node and the leaf nodes. The root node 31 corresponds to the first training sample set, and each sample in that set is routed to some leaf node along a decision path, that is, the node-connection path between the root node and that leaf node (one such path is shown in bold in fig. 3). Each parent node has a corresponding splitting feature and splitting value, where the splitting feature is one of the plurality of feature items. Taking parent node 32 as an example, denote its splitting feature and splitting value as f and v, respectively. For a given training sample, if its feature value for splitting feature f is less than v (judgment result Y), the sample is routed into the left subtree; if it is not less than v (judgment result N), the sample is routed into the right subtree.
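The split rule described above can be sketched as a small routing function; the feature names and threshold are illustrative assumptions:

```python
# Route a sample at one parent node of a decision tree: left subtree if the
# sample's value for the splitting feature is below the splitting value
# (judgment Y), right subtree otherwise (judgment N).
def route(sample, split_feature, split_value):
    return "left" if sample[split_feature] < split_value else "right"

sample = {"age": 35, "income": 5200}
branch = route(sample, "age", 40)  # 35 < 40, so the sample goes left
```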
The construction of the tree model is introduced above. The first tree model is obtained by training on the first training sample set; specifically, the splitting features and splitting values of the decision trees included in the first tree model are selected and computed based on the plurality of feature items involved in the first training sample set and the feature values corresponding to those feature items. In one embodiment, the algorithm on which the first tree model is based may be the GBDT (Gradient Boosting Decision Tree) algorithm, the XGBoost (eXtreme Gradient Boosting) algorithm, the CART (Classification And Regression Tree) algorithm, or the like.
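As a minimal sketch of this training step, a gradient-boosted tree model can be fit on synthetic stand-in data using scikit-learn; the patent does not prescribe a library, and the data and hyperparameters below are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # 200 samples over 4 feature items
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # label item derived from two features

# Fit the "first tree model" on the first party's local training set.
first_tree_model = GradientBoostingClassifier(n_estimators=20, max_depth=3)
first_tree_model.fit(X, y)
```

The fitted ensemble then supplies the splitting features whose usage counts are tallied in step S250.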
The first tree model can be obtained through training. Further, based on the first tree model, in step S250, a plurality of first importance weights of the plurality of feature items in the first tree model are determined.
In one embodiment, this step may be implemented as follows: the number of times each feature item is used as a splitting feature in the first tree model is determined, yielding a plurality of counts corresponding to the plurality of feature items; these counts are then normalized to obtain the plurality of first importance weights. In a specific embodiment, the normalization may consist of dividing each count by the sum of all counts. In another specific embodiment, the normalization may be implemented with a softmax function. In one example, referring to fig. 3, assume that four feature items appear as splitting features in the tree model (which may include multiple decision trees) 5, 10, 8, and 2 times, respectively; the corresponding count ratios are then 0.2, 0.4, 0.32, and 0.08.
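The two normalization schemes, applied to the example split counts above, can be sketched as:

```python
import numpy as np

counts = np.array([5, 10, 8, 2])       # times each feature item splits a node

# Normalization 1: divide each count by the sum of all counts.
weights = counts / counts.sum()        # 0.2, 0.4, 0.32, 0.08

# Normalization 2: softmax over the counts (shifted for numerical stability).
exp = np.exp(counts - counts.max())
softmax_weights = exp / exp.sum()
```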
In another embodiment, this step may be implemented as follows: according to the decision paths of the training samples of the first training sample set in the first tree model, the number of samples passing through each parent node is counted; these sample counts are accumulated per splitting feature, yielding a plurality of sample counts corresponding to the plurality of feature items; normalizing these sample counts gives the plurality of first importance weights.
Thus, a plurality of first importance weights corresponding to the plurality of feature items can be determined. Accordingly, in step S260, the plurality of difference degrees are weighted with the plurality of first importance weights to obtain a feature distribution difference score. In a specific embodiment, the weighting process comprises a weighted summation. In another specific embodiment, the weighting process comprises a weighted summation of the absolute values of the plurality of difference degrees. In a further specific embodiment, the weighting process comprises multiplying the result of the weighted summation by a preset scaling coefficient (e.g., 0.5 or 2). In one example, assuming the plurality of first importance weights corresponding to the plurality of feature items are 0.2, 0.4, 0.32, and 0.08, and the plurality of difference degrees are 3.2, 2.6, 4.8, and 0.76, their weighted sum gives a feature distribution difference score of about 3.28.
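The weighted-summation variant of step S260, applied to the example weights and difference degrees, can be sketched as:

```python
import numpy as np

weights = np.array([0.2, 0.4, 0.32, 0.08])   # first importance weights
diffs = np.array([3.2, 2.6, 4.8, 0.76])      # per-feature difference degrees

# Weighted sum of the (absolute) difference degrees.
score = float(np.dot(weights, np.abs(diffs)))

# Optional preset scaling coefficient, e.g. 2.
scaled_score = 2 * score
```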
In this manner, a feature distribution difference score between the first party and the second party may be obtained. Then, in step S270, in the case where the feature distribution difference score meets a predetermined condition, the second party is classified as a party that performs joint modeling with the first party.
In one embodiment, this step may be implemented as follows: in the case where the feature distribution difference score is greater than a predetermined threshold, the second party is classified as a participant for joint modeling with the first party. The predetermined threshold may be set empirically, for example to 10 or 20. Conversely, in the case where the feature distribution difference score is not greater than the predetermined threshold, the second party is not classified as a participant.
In another embodiment, before step S270, the method may further include: determining feature distribution difference scores between the first party and a plurality of other parties besides the second party, and jointly ranking the feature distribution difference scores corresponding to the second party and those other parties. This step may then be implemented as follows: in the case where the ranking of the feature distribution difference score of the second party falls within a predetermined range (e.g., the top 10 or top 5), the second party is classified as a participant for joint modeling with the first party; otherwise, it is not so classified. It is to be noted that "a plurality" in this specification means one or more.
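The two screening rules, absolute threshold versus ranking within a predetermined range, can be sketched as follows; the party labels, scores, threshold, and k are illustrative assumptions:

```python
# Feature distribution difference scores between the first party and candidates.
scores = {"B": 3.28, "C": 1.1, "D": 5.4, "E": 0.3}

# Rule 1: classify as participant every party whose score exceeds a threshold.
threshold = 2.0
by_threshold = {p for p, s in scores.items() if s > threshold}

# Rule 2: classify as participants the top-k parties by score.
k = 2
by_rank = set(sorted(scores, key=scores.get, reverse=True)[:k])
```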
In this way, the participants for joint modeling with the first party can be screened out from the plurality of data parties. It will be appreciated that the finally determined number of participants may be one or more. By screening the participants of the joint modeling through steps S210 to S270, data parties whose feature distributions are overly similar to the first party's can be discarded; compared with having all data parties participate in the joint modeling, performing joint modeling with only the screened participants and the first party can achieve a better model effect.
According to another embodiment, as shown in fig. 1, the method may further include screening the feature items. In one embodiment, the screening of the feature items may be implemented as follows: importance weights of the plurality of feature items, determined by a plurality of third parties each based on its local training sample set, are acquired; a comprehensive importance weight of each feature item is determined based on the acquired importance weights and the plurality of first importance weights; partial feature items are then selected from the plurality of feature items based on the comprehensive importance weights; and the partial feature items are sent to the participant, so that the participant performs joint modeling with the first party based on the partial feature items.
It should be noted that the plurality of third parties belong to the data parties other than the first party, and may or may not include the second party. For the manner in which the third parties determine the importance weights of the feature items, reference may be made to the foregoing description of how the first party determines the plurality of first importance weights, which is not repeated here.
For the determination of the comprehensive importance weight, in a specific embodiment: based on the acquired importance weights and the plurality of first importance weights, the importance weights corresponding to each feature item are accumulated, and the accumulated total weight is taken as the comprehensive importance weight of that feature item. In another specific embodiment, the average of the accumulated total weight may be taken as the comprehensive importance weight of the corresponding feature item.
For the selection of the partial feature items, in a specific embodiment, the plurality of feature items may be ranked by their comprehensive importance weights, and the feature items ranked within a predetermined range taken as the partial feature items. In another specific embodiment, the comprehensive importance weights may be normalized, and the feature items whose normalized values exceed a preset threshold taken as the partial feature items.
From the above, the selection of the feature item can be completed. Furthermore, the first party and the determined parties can perform combined modeling based on the optimized partial feature items, so that the efficiency of combined modeling can be further improved, and meanwhile, the good effect of model training is ensured.
According to an embodiment of a further aspect, after step S270, the method may further include: the first party performs joint modeling with the several data parties classified as participants. In a specific embodiment, the model used for joint modeling may be SecureBoost, a DNN (Deep Neural Network), a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or the like. In a specific embodiment, a trusted central server may be introduced to communicate with the first party and each participant, aggregating the training gradients locally determined by each party so as to obtain the finally trained model. In another specific embodiment, no processing party other than the first party and the participants is introduced; secure data exchange between the first party and the participants may instead be realized based on secure multi-party computation (MPC) techniques, such as homomorphic encryption, to obtain the finally trained model. In this way, federated learning can be achieved, yielding a machine learning model jointly trained by the first party and the participants based on their local training sample sets.
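Structurally, the central-server variant reduces to averaging the participants' locally computed gradients; the sketch below shows only that aggregation structure with plain values, whereas in a real deployment the gradients would be protected, e.g., encrypted or secret-shared:

```python
import numpy as np

def aggregate(local_gradients):
    """Average the gradients computed locally by each participant."""
    return np.mean(np.stack(local_gradients), axis=0)

g_first = np.array([0.1, -0.2, 0.3])   # first party's local gradient
g_part = np.array([0.3, 0.0, 0.1])     # a participant's local gradient
global_grad = aggregate([g_first, g_part])
```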
Corresponding to the above joint modeling method, fig. 4 illustrates the structure of a joint modeling apparatus for protecting the privacy of multi-party data, where each of the multiple parties stores a training sample set, and each training sample has feature values corresponding to a plurality of feature items and a label value corresponding to a label item; the apparatus is integrated in any first party. As shown in fig. 4, the apparatus 400 includes the following units:
an association degree determining unit 410 configured to determine, based on a first training sample set, an association degree between each feature item of the plurality of feature items and the tag item, so as to obtain a plurality of first association degrees; an association obtaining unit 420 configured to obtain a plurality of second associations determined by the second party based on the second training sample set; a difference determining unit 430, configured to determine, for each feature item, a difference between a first relevance and a second relevance corresponding to the feature item, so as to obtain a plurality of differences corresponding to the plurality of feature items; a tree model construction unit 440 configured to construct a first tree model using the first training sample set; a weight determining unit 450 configured to determine a plurality of first importance weights of the plurality of feature items in the first tree model; a score determining unit 460, configured to perform weighting processing on the plurality of difference degrees by using the plurality of first importance weights, so as to obtain a feature distribution difference score; a participant screening unit 470 configured to classify the second party as a participant for joint modeling with the first party if the feature distribution difference score meets a predetermined condition.
In an embodiment, the association degree determining unit 410 is specifically configured to: for each feature item, perform binning processing on the plurality of feature values corresponding to the feature item in the first training sample set to obtain a binning result, where the binning result includes mapping relationships between the plurality of feature values and a plurality of binning categories; for each of the plurality of binning categories, determine its sample distribution over the different label values in the first training sample set; and calculate a chi-square test value of the corresponding feature item from the plurality of sample distributions corresponding to the plurality of binning categories, as the first association degree.
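The binning-plus-chi-square computation performed by unit 410 can be sketched as follows; the data, the single bin edge, and the use of SciPy are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2_contingency

feature = np.array([1.2, 3.4, 0.5, 2.2, 4.1, 0.9, 3.8, 1.7])  # one feature item
labels = np.array([0, 1, 0, 0, 1, 0, 1, 1])                    # binary label item

# Binning: two categories, below / not below 2.0.
bins = np.digitize(feature, bins=[2.0])

# Sample distribution of each binning category over the label values.
table = np.zeros((2, 2))
for b, y in zip(bins, labels):
    table[b, y] += 1

# Chi-square test value over the bin-vs-label table: the association degree.
chi2, p_value, dof, _ = chi2_contingency(table)
```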
In an embodiment, the difference degree determining unit 430 is specifically configured to: and determining the absolute difference between the corresponding first relevance and second relevance of each feature item as the difference.
In one embodiment, the weight determining unit 450 is specifically configured to: determining the times of the feature items being taken as splitting features in the first tree model; and carrying out normalization processing on a plurality of times corresponding to the plurality of feature items to obtain a plurality of first importance weights.
In an embodiment, the participant screening unit 470 is specifically configured to: in an instance in which the feature distribution difference score is greater than a predetermined threshold, attributing the second party as a participant for joint modeling with the first party.
In one embodiment, the apparatus 400 further comprises a feature screening unit 480 configured to: acquiring importance weights of the plurality of feature items, which are determined by a plurality of third parties based on a local training sample set; determining a comprehensive importance weight of each feature item based on the acquired importance weight and the plurality of first importance weights; selecting a part of feature items from the plurality of feature items based on the comprehensive importance weight; sending the partial feature item to the participant to cause the participant to jointly model with the first party based on the partial feature item.
In a specific embodiment, the feature filtering unit 480 is further configured to: ranking the plurality of feature items based on the composite importance weight; and taking the feature items with the ranking within a preset range as the partial feature items.
As above, according to an embodiment of a further aspect, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
There is also provided, according to an embodiment of yet another aspect, a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the embodiments disclosed in the present specification are further described in detail, it should be understood that the above-mentioned embodiments are only specific embodiments of the embodiments disclosed in the present specification, and are not intended to limit the scope of the embodiments disclosed in the present specification, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the embodiments disclosed in the present specification should be included in the scope of the embodiments disclosed in the present specification.

Claims (14)

1. A joint modeling method for protecting privacy of multi-party data is disclosed, wherein, a plurality of parties respectively store training sample sets, wherein each training sample has characteristic values corresponding to a plurality of characteristic items and label values corresponding to label items; the method is applied to any first party, and comprises the following steps:
determining the association degree between each feature item in the feature items and the label item based on a first training sample set to obtain a plurality of first association degrees;
acquiring a plurality of second relevance degrees determined by a second party based on a second training sample set, wherein the mode for determining the second relevance degrees by the second party is the same as the mode for determining the first relevance degrees by the first party;
determining the difference degree between the first relevance degree and the second relevance degree corresponding to each feature item to obtain a plurality of difference degrees corresponding to the plurality of feature items;
constructing a first tree model by using the first training sample set; determining a plurality of first importance weights for the plurality of feature terms in the first tree model;
weighting the plurality of difference degrees by using the plurality of first importance weights to obtain a feature distribution difference score;
in an instance in which the feature distribution difference score is greater than a predetermined threshold, attributing the second party as a participant for joint modeling with the first party.
2. The method of claim 1, wherein determining a degree of association between each of the plurality of feature items and the label item based on a first training sample set, resulting in a plurality of first degrees of association, comprises:
for each feature item, performing binning processing on a plurality of feature values corresponding to the feature item in the first training sample set to obtain a binning result, wherein the binning result includes mapping relationships between the plurality of feature values and a plurality of binning categories;
for each of the plurality of binning classes, determining a sample distribution thereof corresponding to a different label value in the first training sample set;
and calculating a chi-square test value of the corresponding feature item according to a plurality of sample distributions corresponding to the plurality of binning categories, as the first association degree.
3. The method according to claim 1, wherein determining, for each feature item, a difference between the first relevance and the second relevance to which the feature item corresponds to obtain a plurality of differences corresponding to the plurality of feature items comprises:
and determining the absolute difference between the corresponding first relevance and second relevance of each feature item as the difference.
4. The method of claim 1, wherein determining a plurality of first importance weights for the plurality of feature terms in the first tree model comprises:
determining the times of the feature items being taken as splitting features in the first tree model;
and carrying out normalization processing on a plurality of times corresponding to the plurality of feature items to obtain a plurality of first importance weights.
5. The method of claim 1, wherein the method further comprises:
acquiring importance weights of the plurality of feature items, which are determined by a plurality of third parties based on a local training sample set;
determining a comprehensive importance weight of each feature item based on the acquired importance weight and the plurality of first importance weights;
selecting a part of feature items from the plurality of feature items based on the comprehensive importance weight;
sending the partial feature item to the participant to cause the participant to jointly model with the first party based on the partial feature item.
6. The method of claim 5, wherein selecting a partial feature item from the plurality of feature items based on the composite importance weight comprises:
ranking the plurality of feature items based on the composite importance weight;
and taking the feature items with the ranking within a preset range as the partial feature items.
7. A joint modeling device for protecting privacy of multi-party data, wherein each of the multiple parties stores a training sample set, and each training sample has characteristic values corresponding to a plurality of characteristic items and label values corresponding to label items; the apparatus is integrated in any first party, comprising:
the association degree determining unit is configured to determine association degrees between each feature item in the plurality of feature items and the label item based on a first training sample set to obtain a plurality of first association degrees;
the association degree obtaining unit is configured to obtain a plurality of second association degrees determined by a second party based on a second training sample set, wherein the manner for determining the second association degrees by the second party is the same as the manner for determining the first association degrees by the first party;
the difference degree determining unit is configured to determine, for each feature item, a difference degree between a first relevance degree and a second relevance degree corresponding to the feature item, so as to obtain a plurality of difference degrees corresponding to the plurality of feature items;
a tree model construction unit configured to construct a first tree model using the first training sample set;
a weight determination unit configured to determine a plurality of first importance weights of the plurality of feature items in the first tree model;
the score determining unit is configured to perform weighting processing on the plurality of difference degrees by using the plurality of first importance weights to obtain a feature distribution difference score;
a participant screening unit configured to classify the second party as a participant for joint modeling with the first party if the feature distribution difference score is greater than a predetermined threshold.
8. The apparatus according to claim 7, wherein the association degree determining unit is specifically configured to:
for each feature item, performing binning processing on a plurality of feature values corresponding to the feature item in the first training sample set to obtain a binning result, wherein the binning result includes mapping relationships between the plurality of feature values and a plurality of binning categories;
for each of the plurality of binning classes, determining a sample distribution thereof corresponding to a different label value in the first training sample set;
and calculating a chi-square test value of the corresponding feature item according to a plurality of sample distributions corresponding to the plurality of binning categories, as the first association degree.
9. The apparatus according to claim 7, wherein the disparity determining unit is specifically configured to:
and determining the absolute difference between the corresponding first relevance and second relevance of each feature item as the difference.
10. The apparatus according to claim 7, wherein the weight determination unit is specifically configured to:
determining the times of the feature items being taken as splitting features in the first tree model;
and carrying out normalization processing on a plurality of times corresponding to the plurality of feature items to obtain a plurality of first importance weights.
11. The apparatus of claim 7, wherein the apparatus further comprises a feature screening unit configured to:
acquiring importance weights of the plurality of feature items, which are determined by a plurality of third parties based on a local training sample set;
determining a comprehensive importance weight of each feature item based on the acquired importance weight and the plurality of first importance weights;
selecting a part of feature items from the plurality of feature items based on the comprehensive importance weight;
sending the partial feature item to the participant to cause the participant to jointly model with the first party based on the partial feature item.
12. The apparatus of claim 11, wherein the feature screening unit is further configured to:
ranking the plurality of feature items based on the composite importance weight;
and taking the feature items with the ranking within a preset range as the partial feature items.
13. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-6.
14. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that when executed by the processor implements the method of any of claims 1-6.