CN110162995B

CN110162995B - Method and device for evaluating data contribution degree

Info

Publication number: CN110162995B
Application number: CN201910323738.4A
Authority: CN
Inventors: 陈超超; 周俊
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2019-04-22
Filing date: 2019-04-22
Publication date: 2023-01-10
Anticipated expiration: 2039-04-22
Also published as: CN110162995A

Abstract

The application relates to the field of data sharing and discloses a method and a device for evaluating a data contribution degree. The method is performed by a first party and comprises: performing model training using the first party's own training data to obtain a first model; performing model training jointly with a second party, using the first party's own training data and based on secure multi-party computation, to obtain a second model, wherein the second party provides its own data during the joint model training with the first party; obtaining an evaluation result of the first model and an evaluation result of the second model, respectively, using the first party's own test data; and evaluating the contribution degree of the second party's data according to the degree of improvement of the evaluation result of the second model relative to the evaluation result of the first model.

Description

Method and device for evaluating data contribution degree

Technical Field

The present application relates to the field of data sharing.

Background

Data sharing is becoming the next challenging problem for both research and practice. It refers to multiple data parties jointly performing data mining or machine learning while protecting the privacy of their respective data, so as to extract greater value from the data. FIG. 1 is a schematic diagram of the data sharing principle.

For example, three banks each hold credit data about their users and want to jointly train a better credit model for extending credit to those users. A practical concern for each party is whether the other parties will deceive it by supplying false or low-quality data. That is, when data is shared, the contribution degree of each party's data needs to be evaluated.

In the prior art, evaluating the contribution degree of each party's data in data sharing has the following disadvantages:

(1) The contribution degree of each party's data can be judged only by pooling the plaintext data of all parties;

(2) The privacy of each party's data cannot be protected.

Disclosure of Invention

The present specification provides a method and an apparatus for evaluating a data contribution degree, which can evaluate the contribution degree of each party's data to a final service on the premise of protecting the privacy of each party's data.

To solve the above technical problem, an embodiment of the present specification discloses a method for evaluating a data contribution degree, the method being performed by a first party and including:

performing model training using the first party's own training data to obtain a first model;

performing model training jointly with a second party, using the first party's own training data and based on secure multi-party computation, to obtain a second model, wherein the second party provides its own data during the joint model training with the first party;

obtaining an evaluation result of the first model and an evaluation result of the second model, respectively, using the first party's own test data;

and evaluating the contribution degree of the second party's data according to the degree of improvement of the evaluation result of the second model relative to the evaluation result of the first model.

Embodiments of the present specification also disclose an apparatus for evaluating a data contribution degree, the apparatus being used by a first party and comprising:

a first training module, configured to perform model training using the first party's own training data to obtain a first model;

a second training module, configured to perform model training jointly with a second party, using the first party's own training data and based on secure multi-party computation, to obtain a second model, wherein the second party provides its own data during the joint model training with the first party;

a first testing module, configured to obtain an evaluation result of the first model and an evaluation result of the second model, respectively, using the first party's own test data;

and a first evaluation module, configured to evaluate the contribution degree of the second party's data according to the degree of improvement of the evaluation result of the second model relative to the evaluation result of the first model.

The embodiment of the present specification also discloses an apparatus for evaluating a data contribution degree, including:

a memory for storing computer-executable instructions; and

a processor for implementing the steps of the above method when executing the computer executable instructions.

Embodiments of the present specification also disclose a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement the steps of the above-described method.

In the embodiments of the present specification, the plaintext data of the parties does not need to be pooled, and the contribution degree of each party's data to the final service can be evaluated while protecting the data privacy of each party.

A large number of technical features are described in the specification of the present application and are distributed across various technical solutions; listing every possible combination of these technical features (i.e., every technical solution) would make the specification excessively long. To avoid this problem, the technical features disclosed in the above summary of the invention, the technical features disclosed in the following embodiments and examples, and the technical features disclosed in the drawings may be freely combined with one another to constitute various new technical solutions (all of which are considered to have been described in the present specification), unless such a combination is technically infeasible. For example, suppose one example discloses features A + B + C and another example discloses features A + B + D + E, where features C and D are equivalent technical means serving the same purpose, so that technically only one of them can be used and they cannot be adopted simultaneously, and where feature E can technically be combined with feature C. Then the solution A + B + C + D should not be considered as having been described, because it is technically infeasible, whereas the solution A + B + C + E should be considered as having been described.

Drawings

FIG. 1 is a schematic illustration of a data sharing concept;

FIG. 2 is a flow chart illustrating a method for evaluating a degree of data contribution according to a first embodiment of the present application;

FIG. 3 is a schematic structural diagram of an apparatus for evaluating a data contribution degree according to a third embodiment of the present application.

Detailed Description

In the following description, numerous technical details are set forth in order to provide a better understanding of the present application. However, it will be understood by those skilled in the art that the technical solutions claimed in the present application may be implemented without these technical details and with various changes and modifications based on the following embodiments.

Explanation of some concepts:

Secure data sharing: multiple data parties jointly perform data mining or machine learning while protecting the privacy of their respective data.

To make the objects, technical solutions and advantages of the present specification clearer, embodiments of the present specification will be described in further detail below with reference to the accompanying drawings.

A first embodiment of the present specification relates to a method for evaluating a data contribution degree, and a flow chart thereof is schematically shown in fig. 2.

First, it should be noted that the method is executed by the first party, that is, the method is a method for the first party to evaluate the data contribution degree of other parties.

As shown in fig. 2, the method for evaluating the degree of data contribution includes the following steps:

In step 201, model training is performed using the first party's own training data to obtain a first model.

That is, in step 201, the first party trains a model using only its own training data to obtain the first model.

The flow then proceeds to step 203, in which model training is performed jointly with a second party, using the first party's own training data and based on secure multi-party computation, to obtain a second model, wherein the second party provides its own data during the joint model training with the first party.

That is, in step 203, the first party contributes its own training data, the second party contributes its own data, and the second model is obtained using a data-sharing modeling method (secure multi-party computation).

In this embodiment, the model is preferably a logistic regression model. Alternatively, a neural network model, a tree model, or the like may be used.

Secure multi-party computation addresses the problem of privacy-preserving collaborative computation among a group of mutually distrusting participants, for example jointly training a logistic regression model. Secure multi-party computation must ensure the independence of the inputs and the correctness of the computation, while not leaking any input value to the other participants in the computation. After the computation is completed, the result is delivered to each participant.

Approaches to secure multi-party computation fall mainly into three categories:

1. garbled circuits;

2. homomorphic encryption;

3. secret sharing.

For example, a common logistic regression model can be implemented with any of the three approaches, each of which has its own advantages and disadvantages. That is, in this embodiment, the secure multi-party computation may use any of the above three approaches.
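As an illustration of the homomorphic encryption category, the following is a toy sketch of an additively homomorphic scheme. The choice of the Paillier cryptosystem, the tiny primes, and all function names are assumptions made for illustration; the specification does not name a particular scheme, and real deployments would use keys of thousands of bits and a vetted library. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets

def keygen(p=1789, q=1861):
    """Toy Paillier key generation with tiny, insecure primes (assumption)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c with the private key."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    return (c1 * c2) % (n * n)

n, priv = keygen()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
total = decrypt(n, priv, c_sum)
print(total)
```

The party holding only the public key can combine the two ciphertexts without learning either plaintext; only the key holder can decrypt the sum.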

Secret sharing is a cryptographic technique for storing a secret in split form. The secret is divided in a suitable manner into multiple secret shares, each of which is held and managed by one of multiple participants. No single participant can recover the complete secret; the complete secret can be recovered only when multiple participants cooperate. Secret sharing aims to prevent the secret from being overly concentrated, thereby spreading risk and tolerating intrusion.
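The idea of splitting a secret into shares can be sketched with additive secret sharing, one common construction. The modulus and the function names below are illustrative assumptions, not taken from the specification, which does not commit to a particular sharing scheme.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus (assumption)

def split_secret(secret, n_parties, modulus=MODULUS):
    """Split an integer into n additive shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % modulus
    return shares + [last]

def recover_secret(shares, modulus=MODULUS):
    """Recover the secret; every share is required."""
    return sum(shares) % modulus

shares = split_secret(123456, n_parties=3)
recovered = recover_secret(shares)
print(recovered)
```

In this construction each individual share is a uniformly random value, so a single participant learns nothing about the secret from its own share alone.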

Secret sharing can be roughly divided into two categories: secret sharing with a trusted initializer and secret sharing without a trusted initializer. In secret sharing with a trusted initializer, the trusted initializer performs parameter initialization for each participant in the secure multi-party computation (typically by generating random numbers that satisfy certain conditions). After the initialization is completed, the trusted initializer destroys its data and withdraws; it is not needed in the subsequent secure multi-party computation.

Secret-sharing matrix multiplication with a trusted initializer applies to the following case: the complete secret data is the product of a first set of secret shares and a second set of secret shares, and each participant holds one share from the first set and one share from the second set. Through secret-sharing matrix multiplication with a trusted initializer, each participant obtains a partial result of the complete secret data, and the sum of the partial results obtained by all participants equals the complete secret data. Each participant then discloses its partial result to the other participants, so that every participant can obtain the complete secret data without disclosing the secret shares it holds, thereby ensuring the security of each participant's data.
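One well-known way to realize such a multiplication for two participants is with precomputed multiplication triples supplied by the trusted initializer. The sketch below assumes that construction (often called Beaver triples), additive shares modulo a small prime, and hypothetical function names; the specification does not state which protocol is used. Both parties are simulated in one process for readability.

```python
import numpy as np

P = 65537  # small prime modulus so that int64 arithmetic cannot overflow here
rng = np.random.default_rng(0)

def share(M):
    """Split matrix M into two additive shares modulo P."""
    s0 = rng.integers(0, P, size=M.shape, dtype=np.int64)
    return s0, (M - s0) % P

def trusted_initializer(n, k, m):
    """Offline step: random A (n x k), B (k x m), C = A @ B, all handed out as shares.
    After this call the initializer plays no further role."""
    A = rng.integers(0, P, size=(n, k), dtype=np.int64)
    B = rng.integers(0, P, size=(k, m), dtype=np.int64)
    return share(A), share(B), share((A @ B) % P)

def multiply_shared(X_sh, Y_sh, triple):
    """Return additive shares (Z0, Z1) of X @ Y without revealing X or Y."""
    (A0, A1), (B0, B1), (C0, C1) = triple
    # Each party publishes its masked share; only X - A and Y - B become public.
    E = (X_sh[0] - A0 + X_sh[1] - A1) % P
    F = (Y_sh[0] - B0 + Y_sh[1] - B1) % P
    Z0 = (E @ F + E @ B0 + A0 @ F + C0) % P  # partial result of party 0
    Z1 = (E @ B1 + A1 @ F + C1) % P          # partial result of party 1
    return Z0, Z1

X = np.array([[1, 2], [3, 4]], dtype=np.int64)
Y = np.array([[5, 6], [7, 8]], dtype=np.int64)
triple = trusted_initializer(2, 2, 2)
Z0, Z1 = multiply_shared(share(X), share(Y), triple)
Z = (Z0 + Z1) % P  # the parties disclose their partial results and add them
print(Z)
```

The two partial results sum to the full product because (X - A)(Y - B) + (X - A)B + A(Y - B) + AB = XY.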

In addition, model training based on secure multi-party computation may also use a trusted zone in the service device as an execution environment isolated from the outside. Encrypted data is decrypted inside the trusted zone to obtain user data, and the model is trained on the user data inside the trusted zone, so that the user data is never exposed outside the trusted zone during the entire model training process, which protects user privacy.

Of course, the above describes only two implementations of secure multi-party computation. Those skilled in the art will appreciate that secure multi-party computation is a mature technique in the art and is not described in further detail herein.

It should be noted that steps 201 and 203 need not be performed in any particular order: step 201 may be performed before step 203, step 203 may be performed before step 201, or steps 201 and 203 may be performed simultaneously.

Then, step 205 is performed to obtain the evaluation result of the first model and the evaluation result of the second model respectively using the test data of the first party.

That is, in step 205, the first model and the second model respectively obtain their respective evaluation results on the test data of the first party.

With regard to how the evaluation results of the first model and the second model are obtained, there are different evaluation criteria for different service scenarios:

For example, an advertisement click-through rate model is typically evaluated by the AUC metric; a credit risk control service is typically evaluated by the KS metric; in the field of e-commerce, evaluation is typically performed using the GMV metric; and so on.
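As a minimal sketch of two of the metrics named above, AUC and KS can be computed from binary labels and model scores as follows. The function names and the tiny data set are illustrative assumptions; GMV is a business quantity rather than a score-based statistic and is not shown.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative
    (ties count one half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def ks_statistic(y_true, scores):
    """KS: largest gap between the cumulative score distributions of the two classes."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    thresholds = np.unique(scores)
    cdf_pos = np.array([(pos <= t).mean() for t in thresholds])
    cdf_neg = np.array([(neg <= t).mean() for t in thresholds])
    return float(np.max(np.abs(cdf_pos - cdf_neg)))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)
ks = ks_statistic(labels, scores)
print(auc, ks)
```

Whichever metric suits the service scenario, the same metric would be applied to both models on the first party's test data so that the results are comparable.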

Then, step 207 is performed to evaluate the contribution degree of the second party's data according to the degree of improvement of the evaluation result of the second model relative to the evaluation result of the first model.

That is, the degree to which the second model's performance on the first party's test data improves over the first model's performance on the same test data is the contribution degree of the second party's data.

For example, suppose testing shows that the accuracy of the first model is 90% and the accuracy of the second model is 91%. The accuracy of the second model is then 1 percentage point higher than that of the first model, and this 1-percentage-point improvement reflects the contribution degree of the second party's data.
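Steps 201 to 207 can be sketched end to end as follows. This is a simulation under stated assumptions, not the patented implementation: the data is synthetic, both parties are assumed to hold samples with the same feature columns, accuracy is used as the evaluation metric, and `joint_train_placeholder` is a hypothetical stand-in that trains on pooled plaintext purely so the sketch can run. In a real system that step would be replaced by a secure multi-party computation protocol, and plaintext would never be pooled.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent; returns weights (bias last)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * (Xb.T @ (p - y)) / len(y)
    return w

def accuracy(w, X, y):
    """Fraction of samples whose predicted class (logit > 0) matches the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((Xb @ w > 0).astype(int) == y))

def joint_train_placeholder(X_first, y_first, X_second, y_second):
    """HYPOTHETICAL stand-in for step 203 (simulation only, pools plaintext)."""
    X = np.vstack([X_first, X_second])
    y = np.concatenate([y_first, y_second])
    return train_logreg(X, y)

def make_data(rng, n, w_true):
    """Synthetic labelled samples (assumption, for demonstration only)."""
    X = rng.normal(size=(n, len(w_true)))
    y = (X @ w_true + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5, 1.5, -1.0])
X_train_1, y_train_1 = make_data(rng, 30, w_true)    # first party, training
X_test_1, y_test_1 = make_data(rng, 1000, w_true)    # first party, test
X_train_2, y_train_2 = make_data(rng, 2000, w_true)  # second party

model_1 = train_logreg(X_train_1, y_train_1)                       # step 201
model_2 = joint_train_placeholder(X_train_1, y_train_1,
                                  X_train_2, y_train_2)            # step 203
acc_1 = accuracy(model_1, X_test_1, y_test_1)                      # step 205
acc_2 = accuracy(model_2, X_test_1, y_test_1)
contribution_of_second_party = acc_2 - acc_1                       # step 207
print(acc_1, acc_2, contribution_of_second_party)
```

A contribution near zero or below zero would suggest that the second party's data adds little, or is of low quality, with respect to the first party's test data.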

This flow ends thereafter.

In summary, in the above embodiment of the present specification, two models are trained using different data and their evaluation results are compared, so that the contribution degree of each party's data to the final service can be evaluated while protecting the privacy of each party's data.

A second embodiment of the present specification relates to a method for evaluating a data contribution degree. The second embodiment is substantially the same as the first embodiment, except that in the first embodiment data sharing involves the first party and the second party, whereas in the second embodiment three or more parties participate in data sharing.

In the case of multi-party data sharing, the parties' data can be added one party at a time; that is, each time one more party's data is added, the contribution degree of that party's data is evaluated according to the method of the first embodiment.

The following examples are given:

If a third party participates in data sharing, that is, if the method for evaluating the data contribution degree further includes evaluating the contribution degree of the third party's data, then, following the method of the first embodiment, a model is first built with the first party's data and the second party's data, a further model is then built with the third party's data added, and the two are compared.

Specifically, when the method further includes evaluating the contribution degree of the third party's data, the method for evaluating the data contribution degree includes the following steps:

evaluating the contribution degree of the second party's data according to the degree of improvement of the evaluation result of the second model relative to the evaluation result of the first model;

performing model training jointly with the second party and the third party, using the first party's training data and based on secure multi-party computation, to obtain a third model, wherein the second party and the third party provide their own data during the joint model training with the first party;

obtaining an evaluation result of the third model using the first party's test data;

and evaluating the contribution degree of the third party's data according to the degree of improvement of the evaluation result of the third model relative to the evaluation result of the second model.

For example, suppose testing shows that the accuracy of the first model is 90%, the accuracy of the second model is 91%, and the accuracy of the third model is 93%. The accuracy of the second model is then 1 percentage point higher than that of the first model, and the accuracy of the third model is 2 percentage points higher than that of the second model. The 1-percentage-point improvement of the second model reflects the contribution degree of the second party's data, and the 2-percentage-point improvement of the third model reflects the contribution degree of the third party's data.
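The arithmetic of this example generalizes to any number of parties added one at a time. A minimal sketch, with a hypothetical function name and the accuracies from the example above:

```python
def incremental_contributions(scores):
    """scores[0] is the evaluation result of the first party's own model;
    scores[i] is the result after the (i+1)-th party's data has been added.
    Returns the gain attributed to each added party, in order."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]

# Accuracies of the first, second, and third models from the example.
gains = incremental_contributions([0.90, 0.91, 0.93])
# Express the gains in percentage points, rounding away floating-point noise.
gains_pp = [round(g * 100, 6) for g in gains]
print(gains_pp)
```

Under this scheme the gain attributed to a party depends on the order in which the parties' data is added.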

When there are four parties, models are built using the first party's training data together with the second party's data, the third party's data, and the fourth party's data, respectively; the first party's test data is then used to obtain the evaluation result of each model; finally, the evaluation results of the models are compared with one another, so that the contribution degree of each party's data is evaluated.

By analogy, the method for evaluating the data contribution degree can be applied to data sharing among five parties, six parties, seven parties, and so on, and the contribution degree of each party's data to the final service can be evaluated while protecting the privacy of each party's data.

The first embodiment is a method embodiment corresponding to the present embodiment, and the technical details in the first embodiment may be applied to the present embodiment, and the technical details in the present embodiment may also be applied to the first embodiment.

A third embodiment of the present application relates to an apparatus for evaluating a degree of data contribution, and a schematic structural diagram thereof is shown in fig. 3.

First, it should be noted that the apparatus is used for the first party, that is, the apparatus is an apparatus used by the first party to evaluate the data contribution degree of other parties.

As shown in fig. 3, the apparatus for evaluating the degree of data contribution includes:

A first training module, configured to perform model training using the first party's own training data to obtain a first model.

A second training module, configured to perform model training jointly with a second party, using the first party's own training data and based on secure multi-party computation, to obtain a second model, wherein the second party provides its own data during the joint model training with the first party.

That is, the first party contributes its own training data, the second party contributes its own data, and the second model is obtained using a data-sharing modeling method (secure multi-party computation).

Secure multi-party computation addresses the problem of privacy-preserving collaborative computation among a group of mutually distrusting participants, for example jointly training a logistic regression model. Secure multi-party computation must ensure the independence of the inputs and the correctness of the computation, while not leaking any input value to the other participants in the computation. After the computation is completed, the result is delivered to each participant.

Approaches to secure multi-party computation fall mainly into three categories:

1. garbled circuits;

2. homomorphic encryption;

3. secret sharing.

For example, a common logistic regression model can be implemented with any of the three approaches, each of which has its own advantages and disadvantages. That is, in this embodiment, the secure multi-party computation may use any of the above three approaches.

Secret sharing is a cryptographic technique for storing a secret in split form. The secret is divided in a suitable manner into multiple secret shares, each of which is held and managed by one of multiple participants. No single participant can recover the complete secret; the complete secret can be recovered only when multiple participants cooperate. Secret sharing aims to prevent the secret from being overly concentrated, thereby spreading risk and tolerating intrusion.

Secret sharing can be roughly divided into two categories: secret sharing with a trusted initializer and secret sharing without a trusted initializer. In secret sharing with a trusted initializer, the trusted initializer performs parameter initialization for each participant in the secure multi-party computation (typically by generating random numbers that satisfy certain conditions). After the initialization is completed, the trusted initializer destroys its data and withdraws; it is not needed in the subsequent secure multi-party computation.

Secret-sharing matrix multiplication with a trusted initializer applies to the following case: the complete secret data is the product of a first set of secret shares and a second set of secret shares, and each participant holds one share from the first set and one share from the second set. Through secret-sharing matrix multiplication with a trusted initializer, each participant obtains a partial result of the complete secret data, and the sum of the partial results obtained by all participants equals the complete secret data. Each participant then discloses its partial result to the other participants, so that every participant can obtain the complete secret data without disclosing the secret shares it holds, thereby ensuring the security of each participant's data.

Of course, the above describes only some implementations of secure multi-party computation. Those skilled in the art will appreciate that secure multi-party computation is a mature technique in the art and is not described in further detail herein.

A first testing module, configured to obtain an evaluation result of the first model and an evaluation result of the second model, respectively, using the first party's own test data.

That is, the first model and the second model are each evaluated on the first party's test data to obtain their respective evaluation results.

A first evaluation module, configured to evaluate the contribution degree of the second party's data according to the degree of improvement of the evaluation result of the second model relative to the evaluation result of the first model.

That is, the degree to which the second model's performance on the first party's test data improves over the first model's performance on the same test data is the contribution degree of the second party's data.

A fourth embodiment of the present specification relates to an apparatus for evaluating a data contribution degree. The fourth embodiment is substantially the same as the third embodiment, except that in the third embodiment data sharing involves the first party and the second party, whereas in the fourth embodiment three or more parties participate in data sharing.

In the case of multi-party data sharing, the parties' data can be added one party at a time; that is, each time one more party's data is added, the apparatus of the third embodiment is used to evaluate the contribution degree of that party's data.

The following description will take three parties participating in data sharing as an example:

That is, the apparatus is further configured to evaluate the contribution degree of the third party's data, in which case the apparatus for evaluating the data contribution degree further includes:

A third training module, configured to perform model training jointly with the second party and the third party, using the first party's training data and based on secure multi-party computation, to obtain a third model, wherein the second party and the third party provide their own data during the joint model training with the first party.

A second testing module, configured to obtain an evaluation result of the third model using the first party's test data.

A second evaluation module, configured to evaluate the contribution degree of the third party's data according to the degree of improvement of the evaluation result of the third model relative to the evaluation result of the second model.

For example, suppose testing shows that the accuracy of the first model is 90%, the accuracy of the second model is 91%, and the accuracy of the third model is 93%. The accuracy of the second model is then 1 percentage point higher than that of the first model, and the accuracy of the third model is 2 percentage points higher than that of the second model. The 1-percentage-point improvement of the second model reflects the contribution degree of the second party's data, and the 2-percentage-point improvement of the third model reflects the contribution degree of the third party's data.

By analogy, the apparatus for evaluating the data contribution degree can be applied to data sharing among five parties, six parties, seven parties, and so on, and the contribution degree of each party's data to the final service can be evaluated while protecting the data privacy of each party.

The second embodiment is a method embodiment corresponding to the present embodiment, and the technical details in the second embodiment may be applied to the present embodiment, and the technical details in the present embodiment may also be applied to the second embodiment.

It should be noted that, as those skilled in the art will understand, the functions implemented by the modules shown in the above embodiments of the apparatus for evaluating the data contribution degree can be understood with reference to the related description of the method for evaluating the data contribution degree. The functions of these modules may be implemented by a program (executable instructions) running on a processor, or by specific logic circuits. If the apparatus for evaluating the data contribution degree in the embodiments of the present specification is implemented in the form of a software functional module and sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present specification, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present specification. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk. Thus, the embodiments of the present specification are not limited to any specific combination of hardware and software.

Accordingly, embodiments of the present specification also provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the method embodiments of the present specification. Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, a computer-readable storage medium does not include transitory computer-readable media such as modulated data signals and carrier waves.

In addition, embodiments of the present specification also provide an apparatus for evaluating a data contribution degree, comprising a memory for storing computer-executable instructions, and a processor; the processor is configured to implement the steps of the method embodiments described above when executing the computer-executable instructions in the memory. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. The memory may be a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk, a solid-state disk, or the like. The steps of the method disclosed in the embodiments of the present invention may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor.

It is noted that, in the present patent application, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the use of the verb "comprise a" to define an element does not exclude the presence of another, same element in a process, method, article, or apparatus that comprises the element. In the present patent application, if it is mentioned that a certain action is executed according to a certain element, it means that the action is executed according to at least the element, and two cases are included: performing the action based only on the element, and performing the action based on the element and other elements. Multiple, etc. expressions include 2, 2 2 kinds and more than 2, more than 2 times and more than 2 kinds.

All documents mentioned in this specification are to be considered as incorporated in their entirety into the disclosure of this specification, so that they may serve as a basis for modification where necessary. It should be understood that the above description covers only the preferred embodiments of the present disclosure and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of one or more embodiments of the present disclosure should be included in the scope of protection of one or more embodiments of the present disclosure.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Claims

1. A method of evaluating a degree of data contribution, the method performed by a first party, comprising:

2. The method of claim 1, wherein the method further comprises evaluating a degree of contribution of third party data, the method further comprising:

performing model training jointly with the second party and a third party, using the training data of the first party and based on multi-party secure computation, to obtain a third model, wherein the second party and the third party provide their own data during the model training performed with the first party based on multi-party secure computation;

3. The method of claim 1, wherein the model comprises: a logistic regression model, a neural network model, or a tree model.

4. The method of claim 1 or 2, wherein the multi-party secure computation comprises: garbled circuits, homomorphic encryption, or secret sharing.

5. An apparatus for evaluating a degree of data contribution, the apparatus for use with a first party, comprising:

6. The apparatus of claim 5, wherein the apparatus is further configured to evaluate a degree of contribution of third party data, the apparatus further comprising:

a third training module configured to perform model training jointly with the second party and a third party, using the training data of the first party and based on multi-party secure computation, to obtain a third model, wherein the second party and the third party provide their own data during the model training performed with the first party based on multi-party secure computation;

a second testing module configured to obtain an evaluation result of the third model by using the test data of the first party;

7. The apparatus of claim 5, wherein the model comprises: a logistic regression model, a neural network model, or a tree model.

8. The apparatus of claim 6 or 7, wherein the multi-party secure computation comprises: garbled circuits, homomorphic encryption, or secret sharing.

9. An apparatus for evaluating a degree of data contribution, comprising:

a memory for storing computer-executable instructions; and

a processor for implementing the steps in the method of any one of claims 1-4 when executing the computer-executable instructions.

10. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor implement the steps in the method of any one of claims 1-4.