CN109299161B - Data selection method and device - Google Patents

Data selection method and device

Info

Publication number
CN109299161B
CN109299161B · CN201811286327.4A
Authority
CN
China
Prior art keywords
data
training
sample
parties
party
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811286327.4A
Other languages
Chinese (zh)
Other versions
CN109299161A (en)
Inventor
方文静
王力
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201811286327.4A
Publication of CN109299161A
Application granted
Publication of CN109299161B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Abstract

The embodiments of this specification provide a data selection method and device. The method may include the following steps: training a machine learning model according to the model-entering variables and labels in the training samples, where the training samples also contain non-model-entering variables; inputting the model-entering variables of a test sample into the machine learning model to obtain a predicted value, where the test sample also contains a label; obtaining the residual corresponding to the test sample from the test sample's label and the predicted value; sending the residuals to at least two second data parties, so that each second data party performs regression fitting on the residuals using its own second data and obtains a regression evaluation index; and receiving the regression evaluation indexes returned by the at least two second data parties, so that a subset of the second data parties can be selected by comparing their regression evaluation indexes.

Description

Data selection method and device
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a data selection method and apparatus.
Background
With the rapid development of internet technology, society as a whole has been pushed into the era of "big data". Whether people like it or not, their personal data is constantly being collected and used by businesses, individuals, and other parties. The networking and transparency of personal data has become an unstoppable trend. At the same time, user data is also a dangerous "Pandora's box": once the data is leaked, the privacy of users is violated. In recent years, many user-privacy leakage incidents have occurred, and the protection of personal private data faces serious challenges. The global revolution brought about by big data makes it difficult for individual users to resist the risk of full exposure of their personal privacy. In the face of frequent privacy leakage incidents, the privacy protection problem needs to be solved effectively.
In real business, the following scenario may be encountered: the effect of an existing model is to be improved with the help of variable data from a third-party channel, and the corresponding third-party data is purchased only if that data actually helps the modeling. Therefore, the validity of the third-party data needs to be judged in advance, without acquiring it, and without revealing the private data of one's own users in the process.
Disclosure of Invention
In view of the above, one or more embodiments of the present specification provide a data selection method and apparatus to protect internal data privacy while selecting a part of the data from a plurality of external data sources.
Specifically, one or more embodiments of the present disclosure are implemented by the following technical solutions:
in a first aspect, a data selection method is provided, the method is applied to selecting a part of second data parties from at least two second data parties providing data; the method is performed by a first data party, the first data party having first data comprising: training and testing sets of machine learning models; the training set comprises a plurality of training samples, and the test set comprises a plurality of test samples; the method comprises the following steps:
training the machine learning model according to the model-entering variables and the labels in the training samples; the training samples also comprise non-model-entering variables which do not participate in the training of the machine learning model;
inputting the model-entering variables in the test sample into the machine learning model to obtain a predicted value; the test sample further comprises a label, and the label represents the expected predicted value when the test sample's model-entering variables are input into the machine learning model;
obtaining a residual error corresponding to the test sample according to the label of the test sample and the predicted value;
respectively sending the residual errors to the at least two second data parties, so that the second data parties respectively use the owned second data to perform regression fitting on the residual errors and obtain regression evaluation indexes;
and receiving regression evaluation indexes returned by the at least two second data parties respectively so as to select a part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
In a second aspect, a method for verifying data validity is provided, where the method is performed by a second data party, and includes:
receiving a residual sent by a first data party, wherein the residual is obtained by the first data party from the label of a test sample and the predicted value produced by inputting the test sample's model-entering variables into a machine learning model; the data owned by the first data party includes: a training set and a test set, the training set including a plurality of training samples and the test set including a plurality of test samples; the machine learning model is trained according to the model-entering variables and labels in the training samples; the training samples also comprise non-model-entering variables;
receiving a sample identifier sent by a first data party, and performing sample matching according to the sample identifier to obtain second data for participating in regression fitting;
performing regression fitting on the residual error based on the second data to obtain a regression evaluation index;
and returning the regression evaluation index to the first data party so that the first data party selects part of the second data parties by comparing the regression evaluation indexes of at least two second data parties to be selected.
In a third aspect, a verification apparatus for data validity is provided, the apparatus being applied to select a part of at least two second data parties providing data; the device is applied to a first data party, and first data owned by the first data party comprises: training and testing sets of machine learning models; the training set comprises a plurality of training samples, and the test set comprises a plurality of test samples; the device comprises:
the model training module is used for training the machine learning model according to the model entering variables and the labels in the training samples; the training sample also comprises non-model-entering variables which do not participate in the training of the machine learning model;
the model prediction module is used for inputting the model-entering variables in the test sample into the machine learning model to obtain a predicted value; the test sample further comprises a label, and the label represents the expected predicted value when the test sample's model-entering variables are input into the machine learning model;
the residual error calculation module is used for obtaining a residual error corresponding to the test sample according to the label of the test sample and the predicted value;
the data sending module is used for sending the residual errors to the at least two second data parties respectively so that the second data parties use the owned second data to perform regression fitting on the residual errors respectively and obtain regression evaluation indexes;
and the verification processing module is used for receiving the regression evaluation indexes returned by the at least two second data parties respectively so as to select part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
In a fourth aspect, an apparatus for verifying data validity is provided, the apparatus being applied to a second data party, and the apparatus comprising:
the residual receiving module is used for receiving a residual sent by a first data party, wherein the residual is obtained by the first data party from the label of a test sample and the predicted value produced by inputting the test sample's model-entering variables into a machine learning model; the data owned by the first data party includes: a training set and a test set, the training set including a plurality of training samples and the test set including a plurality of test samples; the machine learning model is trained according to the model-entering variables and labels in the training samples; the training samples also comprise non-model-entering variables;
the data matching module is used for receiving the sample identification sent by the first data party and carrying out sample matching according to the sample identification to obtain second data for participating in regression fitting;
the regression processing module is used for performing regression fitting on the residual error based on the second data to obtain a regression evaluation index;
and the index feedback module is used for returning the regression evaluation index to the first data party so that the first data party selects part of the second data parties by comparing the regression evaluation indexes of at least two second data parties to be selected.
In a fifth aspect, there is provided a device for verifying data validity, the device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program:
training a machine learning model according to the model entering variables and the labels in the training samples; the training sample also comprises non-model-entering variables which do not participate in the training of the machine learning model;
inputting the model-entering variables in the test sample into the machine learning model to obtain a predicted value; the test sample further comprises a label, and the label represents the expected predicted value when the test sample's model-entering variables are input into the machine learning model;
obtaining a residual error corresponding to the test sample according to the label of the test sample and the predicted value;
respectively sending the residual errors to the at least two second data parties, so that the second data parties respectively use the owned second data to perform regression fitting on the residual errors and obtain regression evaluation indexes;
and receiving regression evaluation indexes returned by the at least two second data parties respectively so as to select a part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
In a sixth aspect, there is provided a device for verifying data validity, the device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program:
receiving a residual sent by a first data party, wherein the residual is obtained by the first data party from the label of a test sample and the predicted value produced by inputting the test sample's model-entering variables into a machine learning model; the data owned by the first data party includes: a training set and a test set, the training set including a plurality of training samples and the test set including a plurality of test samples; the machine learning model is trained according to the model-entering variables and labels in the training samples; the training samples also comprise non-model-entering variables;
receiving a sample identifier sent by the first data party, and performing sample matching according to the sample identifier to obtain second data for participating in regression fitting;
performing regression fitting on the residual error based on the second data to obtain a regression evaluation index;
and returning the regression evaluation index to the first data party so that the first data party selects part of the second data parties by comparing the regression evaluation indexes of at least two second data parties to be selected.
According to the data selection method and device of one or more embodiments of this specification, the two data parties exchange only the modeling residuals and the regression evaluation indexes, which are not private user data, so no private user data is revealed during the interaction. Moreover, a subset of data parties can be selected from among multiple data parties according to the regression evaluation indexes they return, so that internal data privacy is protected while some data is selected from multiple external sources.
Drawings
In order to more clearly illustrate the technical solutions in one or more embodiments of the present specification or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments described in one or more embodiments of this specification, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a data set provided in one or more embodiments of the present description;
FIG. 2 is a data selection method provided in one or more embodiments of the present disclosure;
FIG. 3 is a data selection apparatus provided in one or more embodiments herein;
fig. 4 is another data selection apparatus provided in one or more embodiments of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, those solutions are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments, not all of them. All other embodiments obtained by those of ordinary skill in the art from one or more embodiments of this disclosure without creative effort shall fall within the protection scope of the present application.
In real business, such a scenario may be encountered: data party A has its own data and wants to evaluate whether the data of data party B can improve the effect of A's model. For example, suppose data party A trains a machine learning model M using its own data, but finds during model testing that the prediction effect is less than ideal and falls short of the expected prediction. If the data of data party B could improve model M by participating in its training and optimization, data party A could choose to purchase data party B's data to assist the modeling.
The above scenario involves one problem: determining whether the data of data party B is valid. If data party B's data helps the modeling of model M and improves its effect, the data is confirmed to be valid. At least one embodiment of this specification describes a way to verify the validity of data party B's data such that: data party A does not acquire data party B's data, and data party A does not reveal its own data.
The following describes a verification method of data validity, taking a data party a and a data party B as an example, and the method is to verify whether data of the data party B is valid.
For example, data party a may be referred to as a first data party and data party B may be referred to as a second data party.
First, referring to fig. 1, data owned by a first data party may be referred to as first data. The first data may include: a training set and a testing set of machine learning models.
The training set is used for training the machine learning model. For example, for a training sample D_A(X_A, Y_A) in the training set, X_A is the variables and Y_A is the label. The label Y_A represents the expected prediction of the variables X_A through the machine learning model, which corresponds to a supervised model.
The test set is used for prediction with the machine learning model; for example, a test sample D_B(X_B, Y_B) in the test set likewise includes variables and a label.
For example, the variables of the training samples and the test samples may each be divided into "model-entering variables" and "non-model-entering variables". The model-entering variables in the training samples participate in the training of the model, the model-entering variables in the test samples participate in the prediction of the model, and the non-model-entering variables participate in neither training nor prediction.
For example, consider determining whether a user is a good user or a bad user. A user can be represented by a plurality of variables, such as age, address, years of employment, annual income, and so on. Assume a user is represented by 8 variables: U{f1, f2, f3, f4, …, f8} denotes the eight variables f1 to f8 of a user U. When training the model, the five variables f1 to f5 may be used first, while f6 to f8 do not yet participate in the training.
The training set D_A(X_A, Y_A) may then include a plurality of user samples, such as user U1, user U2, and user U3. Each user sample in D_A(X_A, Y_A) comprises variables and a label. The variables X_A may include the five variables f1 to f5 of the user; the variables of each user sample are these same five variables, though their values may differ. The label Y_A indicates whether the user is a good user or a bad user; for example, a good user is denoted by 1 and a bad user by 0.
The test samples D_B(X_B, Y_B) used for prediction by the machine learning model also include variables and a label. When a model prediction is made, the variables of D_B that are used are the five variables f1 to f5 of the user, f6 to f8 do not participate in the prediction, and the label indicates whether the user is a good user or a bad user. When the test set is used for prediction, the model-entering variables of a test sample are input into the trained model, and the model's output is checked for consistency with the label.
The training samples, the test samples, and the model-entering and non-model-entering variables within them may be illustrated by Table 1 below. As shown in Table 1, the samples U1, U2, and U3 participate in the training of the model and may be referred to as the training set. When participating in model training, however, only the variables f1 to f5 take part; these may be called model-entering variables, while the variables f6 to f8, which temporarily do not participate in model training, are called non-model-entering variables. Y_A is the label. The samples U7 and U8 in the test set are used for model prediction: the model-entering variables of these test samples are input into the trained model to obtain the model's output. Similarly, when U7 and U8 are input into the model, only the variables f1 to f5 participate, and f6 to f8 do not. Table 1 below is merely an example; actual implementations are not limited to it, and the variables included in each sample may vary.
Table 1. First data D_A(X_A, Y_A) [table not reproduced in this text]
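As a hedged illustration of this layout, the first data of Table 1 might be represented as follows. All variable values, and the helper names IN_MODEL, NOT_IN_MODEL, make_sample, and model_input, are invented for this sketch and are not part of the patent.

```python
# Sketch of the first data D_A(X_A, Y_A): training samples U1-U3 and test
# samples U7-U8, with f1-f5 as model-entering variables and f6-f8 as
# non-model-entering variables. All values are illustrative.

IN_MODEL = ["f1", "f2", "f3", "f4", "f5"]
NOT_IN_MODEL = ["f6", "f7", "f8"]

def make_sample(values, label):
    variables = dict(zip(IN_MODEL + NOT_IN_MODEL, values))
    return {"x": variables, "y": label}  # y: 1 = good user, 0 = bad user

training_set = {
    "U1": make_sample([0.2, 1.1, 0.5, 0.9, 0.3, 0.7, 0.4, 0.6], 1),
    "U2": make_sample([0.8, 0.2, 1.3, 0.4, 0.5, 0.1, 0.9, 0.2], 0),
    "U3": make_sample([0.5, 0.7, 0.6, 1.0, 0.8, 0.3, 0.2, 0.5], 1),
}
test_set = {
    "U7": make_sample([0.3, 0.9, 0.4, 0.7, 0.6, 0.5, 0.8, 0.1], 1),
    "U8": make_sample([0.9, 0.4, 1.1, 0.2, 0.7, 0.6, 0.3, 0.9], 0),
}

def model_input(sample):
    # Only the model-entering variables feed the model.
    return [sample["x"][f] for f in IN_MODEL]
```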
In the above example, when the model is tested using the test samples of Table 1, its effect is found to be less than ideal. Suppose at least two second data parties, such as several data parties B, can provide data; these data parties B may have different variables, or different values for the same variables. Model optimization may be aided by selecting some of the data parties B from among them. For example, an optimal data party B may be selected out of three, or two or more data parties B may be selected, depending on actual business needs. To evaluate which of the three data parties B is superior, the data selection method of at least one embodiment of this specification may be used.
Fig. 2 illustrates a data selection method provided in at least one embodiment of the present specification, which may include the following processes, and the implementation does not limit the execution sequence of the steps:
in step 200, a machine learning model is trained based on the training samples.
This step may train the model using the model-entering variables and labels in the training samples. For example, the model may be trained using the data of U1, U2, and U3 in Table 1, where U1, U2, and U3 are user samples, each including eight variables, of which the five variables f1 to f5 are used in training.
In step 202, the model-entering variables in the test sample are input into the machine learning model to obtain predicted values.
For example, the test samples U7 and U8 in Table 1 did not participate in the training of the model but may be used to test it. The five variables f1 to f5 of a test sample are used as input to the model trained in step 200, and the model's output is the predicted value. The label in a test sample represents the expected predicted value when the test sample's model-entering variables are input into the machine learning model.
In step 204, the residual corresponding to each test sample is obtained from the predicted value and the label of the test sample. For example, the labels of U7 and U8 in Table 1 are Y_A7 and Y_A8. The residual may be the difference between the predicted value and the label; it represents the gap between the model's actual output and its expected output, and can therefore be used to measure the model's prediction effect.
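Steps 200 to 204 can be sketched end to end. This is a minimal illustration only: a one-variable linear predictor fit by closed-form least squares stands in for the machine learning model, and all sample values are invented.

```python
def fit_linear(xs, ys):
    # Closed-form simple linear regression: predict y ~ w*x + b.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# Party A's data: (model-entering variable, label) pairs; values are illustrative.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # training set (step 200)
test = [(4.0, 8.5), (5.0, 9.6)]                # test set (steps 202-204)

w, b = fit_linear([x for x, _ in train], [y for _, y in train])

# Step 202: predicted value w*x + b; step 204: residual = label - predicted value.
residuals = [y - (w * x + b) for x, y in test]
```

One residual results per test sample; it is these residuals, not the labels or variables, that are sent onward in step 206.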
In step 206, data party A sends the residuals to at least two second data parties to be selected. In this step, the residuals corresponding to data party A's test samples may be sent to each data party B, and the sample identifiers corresponding to the training samples and test samples may also be sent. For example, the sample identifiers may include the user IDs of U1 through U3.
For example, the user IDs may be encrypted with an algorithm such as MD5 to avoid leaking user information. The transmitted residual is only a measure of the gap from the original label, which likewise serves to protect user privacy.
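A hedged sketch of this identifier handling, using Python's standard hashlib (the text suggests MD5; the IDs here are illustrative):

```python
import hashlib

def hashed_id(user_id: str) -> str:
    # MD5 digest of the user ID, sent in place of the raw identifier.
    return hashlib.md5(user_id.encode("utf-8")).hexdigest()

ids = ["U1", "U2", "U3"]
hashed = [hashed_id(u) for u in ids]
```

Note that plain MD5 over short, guessable IDs can be reversed by a dictionary attack, so in practice a salted or keyed hash agreed between the parties may be preferable; the patent text only names MD5 as one example.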
It should be noted that in the following steps 206 to 214, the description takes data party A interacting with two data parties B as an example; in actual implementations there may be more data parties B. In the illustration of Fig. 2, the same step numbers are used both for sending the sample identifiers and residuals to the two data parties B and for the sample matching and regression fitting performed by the two data parties B (for example, both are step 206), but it should be understood that these operations are performed by the two data parties B respectively.
In step 208, the data party B performs sample matching according to the sample identifier to obtain second data.
For example, data party B may perform sample matching based on the user IDs of U1 and U3, obtaining the second data that will participate in the subsequent regression fitting: the data of U1 and U3 owned by data party B, including the variables f9 to f11 (see Table 2).
The second data may include the data corresponding to the sample identifiers of data party A's training samples and test samples. The user samples corresponding to the identifiers of data party A's training samples may also be referred to as training samples at data party B, and those corresponding to the identifiers of data party A's test samples as test samples at data party B.
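Step 208 might look like the following sketch; the records, IDs, and the way the sample matching rate mentioned later in the text falls out of the match are illustrative assumptions.

```python
# Party B's own records, keyed by (possibly hashed) user ID; values invented.
party_b_records = {
    "U1": {"f9": 0.4, "f10": 1.2, "f11": 0.7},
    "U3": {"f9": 0.9, "f10": 0.3, "f11": 1.5},
    "U9": {"f9": 0.2, "f10": 0.8, "f11": 0.6},
}

# Sample identifiers received from party A.
requested_ids = ["U1", "U2", "U3"]

# Sample matching: keep only the requested IDs that party B actually has.
second_data = {uid: party_b_records[uid]
               for uid in requested_ids if uid in party_b_records}

# The fraction matched is the sample matching rate discussed at step 212.
match_rate = len(second_data) / len(requested_ids)
```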
In step 210, data party B performs regression fitting on the residuals based on the variables in the second data to obtain a regression evaluation index.
For example, the test set may include a plurality of user samples, each corresponding to one residual, so a plurality of test samples yields a plurality of residuals. These residuals may be regression-fitted using the variables in data party B's training samples. The purpose of the fitting is to obtain from the training samples a function, such as a polynomial, that fits the residuals well.
For example, assume the plurality of residuals are y_1, y_2, …, y_n, where n is a natural number.
The variables in each training sample may include x_1, x_2, …, x_i, where i is a natural number.
y_1 = a_1·x_11 + a_2·x_12 + … + a_i·x_1i    (1)
y_2 = a_1·x_21 + a_2·x_22 + … + a_i·x_2i    (2)
……
y_n = a_1·x_n1 + a_2·x_n2 + … + a_i·x_ni    (n)
Each residual y_1 to y_n is known, and the variable values in each training sample are also known; for example, {x_11, x_12, …, x_1i} in equation (1) are the values of the variables in one training sample, and {x_21, x_22, …, x_2i} in equation (2) are those in another. The coefficients a_1, a_2, …, a_i can be solved from equations (1) to (n), finally yielding the regression equation y = a_1·x_1 + a_2·x_2 + … + a_i·x_i.
From the resulting regression equation, an importance weight can be obtained for each variable: the values a_1, a_2, …, a_i are the importance weights of the corresponding variables.
The above example is a linear regression, but is not limited to this. Other regression approaches, such as polynomial regression, may also be used.
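A minimal sketch of the linear-regression fit of step 210, assuming ordinary least squares with just two illustrative variables solved via the 2×2 normal equations; a real implementation would typically use a regression library and handle any number of variables.

```python
def fit_two_var(rows, ys):
    # Solve the normal equations (X^T X) a = X^T y for a 2-variable model
    # y = a1*x1 + a2*x2, inverting the 2x2 system by hand.
    s11 = sum(x1 * x1 for x1, _ in rows)
    s12 = sum(x1 * x2 for x1, x2 in rows)
    s22 = sum(x2 * x2 for _, x2 in rows)
    t1 = sum(x1 * y for (x1, _), y in zip(rows, ys))
    t2 = sum(x2 * y for (_, x2), y in zip(rows, ys))
    det = s11 * s22 - s12 * s12
    a1 = (s22 * t1 - s12 * t2) / det
    a2 = (s11 * t2 - s12 * t1) / det
    return a1, a2

# Party B's variable values per matched training sample (invented), and the
# residuals y received from party A.
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
residuals = [2.0, 3.0, 5.0]

a1, a2 = fit_two_var(rows, residuals)  # regression equation: y = a1*x1 + a2*x2
```

The coefficients a1 and a2 play the role of the importance weights described above.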
Furthermore, the regression evaluation index of the regression can be calculated. The regression evaluation index may take various forms, for example the mean square error (MSE), the root mean square error (RMSE), or the mean absolute error (MAE). The regression evaluation index may be used to measure the effect of the regression fit.
For example, taking the mean square error as the regression evaluation index:
MSE = (1/m) · Σ_{i=1..m} (y_i − ŷ_i)²    (5)
In formula (5), m represents the number of test samples, y_i represents the true value, and ŷ_i represents the predicted value; the differences between the true and predicted values are squared, then summed and averaged. Each test sample corresponds to one residual: the residual itself is the true value, and the value obtained by substituting the test sample's variable values into the regression equation above is the predicted value. According to formula (5), taking the difference between the true and predicted value of each test sample, squaring, summing, and averaging yields the mean square error as the regression evaluation index.
In step 212, each second data party returns its regression evaluation index to the first data party. In this step, the at least two data parties B may each return the regression evaluation index they calculated to data party A.
In addition, data party B may also obtain at least one of the following parameters: the sample matching rate and the variable missing rate of the second data. The sample matching rate reflects how much of the data required by data party A can be found by data party B. For example, if data party A sends eight sample identifiers to data party B, i.e., data party B is asked to provide the user samples of eight users, but data party B has only six of them, the sample matching rate is 6/8 × 100% = 75%. The variable missing rate reflects the following: data party B has a certain variable required by data party A, but some of its values are missing. For example, data party B may have the data of 10 user samples, all of which have the variable f10, but the value of f10 is null for two of the users; that variable is then missing for those users, and the variable missing rate is 20%.
Data party B may return the regression evaluation index to data party A, and may also return at least one of the sample matching rate and the variable missing rate, so that the first data party can select data parties by combining the regression evaluation index with the sample matching rate and the variable missing rate.
In step 214, the first data party determines the selected subset of second data parties by comparing the regression evaluation indexes of the plurality of second data parties.
In this step, data party A may compare the regression evaluation indexes alone: for example, the indexes returned by the two data parties B may be compared, and the data party B with the better index is selected. Of course, several data parties B with better regression evaluation indexes may also be selected.
Alternatively, the sample matching rate, the variable missing rate, and the regression evaluation index may be considered together. For example, the second data parties whose sample matching rate exceeds a preset threshold may be selected first, and those with lower matching rates discarded; then, from the parties whose sample matching rate exceeds the threshold, the data party B with the better index is chosen, for example by sorting the regression evaluation indexes and selecting the top-ranked parties. In other examples, further indicators such as the variable missing rate may also be taken into account; for example, thresholds may be set for the sample matching rate and the variable missing rate, and second data parties falling below them are not selected regardless of their regression evaluation index.
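The comparison of steps 212 to 214 might be sketched as below. The thresholds, candidate values, and the convention that a lower MSE indicates a better fit are illustrative assumptions for this sketch, not the patent's prescription.

```python
# Indicators returned by three candidate second data parties; values invented.
candidates = [
    {"party": "B1", "mse": 0.8, "match_rate": 0.95, "miss_rate": 0.05},
    {"party": "B2", "mse": 0.5, "match_rate": 0.60, "miss_rate": 0.10},
    {"party": "B3", "mse": 0.6, "match_rate": 0.90, "miss_rate": 0.02},
]

def select_parties(candidates, min_match=0.75, max_miss=0.20, top_k=1):
    # Filter by sample matching rate and variable missing rate first, then
    # rank the survivors by regression evaluation index (lower MSE = better).
    eligible = [c for c in candidates
                if c["match_rate"] >= min_match and c["miss_rate"] <= max_miss]
    return sorted(eligible, key=lambda c: c["mse"])[:top_k]

chosen = select_parties(candidates)
```

Here B2 has the best MSE but is filtered out by its low matching rate, so B3 wins the ranking among the eligible parties.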
The regression may use any of various regression algorithms, but it must be ensured that every data party B uses the same regression algorithm and the same regression evaluation index, so that differences in these factors do not unfairly affect the subsequent comparison.
The data validity determination in this step may be performed automatically by a computer or manually. For example, after data party B returns the sample matching rate, the variable missing rate, and the regression evaluation index to data party A, a manager of data party A makes the selection of data party B based on these returned indexes.
In the data selection method according to one or more embodiments of the present specification, data party A sends only the modeling residual to the data parties B, and each data party B returns only a regression evaluation index to data party A. The data parties thus exchange only the residual and the evaluation index, not the user's private data, so no user privacy is revealed during the interaction between the data parties. Furthermore, a subset of data parties can be selected from the plurality of data parties B according to the regression evaluation indexes they return, so that a part of the external data is selected while internal data privacy is protected.
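The first data party's side of this exchange can be sketched as below. A plain least-squares linear model with random toy data stands in for whatever machine learning model and first data party A actually uses; all names and values here are illustrative assumptions, not the embodiments' actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data party A's first data: model-entry variables X and labels y,
# split into a training set and a test set (toy stand-ins).
X_train, y_train = rng.normal(size=(80, 3)), rng.normal(size=80)
X_test, y_test = rng.normal(size=(20, 3)), rng.normal(size=20)

# Train a machine learning model on the training samples; here, ordinary
# least squares with an intercept column as a stand-in model.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Predict on the test samples and take the residual: label minus prediction.
# Only this residual vector (plus sample identifiers) is sent to the second
# data parties; the raw model-entry variables and labels never leave party A.
A_test = np.column_stack([X_test, np.ones(len(X_test))])
residuals = y_test - A_test @ coef
```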
Fig. 3 shows a data selection apparatus provided in at least one embodiment of the present specification. The apparatus is used to select a part of the second data parties from at least two second data parties providing data, and is applied to a first data party whose first data comprises a training set and a test set of a machine learning model; the training set includes a plurality of training samples and the test set includes a plurality of test samples. As shown in fig. 3, the apparatus may include: a model training module 31, a model prediction module 32, a residual calculation module 33, a data sending module 34, and a verification processing module 35.
The model training module 31 is configured to train the machine learning model according to the model-entry variables and labels in the training samples; the training samples also include non-model-entry variables that do not participate in machine learning model training.
The model prediction module 32 is configured to input the model-entry variables of a test sample into the machine learning model to obtain a predicted value; the test sample further includes a label, which represents the expected prediction when the test sample's model-entry variables are input into the machine learning model.
The residual calculation module 33 is configured to obtain the residual corresponding to a test sample according to the label of the test sample and the predicted value.
The data sending module 34 is configured to send the residual to the at least two second data parties respectively, so that each second data party uses its own second data to perform regression fitting on the residual and obtain a regression evaluation index.
The verification processing module 35 is configured to receive the regression evaluation indexes returned by the at least two second data parties respectively, so as to select a part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
In one example, the verification processing module 35 is further configured to receive the sample matching rate returned by a second data party, and to select, according to the regression evaluation index, a part of the second data owned by the second data parties from among those of the at least two second data parties whose sample matching rate is higher than a preset threshold.
Fig. 4 is another data selection apparatus provided in at least one embodiment of the present specification, where the apparatus is applied to a second data party, and as shown in fig. 4, the apparatus may include: a residual receiving module 41, a data matching module 42, a regression processing module 43, and an index feedback module 44.
The residual receiving module 41 is configured to receive a residual sent by a first data party, where the residual is obtained by the first data party from the label of a test sample and the predicted value produced by inputting the test sample's model-entry variables into a machine learning model. The data owned by the first data party includes a training set and a test set; the training set includes a plurality of training samples and the test set includes a plurality of test samples. The machine learning model is trained according to the model-entry variables and labels in the training samples; the training samples also include non-model-entry variables.
The data matching module 42 is configured to receive the sample identifiers sent by the first data party and perform sample matching according to the sample identifiers to obtain the second data used for regression fitting.
The regression processing module 43 is configured to perform regression fitting on the residual based on the second data to obtain a regression evaluation index.
The index feedback module 44 is configured to return the regression evaluation index to the first data party, so that the first data party selects a part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties to be selected.
An embodiment of the present specification further provides a device for verifying data validity. The device is applied to a first data party and includes a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the following steps when executing the program:
training a machine learning model according to the model-entry variables and labels in training samples; the training samples also comprise non-model-entry variables which do not participate in the training of the machine learning model;
inputting the model-entry variables of a test sample into the machine learning model to obtain a predicted value; the test sample further comprises a label, the label representing the expected prediction when the test sample's model-entry variables are input into the machine learning model;
obtaining a residual corresponding to the test sample according to the label of the test sample and the predicted value;
sending the residual to at least two second data parties respectively, so that each second data party uses its own second data to perform regression fitting on the residual and obtain a regression evaluation index;
and receiving the regression evaluation indexes returned by the at least two second data parties respectively, so as to select a part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
The present specification further provides a device for verifying data validity, applied to a second data party, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the following steps when executing the program:
receiving a residual sent by a first data party, wherein the residual is obtained by the first data party from the label of a test sample and the predicted value produced by inputting the test sample's model-entry variables into a machine learning model; the data owned by the first data party includes: a training set and a test set, the training set including a plurality of training samples, the test set including a plurality of test samples; the machine learning model is trained according to the model-entry variables and labels in the training samples; the training samples also include non-model-entry variables;
receiving sample identifiers sent by the first data party, and performing sample matching according to the sample identifiers to obtain second data for participating in regression fitting;
performing regression fitting on the residual based on the second data to obtain a regression evaluation index;
and returning the regression evaluation index to the first data party, so that the first data party selects a part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties to be selected.
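The second data party's steps above can be sketched as follows, with random toy data standing in for the received residuals and for the second data, and least squares with an intercept as the regression. Using R² as the regression evaluation index is an illustrative assumption; the embodiments only require that all second data parties use the same algorithm and index.

```python
import numpy as np

rng = np.random.default_rng(1)

# Residuals and the corresponding sample identifiers received from data party A.
sample_ids = list(range(20))
residuals = dict(zip(sample_ids, rng.normal(size=20)))

# Data party B's second data, keyed by sample identifier; here B happens to
# own every requested sample, so the sample matching rate would be 100%.
second_data = {i: rng.normal(size=4) for i in sample_ids}

# Sample matching: keep only the identifiers B actually owns.
matched = [i for i in sample_ids if i in second_data]
X = np.array([second_data[i] for i in matched])
y = np.array([residuals[i] for i in matched])

# Regression fitting of B's variables against the residual (least squares
# with an intercept), then R^2 as the regression evaluation index returned
# to A: higher means B's variables explain more of what A's model missed.
A_mat = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A_mat, y, rcond=None)
fitted = A_mat @ coef
evaluation_index = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
```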
The execution sequence of the steps in the flows shown in the above method embodiments is not limited to the order in the flowcharts. Moreover, each step may be implemented in software, hardware, or a combination thereof; for example, a person skilled in the art may implement a step as software code, i.e., computer-executable instructions that realize the corresponding logical function of the step. When implemented in software, the executable instructions may be stored in a memory and executed by a processor in the device.
The apparatuses or modules illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the modules may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.
One skilled in the art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
One or more embodiments of the present description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the data acquisition device or the data processing device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above description is only exemplary of the preferred embodiment of one or more embodiments of the present disclosure, and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (10)

1. A data selection method applied to selecting a part of second data parties from at least two second data parties providing data; the method is performed by a first data party, the first data party having first data comprising: a training set and a test set of a machine learning model; the training set comprises a plurality of training samples, and the test set comprises a plurality of test samples;
the method comprises the following steps:
training the machine learning model according to the model-entry variables and labels in the training samples; the training samples also comprise non-model-entry variables which do not participate in the training of the machine learning model;
inputting the model-entry variables of a test sample into the machine learning model to obtain a predicted value; the test sample further comprises a label, the label representing the expected prediction when the test sample's model-entry variables are input into the machine learning model;
obtaining a residual error corresponding to the test sample according to the label of the test sample and the predicted value;
respectively sending the residual errors to the at least two second data parties, so that the second data parties respectively use the owned second data to perform regression fitting on the residual errors and obtain regression evaluation indexes;
and receiving regression evaluation indexes returned by the at least two second data parties respectively so as to select a part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
2. The method of claim 1, further comprising:
and sending the sample identifications of the training sample and the test sample to a second data party, so that the second data party performs sample matching according to the sample identifications to obtain second data.
3. The method of claim 1, further comprising:
receiving a sample matching rate returned by the second data party;
and selecting, according to the regression evaluation index, part of the second data owned by the second data parties from among those of the at least two second data parties whose sample matching rate is higher than a preset threshold.
4. A method of data selection, the method performed by a second data party, comprising:
receiving a residual error sent by a first data party, wherein the residual error is obtained by the first data party from the label of a test sample and a predicted value produced by inputting the model-entry variables of the test sample into a machine learning model; the data owned by the first data party includes: a training set and a test set, the training set including a plurality of training samples, the test set including a plurality of test samples; the machine learning model is trained according to the model-entry variables and labels in the training samples; the training samples also comprise non-model-entry variables;
receiving a sample identifier sent by a first data party, and performing sample matching according to the sample identifier to obtain second data for participating in regression fitting;
performing regression fitting on the residual error based on the second data to obtain a regression evaluation index;
and returning the regression evaluation index to the first data party so that the first data party selects part of the second data parties by comparing the regression evaluation indexes of at least two second data parties to be selected.
5. The method of claim 4, further comprising:
acquiring at least one of the following parameters of the second data: sample matching rate and variable missing rate;
and returning the parameters to the first data side, so that the first data side combines the parameters and the regression evaluation indexes to select a part of second data sides.
6. A data selection apparatus, said apparatus being adapted to select a part of second data parties from at least two second data parties providing data; the device is applied to a first data party, and first data owned by the first data party comprises: training and testing sets of machine learning models; the training set comprises a plurality of training samples, and the test set comprises a plurality of test samples; the device comprises:
the model training module is used for training the machine learning model according to the model-entry variables and labels in the training samples; the training samples also comprise non-model-entry variables which do not participate in the training of the machine learning model;
the model prediction module is used for inputting the model-entry variables of a test sample into the machine learning model to obtain a predicted value; the test sample further comprises a label, the label representing the expected prediction when the test sample's model-entry variables are input into the machine learning model;
the residual error calculation module is used for obtaining a residual error corresponding to the test sample according to the label of the test sample and the predicted value;
the data sending module is used for sending the residual errors to the at least two second data parties respectively so that the second data parties use the owned second data to perform regression fitting on the residual errors respectively and obtain regression evaluation indexes;
and the verification processing module is used for receiving the regression evaluation indexes returned by the at least two second data parties respectively so as to select part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
7. The apparatus of claim 6, wherein:
the verification processing module is also used for receiving the sample matching rate returned by the second data party, and for selecting, according to the regression evaluation index, part of the second data owned by the second data parties from among those of the at least two second data parties whose sample matching rate is higher than a preset threshold.
8. A data selection apparatus, the apparatus being applied to a second data party, the apparatus comprising:
the residual error receiving module is used for receiving a residual error sent by a first data party, wherein the residual error is obtained by the first data party from the label of a test sample and a predicted value produced by inputting the model-entry variables of the test sample into a machine learning model; the data owned by the first data party includes: a training set and a test set, the training set including a plurality of training samples, the test set including a plurality of test samples; the machine learning model is trained according to the model-entry variables and labels in the training samples; the training samples also comprise non-model-entry variables;
the data matching module is used for receiving the sample identification sent by the first data party and carrying out sample matching according to the sample identification to obtain second data for participating in regression fitting;
the regression processing module is used for performing regression fitting on the residual error based on the second data to obtain a regression evaluation index;
and the index feedback module is used for returning the regression evaluation index to the first data party so that the first data party selects part of the second data parties by comparing the regression evaluation indexes of at least two second data parties to be selected.
9. A data selection apparatus, the apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the steps of:
training a machine learning model according to the model-entry variables and labels in training samples; the training samples also comprise non-model-entry variables which do not participate in the training of the machine learning model;
inputting the model-entry variables of a test sample into the machine learning model to obtain a predicted value; the test sample further comprises a label, the label representing the expected prediction when the test sample's model-entry variables are input into the machine learning model;
obtaining a residual error corresponding to the test sample according to the label of the test sample and the predicted value;
respectively sending the residual errors to at least two second data parties, so that each second data party respectively uses the owned second data to perform regression fitting on the residual errors, and a regression evaluation index is obtained;
and receiving regression evaluation indexes returned by the at least two second data parties respectively so as to select a part of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
10. A data selection apparatus, the apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the steps of:
receiving a residual error sent by a first data party, wherein the residual error is obtained by the first data party from the label of a test sample and a predicted value produced by inputting the model-entry variables of the test sample into a machine learning model; the data owned by the first data party includes: a training set and a test set, the training set including a plurality of training samples, the test set including a plurality of test samples; the machine learning model is trained according to the model-entry variables and labels in the training samples; the training samples also comprise non-model-entry variables;
receiving a sample identifier sent by the first data party, and performing sample matching according to the sample identifier to obtain second data for participating in regression fitting;
performing regression fitting on the residual error based on the second data to obtain a regression evaluation index;
and returning the regression evaluation index to the first data party so that the first data party selects part of the second data parties by comparing the regression evaluation indexes of at least two second data parties to be selected.
CN201811286327.4A 2018-10-31 2018-10-31 Data selection method and device Active CN109299161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811286327.4A CN109299161B (en) 2018-10-31 2018-10-31 Data selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811286327.4A CN109299161B (en) 2018-10-31 2018-10-31 Data selection method and device

Publications (2)

Publication Number Publication Date
CN109299161A CN109299161A (en) 2019-02-01
CN109299161B true CN109299161B (en) 2022-01-28

Family

ID=65145327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811286327.4A Active CN109299161B (en) 2018-10-31 2018-10-31 Data selection method and device

Country Status (1)

Country Link
CN (1) CN109299161B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961098B (en) * 2019-03-22 2022-03-01 中国科学技术大学 Training data selection method for machine learning
US11295242B2 (en) 2019-11-13 2022-04-05 International Business Machines Corporation Automated data and label creation for supervised machine learning regression testing
CN110968886A (en) * 2019-12-20 2020-04-07 支付宝(杭州)信息技术有限公司 Method and system for screening training samples of machine learning model
CN111401483B (en) * 2020-05-15 2022-05-17 支付宝(杭州)信息技术有限公司 Sample data processing method and device and multi-party model training system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719147A (en) * 2009-11-23 2010-06-02 合肥兆尹信息科技有限责任公司 Rochester model-naive Bayesian model-based data classification system
US9501749B1 (en) * 2012-03-14 2016-11-22 The Mathworks, Inc. Classification and non-parametric regression framework with reduction of trained models
CN108280462A (en) * 2017-12-11 2018-07-13 北京三快在线科技有限公司 A kind of model training method and device, electronic equipment
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning
CN108596757A (en) * 2018-04-23 2018-09-28 大连火眼征信管理有限公司 A kind of personal credit file method and system of intelligences combination

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719147A (en) * 2009-11-23 2010-06-02 合肥兆尹信息科技有限责任公司 Rochester model-naive Bayesian model-based data classification system
US9501749B1 (en) * 2012-03-14 2016-11-22 The Mathworks, Inc. Classification and non-parametric regression framework with reduction of trained models
CN108280462A (en) * 2017-12-11 2018-07-13 北京三快在线科技有限公司 A kind of model training method and device, electronic equipment
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning
CN108596757A (en) * 2018-04-23 2018-09-28 大连火眼征信管理有限公司 A kind of personal credit file method and system of intelligences combination

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Passenger Flow Analysis and Prediction Based on User Behavior Models"; Cheng Qiujiang et al.; Computer Systems & Applications (计算机系统应用); 2015-03-15; pp. 275-279 *

Also Published As

Publication number Publication date
CN109299161A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299161B (en) Data selection method and device
Zou et al. Smart contract development: Challenges and opportunities
Hussain et al. Usability metric for mobile application: a goal question metric (GQM) approach
Siadat et al. Identifying fake feedback in cloud trust management systems using feedback evaluation component and Bayesian game model
CN110427969B (en) Data processing method and device and electronic equipment
Sethi et al. Expert-interviews led analysis of EEVi—A model for effective visualization in cyber-security
CN110166276A (en) A kind of localization method, device, terminal device and the medium of frame synchronization exception
Eckhart et al. Securing the testing process for industrial automation software
Miksa et al. Ensuring sustainability of web services dependent processes
Xiong et al. A method for assigning probability distributions in attack simulation languages
Yaâ et al. A systematic mapping study on cloud-based mobile application testing
Aldini et al. Logics to reason formally about trust computation and manipulation
Bai et al. A qualitative investigation of insecure code propagation from online forums
Wagner et al. Impact of critical infrastructure requirements on service migration guidelines to the cloud
Wright Privacy in iot blockchains: with big data comes big responsibility
Miranda et al. Social coverage for customized test adequacy and selection criteria
Ramachandran et al. Recommendations and best practices for cloud enterprise security
Park et al. Security requirements prioritization based on threat modeling and valuation graph
Bellandi et al. Possibilistic assessment of process-related disclosure risks on the cloud
Eftekhar et al. Towards the development of a widely accepted cloud trust model
Faily Further applications of CAIRIS for usable and secure software design
Gopalakrishna et al. “If security is required”: Engineering and Security Practices for Machine Learning-based IoT Devices
CN109508558A (en) A kind of verification method and device of data validity
Zhang et al. Attack simulation based software protection assessment method
Zhong et al. Design for a cloud-based hybrid Android application security assessment framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40003742

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co., Ltd

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co., Ltd

GR01 Patent grant