CN109299161A

CN109299161A - A kind of data selecting method and device

Info

Publication number: CN109299161A
Application number: CN201811286327.4A
Authority: CN
Inventors: 方文静; 王力; 周俊
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2019-02-01
Anticipated expiration: 2038-10-31
Also published as: CN109299161B

Abstract

This specification embodiment provides a kind of data selecting method and device, wherein method may include: to enter moding amount and label, training machine learning model according in training sample；Training sample further includes not entering moding amount；The moding amount input machine learning model that enters in test sample is obtained into predicted value；Test sample further includes label；According to the label and predicted value of test sample, the corresponding residual error of test sample is obtained；Residual error is respectively sent at least two second data sides, so that each second data side is respectively using the second data regression regression criterion possessed, and obtains returning evaluation index；The recurrence evaluation index that at least two second data sides return respectively is received, by comparing the second data side of recurrence evaluation index selected section of at least two second data sides.

Description

A kind of data selecting method and device

Technical field

This disclosure relates to big data technical field, in particular to a kind of data selecting method and device.

Background technique

With the rapid development of Internet technology, entire society is forcibly pushed into " big data " epoch.No matter whether people It is ready, our personal data are just inadvertently passively collected and used by enterprise, individual.The networking of personal data and Transparence has become irresistible main trend.At the same time, user data is also dangerous " box of Pandora ", data one Denier leakage, the privacy of user will be invaded.In recent years, a lot of privacy of user leakage events are had occurred that, the individual's of citizen Private data guard encounters stern challenge.Big data bring globality is changed, so that individual consumer is difficult confrontation individual The risk that privacy is exposed comprehensively.In face of the privacy leakage event to take place frequently, Privacy Protection needs to obtain effective solution.

In practical business, we are likely encountered such scene: needing to come by the variable data of third party's channel The effect of existing model is promoted, only when these data, which model us, understands helpful, just buys corresponding third number formulary According to.It would therefore be desirable to judge its validity in advance in the case where not obtaining third party's data, and in this process not The private data of our user can be revealed.

Summary of the invention

In view of this, this specification one or more embodiment provides a kind of data selecting method and device, by more Internal data privacy is protected in a external data while selected section data.

Specifically, this specification one or more embodiment is achieved by the following technical solution:

In a first aspect, providing a kind of data selecting method, the method is applied to by at least two second of offer data The second data side of selected section in data side；The method is executed by the first data side, the first data side possess first Data include: the training set and test set of machine learning model；The training set includes multiple training samples, the test set packet Include multiple test samples；The described method includes:

Enter moding amount and label, the training machine learning model according in training sample；The training sample also wraps Include have neither part nor lot in machine learning model training do not enter moding amount；

It moding amount will be entered described in the test sample inputs the machine learning model to obtain predicted value；The test Sample further includes label, the expection predicted value for entering moding amount input machine learning model of the tag representation test sample；

According to the label of the test sample and the predicted value, the corresponding residual error of the test sample is obtained；

The residual error is respectively sent at least two second data side, so that each second data side makes respectively It is fitted the residual error with the second data regression possessed, and obtains returning evaluation index；

The recurrence evaluation index that at least two second data side returns respectively is received, by comparing described at least two The second data side of recurrence evaluation index selected section of a second data side.

Second aspect, provides a kind of verification method of data validity, and the method is executed by the second data side, comprising:

The residual error of the first data side transmission is received, the residual error is that the first data root according in test sample enters moding amount The label of predicted value and test sample that input machine learning model obtains obtains；The data packet that the first data side possesses Include: training set and test set, the training set include multiple training samples, and the test set includes multiple test samples；It is described Machine learning model be according in training sample enter moding amount and label training obtains；It further include not entering in the training sample Moding amount；

The sample identification of the first data side transmission is received, and sample matches are carried out according to the sample identification and are obtained for joining With the second data of regression fit；

It is fitted the residual error based on second data regression, obtains returning evaluation index；

The recurrence evaluation index is returned into the first data side so that the first data side by comparing it is to be selected extremely The second data side of recurrence evaluation index selected section of few two the second data sides.

The third aspect, provides a kind of verifying device of data validity, and described device is applied to by offer data at least The second data side of selected section in two the second data sides；Described device is applied to the first data side, and the first data side is gathered around The first data having include: the training set and test set of machine learning model；The training set includes multiple training samples, described Test set includes multiple test samples；Described device includes:

Model training module, for entering moding amount and label, the training machine learning according in the training sample Model；The training sample further include have neither part nor lot in machine learning model training do not enter moding amount；

Model prediction module inputs the machine learning model for will enter moding amount described in the test sample and obtains To predicted value；The test sample further includes label, and the tag representation test sample enters moding amount input machine learning mould The expection predicted value of type；

Residual computations module, for according to the test sample label and the predicted value, it is corresponding to obtain test sample Residual error；

Data transmission blocks, for the residual error to be respectively sent at least two second data side, so that respectively A second data side is fitted the residual error using the second data regression possessed respectively, and obtains returning evaluation index；

Verification processing module, the recurrence evaluation index returned respectively for receiving at least two second data side, with By comparing the second data side of recurrence evaluation index selected section of at least two second data side.

Fourth aspect, provides a kind of verifying device of data validity, and described device is applied to the second data side, the device Include:

Residual error receiving module, for receiving the residual error of the first data side transmission, the residual error is the first data root according to survey The predicted value that moding amount input machine learning model obtains of entering in sample sheet and label obtain；What the first data side possessed Data include: training set and test set, and the training set includes multiple training samples, and the test set includes multiple test specimens This；The machine learning model be according in training sample enter moding amount and label training obtains；In the training sample also Including not entering moding amount；

Data match module, the sample identification sent for receiving the first data side, and according to the sample identification It carries out sample matches and obtains the second data for participating in regression fit；

Processing module is returned, for being fitted the residual error based on second data regression, obtains returning evaluation index；

Index feedback module returns to the first data side for that will return evaluation index, so that the first data side passes through Compare the second data side of recurrence evaluation index selected section of at least two second data sides to be selected.

5th aspect, provides a kind of verifying equipment of data validity, the equipment includes memory, processor and storage On a memory and the computer program that can run on a processor, the processor realize following step when executing described program It is rapid:

Enter moding amount and label, training machine learning model according in training sample；The training sample further includes not Participate in machine learning model training does not enter moding amount；

6th aspect, provides a kind of verifying equipment of data validity, the equipment includes memory, processor and storage On a memory and the computer program that can run on a processor, the processor realize following step when executing described program It is rapid:

The sample identification that the first data side is sent is received, and sample matches are carried out according to the sample identification and are used In the second data for participating in regression fit；

The data selecting method and device of this specification one or more embodiment pass through two data side's interactive modelings Residual sum returns evaluation index, and the private data of non-user, therefore appointing for user can not be revealed in both sides' interactive process What private data.Also, it can also be according to the recurrence evaluation index that multiple data sides return by selected section in multiple data sides Data protect internal data privacy while realizing the selected section data in by multiple external datas.

Detailed description of the invention

In order to illustrate more clearly of this specification one or more embodiment or technical solution in the prior art, below will A brief introduction will be made to the drawings that need to be used in the embodiment or the description of the prior art, it should be apparent that, it is described below Attached drawing is only some embodiments recorded in this specification one or more embodiment, and those of ordinary skill in the art are come It says, without any creative labor, is also possible to obtain other drawings based on these drawings.

Fig. 1 is the data set schematic diagram that this specification one or more embodiment provides；

Fig. 2 is the data selecting method that this specification one or more embodiment provides；

Fig. 3 is a kind of data selection means that this specification one or more embodiment provides；

Fig. 4 is another data selection means that this specification one or more embodiment provides.

Specific embodiment

In order to make those skilled in the art more fully understand the technical solution in this specification one or more embodiment, Below in conjunction with the attached drawing in this specification one or more embodiment, to the technology in this specification one or more embodiment Scheme is clearly and completely described, it is clear that described embodiment is only a part of the embodiment, rather than whole realities Apply example.Based on this specification one or more embodiment, those of ordinary skill in the art are not making creative work premise Under every other embodiment obtained, shall fall within the protection scope of the present application.

In practical business, be likely encountered such scene: data side A possesses own data, it is desirable to if evaluation and test By the data of data side B, the modelling effect of itself can be promoted.For example, it is assumed that data side A utilizes owned number It according to a machine learning model M is had trained, still, is found in model measurement, the prediction effect of the model is not satisfactory, and pre- Phase predicted value has a certain distance.If participating in the training and optimization of model M using the data of data side B, mould can be made The effect of type M is promoted, then can choose the data of purchase data side B to assist modeling.

In above-mentioned scene, it is involved in a problem i.e.: how to determine whether data side B is effective, if data side B Data it is helpful to the modeling of model M, facilitate the effect of lift scheme M, then confirm that the data of data side B are effective. And the data validity of which kind of mode verify data side B is used, it will be at least one embodiment of this specification content to be described, Also, in the verification method of data validity, will realize: data side A does not obtain the data of data side B, and data side A is not let out That reveals itself possesses data.

As follows by taking data side A and data side B as an example, the verification method of data validity is described, and this method will verify number It is whether effective according to the data of square B.

For example, data side A can be known as to the first data side, data side B is known as the second data side.

Firstly, shown in Figure 1, the data that the first data side possesses are properly termed as the first data.In first data It may include: the training set and test set of machine learning model.

Wherein, training set is used for the training of machine learning model, for example, the training sample D in the training set_A(X_A, Y_A) In, X_AIt is variable, Y_AIt is label.The label Y_AIndicate above-mentioned variable X_ABy the expection predicted value of the machine learning model, It is equivalent to a kind of model for having supervision.

Test set is used for the prediction of machine learning model, for example, the test sample D in test set_B(X_B, Y_B) equally include Variable and label.

For example, the variable of above-mentioned training sample and test sample, it can include " entering moding amount " and " do not enter moding Amount ".Wherein, the training for entering moding amount and taking part in model in training sample, and the moding amount that enters in test sample takes part in mould Type prediction, and do not enter training and prediction that moding amount is not engaged in model.

Be exemplified below: for judging that some user is high-quality user or user inferior, which can use multiple changes Amount indicates, for example, age, address, length of service, annual income etc..Assuming that a user can be indicated with 8 variables, U f1, F2, f3, f4 ... .f8 } it be a user U include this eight variables of f1 to f8.It, can be first using wherein in training pattern Five variable f1 to f5, and f6 to f8 is temporarily first not involved in the training of model.

So, in training sample D_A(X_A, Y_A) in, it may include multiple user's samples, for example, user U1, user U2, use Family U3 etc..Each user's sample is D_A(X_A, Y_A), including variable and label, variable X therein_AIt may include above-mentioned use Five variable f1 to f5 at family, the variable in each user's sample are this five variables, and variate-value can be different；And it is described Label Y_ACan be the user is high-quality user or user inferior, for example, high-quality user is indicated with 11,00 table of user inferior Show.

The test sample D of prediction for machine learning model_B(X_B, Y_B) it equally include variable and label, carrying out model When prediction, D_BThe variable used includes five variable f1 to f5 of user, and f6 to f8 has neither part nor lot in prediction, and label is that the user is excellent Matter user or user inferior.Test set is the moding amount that enters of test sample to be inputted trained model, and sentence in prediction Whether the output result of disconnected model is consistent with label.

By 1 example training sample of table, test sample and therein it can enter moding amount and do not enter moding amount as follows. As shown in table 1, these samples of U1, U2 and U3 will participate in the training of model, be properly termed as training set.But participating in model instruction When practicing, f1 only therein to f5 variable is participated in, and is properly termed as into moding amount, and f6 is temporarily not engaged in model to f8 variable and instructs Practice, does not enter moding amount referred to as.Y_AIt is label.For another example, these samples of the U7 in test set and U8 are the predictions for model, will The moding amount that enters in these test samples inputs trained model, and obtains the output result of model.Likewise, U7 and U8 exist When input model, and only f1 is participated in f5 variable, and f6 to f8 variable has neither part nor lot in.If the following table 1 is only example, in actual implementation It is not limited thereto, the variable for including in each sample can change.

1 first data D of table_A(X_A, Y_A)

In the above example, when testing using the test sample in table 1 model, the effect of model is found less Ideal, at this time, it is assumed that have at least two second data sides such as data side B that can provide data, this multiple data side B can possess Different variables, or possess the different variate-values of identical variable.Can by selected section data side B in multiple data side B Lai Assist Optimized model.For example, can be by one optimal data side B of selection in three data side B, or also can choose two Or multiple data side B, decision can be considered according to actual business requirement.And it is more excellent which how to be assessed in these three data sides B, The data selecting method of at least one embodiment of this specification then can be used.

Fig. 2 describes the data selecting method of at least one embodiment of this specification offer, and this method may include as follows It handles, do not limit each step in specific implementation executes sequence:

In step 200, according to training sample, training machine learning model.

This step, which can be used in training sample, enters moding amount and label training pattern.For example, can be in table 1 The data training pattern of U1, U2 and U3, U1, U2 and U3 therein are user's samples, and each user's sample may include eight changes Amount, and in training, five variables of f1 therein to f5 can be used.

In step 202, the moding amount input machine learning model that enters in test sample is obtained into predicted value.

For example, the test sample U7 and U8 in table 1 are not engaged in the training of model, but it can be used for the test of model. It in the model that training is completed in input step 200, can be obtained using five variables of f1 to f5 in test sample as inputting Model exports result, that is, predicted value.The moding amount that enters of tag representation test sample in the test sample inputs machine learning mould The expection predicted value of type.

In step 204, according to the label in predicted value and test sample, the corresponding residual error of the test sample is obtained. For example, the corresponding label of U7 and U8 is the Y in table 1_A7And Y_A8, and residual error can be the difference between predicted value and label, the residual error It can be used to indicate that the difference between the reality output result of model and desired output result, so as to for measuring model Prediction effect.

In step 206, the residual error is sent respectively at least two second data sides to be selected by data side A.This The corresponding residual error of the test sample of data side A can be sent to data side B by step, also that training sample and test sample is corresponding Sample identification be sent to data side B.For example, the sample identification may include the User ID of U1 to U3.

For example, User ID can be encrypted by Encryption Algorithm such as MD5, to avoid user information leakage.What is transmitted is residual Difference is that the gap between original tag is measured, and can also play the purpose of protection privacy of user.

Wherein, it should be noted that in following step, in step 206 to step 214, with data side A to two data It is described for square B, can there is greater number of data side B in actual implementation.In the signal of Fig. 2, to two data side B Sample matches and the regression fit processing that transmission sample identification and residual error and the two data sides B are respectively carried out, have used phase Same step serial number, for example, being all step 206, however, it will be understood that this is that two data side B are respectively executed Operation.

In a step 208, data side B carries out sample matches according to the sample identification, obtains the second data.

For example, data side B can carry out sample matches according to the User ID of U1 and U3, obtains and intend for participating in subsequent return The second data closed.For example, may refer to above-mentioned table 2, the data of U1 and U3 that data side B possesses are obtained, and obtain variable F9 to f11.

May include in second data corresponding data side A training sample and test sample sample ID data.It can User's sample of the sample identification of the training sample of corresponding data side A is also referred to as the training sample in data side B, will correspond to User's sample of the sample identification of the test sample of data side A is known as the test sample in data side B.

In step 210, data side B is fitted the residual error based on the variable regression in second data, is returned Evaluation index.

For example, multiple user's samples in test sample, each sample can correspond to a residual error, and multiple samples can To obtain multiple residual errors.Each variable regression in the training sample of data side B can be used and be fitted above-mentioned multiple residual errors.It is quasi- The purpose of conjunction is to fit a polynomial function according to training sample, this function can be good at being fitted above-mentioned Multiple residual errors.

For example, it is assumed that above-mentioned multiple residual errors may include y₁、y₂……y_n.Wherein, n is natural number.

Variable in each training sample may include: x₁、x₂……x_i.Wherein, i is natural number.

y₁=a₁*x₁₁+a₂*x₁₂+…….a_i*x_1i；……(1)

y₂=a₁*x₂₁+a₂*x₂₂+…….a_i*x_2i；……(2)

……………

y_n=a₁*x_n1+a₂*x_n2+…….a_i*x_ni；……(n)

Wherein, each residual error y₁To y_nBe it is known, the value of the variable in each training sample is also known, for example, { x in above-mentioned formula (1)₁₁、x₁₂……x_1nBe each variable in a training sample value, { the x in formula (2)₂₁、 x₂₂……x_2nBe each variable in another training sample value.It, can be with by above-mentioned formula (1) to formula (n) Obtain coefficient a₁、a₂……a_iValue, finally obtain regression equation y=a₁*x₁+a₂*x₂+…….a_i*x_i。

The corresponding variable importance weight of the available each variable of the regression equation acquired, above-mentioned a₁、a₂…… a_iValue be the corresponding variable importance weight of each variable.

It should be noted that above-mentioned citing is by taking linear regression as an example, however, it is not limited to this.Other can also be used Recurrence mode, e.g., polynomial regression.

Also, the recurrence evaluation index of this recurrence can also be calculated.Return evaluation index can there are many, for example, can To be mean square error, root-mean-square error (Root Mean Squard Error, RMSE), mean absolute error etc..Return evaluation Index can be used for measuring the effect of regression fit.

For example, returning evaluation index by taking mean square error as an example:

In formula (5), m indicates the quantity of test sample, y_iIndicate true value, y_nIndicate predicted value, true value and prediction Value makes the difference, then square after sum-average arithmetic.For example, for each test sample, the corresponding residual error of each test sample, with For one of test sample, the corresponding residual error of the test sample is exactly true value, and uses the variable in the test sample Value substitute into regression equation obtained above, obtained residual values are exactly predicted value.According to above-mentioned formula (5), to each survey The true value and predicted value of sample sheet make the difference, and square after sum-average arithmetic, it can obtain return evaluation index mean square error.

In the step 212, the second data side returns to the first data side for evaluation index is returned.It is described in this step The recurrence evaluation index that oneself is calculated can be returned to data side A respectively by least two data side B.

In addition, data side B can also obtain at least one of following parameter: sample matches rate and the variable missing of the second data Rate.Wherein, the sample matches rate, which can be understood as data side B, can find the data that the data side A of much ratios is required, For example, the sample identification that data side A is transmitted to data side B there are eight, that is, data side B is required to provide user's sample of eight users. And data side B only has 6, then sample matching rate can be 6/8*100%=75%.The variable miss rate is understood that Are as follows: data side B can find some variable of data side A requirement, only some missings of variate-value.For example, the data side side B has 10 The data of a user's sample, all there are also variable f10 for this 10 user's samples, but wherein there are two variable of the user at f10 Value is sky, that is, variable missing occurs, variable miss rate can be 20%.

Data side B can will return evaluation index and return to data side A, can also lack the sample matches rate and variable At least one of mistake rate returns to data side A, so that the first data side is in conjunction with recurrence evaluation index, the sample matches rate The selection of data side is carried out with variable miss rate.

In step 214, the first data side is selected by comparing the recurrence evaluation index of multiple second data sides to determine The second data side of part.

In this step, data side A can be individually according to the comparison for returning evaluation index, for example, can be by two data side B The recurrence evaluation index of return compares, which index is more excellent just to select by which data side B.It can certainly select to return and evaluate The preferably multiple data side B of index.

Alternatively, sample matches rate, variable miss rate can also be comprehensively considered and return evaluation index, for example, can first select The second data side that sample matches rate is higher than preset threshold is selected out, matching rate is lower to be given up.It is high by sample matches rate again The selective goal preferably data side B in the second data of preset threshold is ranked up for example, evaluation index can will be returned, Data side B of the selected and sorted at former.Certainly, in other examples, can also comprehensively consider again variable miss rate etc. its His index.For example, can be sample matches rate and sample miss rate given threshold, no matter the second data recurrence lower than threshold value is commented How is valence index, not reselection.

A variety of regression algorithms can be used in above-mentioned recurrence, but are the need to ensure that each data side B uses unification Regression algorithm and unified recurrence evaluation index, avoid due to each data side B select Different Effects subsequent contrast it is just.

In addition, the judgement of the data validity of this step, can be computer automatic execution, it is also possible to manually perform, For example, data side B by sample matches rate, sample miss rate and is being returned after evaluation index returns to data side A, by data side A Administrative staff judged according to these indexs returned, to carry out the selection of data side B.

The residual error of modeling is only sent to by the data selecting method of this specification one or more embodiment, data side A Multiple data side B, multiple data side B also will only return evaluation index and return to data side A, and the interaction of data side is that modeling is residual Difference and recurrence evaluation index, and the private data of non-user, therefore any of user can not be revealed in both sides' interactive process Private data.Also, it can also be according to the recurrence evaluation index that multiple data side B are returned by selected section in multiple data side B Data protect internal data privacy while realizing the selected section data in by multiple external datas.

Fig. 3 is the data selection means that at least one embodiment of this specification provides, and described device is applied to by offer number According at least two second data sides in the second data side of selected section；Described device be applied to the first data side, described first The first data that data side possesses include: the training set and test set of machine learning model；The training set includes multiple training Sample, the test set include multiple test samples.As shown in figure 3, the apparatus may include: model training module 31, model Prediction module 32, residual computations module 33, data transmission blocks 34 and verification processing module 35.

Model training module 31, for entering moding amount and label, the training engineering according in the training sample Practise model；The training sample further include have neither part nor lot in machine learning model training do not enter moding amount.

Model prediction module 32 inputs the machine learning model for will enter moding amount described in the test sample Obtain predicted value；The test sample further includes label, and the tag representation test sample enters the input machine learning of moding amount The expection predicted value of model.

Residual computations module 33, for according to the test sample label and the predicted value, obtain test sample pair The residual error answered.

Data transmission blocks 34, for the residual error to be respectively sent at least two second data side, so that Each second data side is fitted the residual error using the second data regression possessed respectively, and obtains returning evaluation index；

Verification processing module 35, the recurrence evaluation index returned respectively for receiving at least two second data side, By comparing the second data side of recurrence evaluation index selected section of at least two second data sides.

In one example, verification processing module 35 is also used to receive the sample matches rate of the second data side return；By sample This matching rate is higher than in the second data of preset threshold, according to recurrence evaluation index by selector at least two second data sides The second data for dividing the second data side to possess.

Fig. 4 is another data selection means that provide of at least one embodiment of this specification, and described device is applied to the Two data sides, as shown in figure 4, the apparatus may include: residual error receiving module 41, returns processing module at data match module 42 43 and index feedback module 44.

Residual error receiving module 41, for receiving the residual error of the first data side transmission, the residual error is the first data root evidence The predicted value that moding amount input machine learning model obtains of entering in test sample and label obtain；The first data side possesses Data include: training set and test set, the training set includes multiple training samples, and the test set includes multiple test specimens This；The machine learning model be according in training sample enter moding amount and label training obtains；In the training sample also Including not entering moding amount.

Data match module 42, for receiving the corresponding sample identification of the training sample, and according to the sample identification It carries out sample matches and obtains the second data for participating in regression fit.

Processing module 43 is returned, for being fitted the residual error based on second data regression, obtains returning evaluation index；

Index feedback module 44, for the recurrence evaluation index to be returned to the first data side, so that the first data Square the second data side of recurrence evaluation index selected section by comparing at least two second data sides to be selected.

This specification embodiment additionally provides a kind of verifying equipment of data validity, and the equipment application is in the first data Side, the equipment include memory, processor and storage on a memory and the computer program that can run on a processor, institute It states when processor executes described program and performs the steps of

This specification embodiment additionally provides a kind of verifying equipment of data validity, and the equipment application is in the second data Side, the equipment include memory, processor and storage on a memory and the computer program that can run on a processor, institute It states when processor executes described program and performs the steps of

The corresponding sample identification of the training sample is received, and sample matches are carried out according to the sample identification and are used for Participate in the second data of regression fit；

Each step in process shown in above method embodiment, execution sequence are not limited to suitable in flow chart Sequence.In addition, the description of each step, can be implemented as software, hardware or its form combined, for example, those skilled in the art Member can implement these as the form of software code, can be can be realized the computer of the corresponding logic function of the step can It executes instruction.When it is realized in the form of software, the executable instruction be can store in memory, and by equipment Processor execute.

The device or module that above-described embodiment illustrates can specifically realize by computer chip or entity, or by having The product of certain function is realized.A kind of typically to realize that equipment is computer, the concrete form of computer can be personal meter Calculation machine, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation are set It is any several in standby, E-mail receiver/send equipment, game console, tablet computer, wearable device or these equipment The combination of equipment.

For convenience of description, it is divided into various modules when description apparatus above with function to describe respectively.Certainly, implementing this The function of each module can be realized in the same or multiple software and or hardware when specification one or more embodiment.

It should be understood by those skilled in the art that, this specification one or more embodiment can provide for method, system or Computer program product.Therefore, complete hardware embodiment can be used in this specification one or more embodiment, complete software is implemented The form of example or embodiment combining software and hardware aspects.Moreover, this specification one or more embodiment can be used one It is a or it is multiple wherein include computer usable program code computer-usable storage medium (including but not limited to disk storage Device, CD-ROM, optical memory etc.) on the form of computer program product implemented.

These computer program instructions, which may also be stored in, is able to guide computer or other programmable data processing devices with spy Determine in the computer-readable memory that mode works, so that it includes referring to that instruction stored in the computer readable memory, which generates, Enable the manufacture of device, the command device realize in one box of one or more flows of the flowchart and/or block diagram or The function of being specified in multiple boxes.

These computer program instructions also can be loaded onto a computer or other programmable data processing device, so that counting Series of operation steps are executed on calculation machine or other programmable devices to generate computer implemented processing, thus in computer or The instruction executed on other programmable devices is provided for realizing in one or more flows of the flowchart and/or block diagram one The step of function of being specified in a box or multiple boxes.

It should also be noted that, the terms "include", "comprise" or its any other variant are intended to nonexcludability It include so that the process, method, commodity or the equipment that include a series of elements not only include those elements, but also to wrap Include other elements that are not explicitly listed, or further include for this process, method, commodity or equipment intrinsic want Element.In the absence of more restrictions, the element limited by sentence "including a ...", it is not excluded that including described want There is also other identical elements in the process, method of element, commodity or equipment.

This specification one or more embodiment can computer executable instructions it is general on It hereinafter describes, such as program module.Generally, program module includes executing particular task or realization particular abstract data type Routine, programs, objects, component, data structure etc..Can also practice in a distributed computing environment this specification one or Multiple embodiments, in these distributed computing environments, by being executed by the connected remote processing devices of communication network Task.In a distributed computing environment, the local and remote computer that program module can be located at including storage equipment is deposited In storage media.

All the embodiments in this specification are described in a progressive manner, same and similar portion between each embodiment Dividing may refer to each other, and each embodiment focuses on the differences from other embodiments.It is adopted especially for data For collecting equipment or data processing equipment embodiment, since it is substantially similar to the method embodiment, so the comparison of description is simple Single, the relevent part can refer to the partial explaination of embodiments of method.

It is above-mentioned that this specification specific embodiment is described.Other embodiments are in the scope of the appended claims It is interior.In some cases, the movement recorded in detail in the claims or step can be come according to the sequence being different from embodiment It executes and desired result still may be implemented.In addition, process depicted in the drawing not necessarily require show it is specific suitable Sequence or consecutive order are just able to achieve desired result.In some embodiments, multitasking and parallel processing be also can With or may be advantageous.

The foregoing is merely the preferred embodiments of this specification one or more embodiment, not to limit this public affairs It opens, all within the spirit and principle of the disclosure, any modification, equivalent substitution, improvement and etc. done should be included in the disclosure Within the scope of protection.

Claims

1. a kind of data selecting method, the method is applied to by selected section at least two second data sides of offer data Second data side；The method is executed by the first data side, and the first data that the first data side possesses include: machine learning The training set and test set of model；The training set includes multiple training samples, and the test set includes multiple test samples；

The described method includes:

Enter moding amount and label, the training machine learning model according in training sample；The training sample further includes not Participate in machine learning model training does not enter moding amount；

It moding amount will be entered described in the test sample inputs the machine learning model to obtain predicted value；The test sample It further include label, the expection predicted value for entering moding amount input machine learning model of the tag representation test sample；

The residual error is respectively sent at least two second data side, so that each second data side is respectively using gathering around The second data regression having is fitted the residual error, and obtains returning evaluation index；

The recurrence evaluation index that at least two second data side returns respectively is received, by comparing described at least two the The second data side of recurrence evaluation index selected section of two data sides.

2. according to the method described in claim 1, the method also includes:

The sample identification of the training sample and test sample is sent to the second data side, so that the second data root is according to institute It states sample identification progress sample matches and obtains second data.

3. according to the method described in claim 1, the method also includes:

Receive the sample matches rate of the second data side return；

It is higher than in the second data of preset threshold by sample matches rate, according to recurrence evaluation index by least two second data sides The second data that middle the second data side of selected section possesses.

4. a kind of data selecting method, the method is executed by the second data side, comprising:

The residual error of the first data side transmission is received, the residual error is that the first data root is inputted according to the moding amount that enters in test sample The label of predicted value and test sample that machine learning model obtains obtains；The data that the first data side possesses include: Training set and test set, the training set include multiple training samples, and the test set includes multiple test samples；The machine Learning model be according in training sample enter moding amount and label training obtains；It further include not entering moding in the training sample Amount；

The sample identification of the first data side transmission is received, and sample matches are carried out according to the sample identification and are obtained for participating in back Return the second data of fitting；

The recurrence evaluation index is returned into the first data side, so that the first data side is by comparing to be selected at least two The second data side of recurrence evaluation index selected section of a second data side.

5. according to the method described in claim 4, the method also includes:

Obtain at least one following parameter of second data: sample matches rate and variable miss rate；

The parameter is returned into the first data side, so that the first data side is in conjunction with the parameter and returns evaluation index selection The second data side of part.

6. a kind of data selection means, described device is applied to by selected section at least two second data sides of offer data Second data side；Described device is applied to the first data side, and the first data that the first data side possesses include: machine learning The training set and test set of model；The training set includes multiple training samples, and the test set includes multiple test samples；Institute Stating device includes:

Model training module, for entering moding amount and label, the training machine learning model according in the training sample； The training sample further include have neither part nor lot in machine learning model training do not enter moding amount；

Model prediction module, for will enter described in the test sample moding amount input the machine learning model obtain it is pre- Measured value；The test sample further includes label, and the moding amount that enters of the tag representation test sample inputs machine learning model It is expected that predicted value；

Residual computations module, for according to the test sample label and the predicted value, it is corresponding residual to obtain test sample Difference；

Data transmission blocks, for the residual error to be respectively sent at least two second data side, so that each Two data sides are fitted the residual error using the second data regression possessed respectively, and obtain returning evaluation index；

Verification processing module, the recurrence evaluation index returned respectively for receiving at least two second data side, to pass through Compare the second data side of recurrence evaluation index selected section of at least two second data side.

7. device according to claim 6,

The verification processing module is also used to receive the sample matches rate of the second data side return；It is higher than by sample matches rate pre- If in the second data of threshold value, being gathered around according to evaluation index is returned by the second data side of selected section at least two second data sides The second data having.

8. a kind of data selection means, described device is applied to the second data side, which includes:

Residual error receiving module, for receiving the residual error of the first data side transmission, the residual error is the first data root according to test specimens The predicted value that moding amount input machine learning model obtains of entering in this and label obtain；The data that the first data side possesses It include: training set and test set, the training set includes multiple training samples, and the test set includes multiple test samples；Institute State machine learning model be according in training sample enter moding amount and label training obtains；It further include not in the training sample Enter moding amount；

Data match module, the sample identification sent for receiving the first data side, and carried out according to the sample identification Sample matches obtain the second data for participating in regression fit；

Index feedback module, for the recurrence evaluation index to be returned to the first data side, so that the first data side passes through Compare the second data side of recurrence evaluation index selected section of at least two second data sides to be selected.

9. a kind of data selection equipment, the equipment include memory, processor and storage on a memory and can be in processor The computer program of upper operation, the processor perform the steps of when executing described program

Enter moding amount and label, training machine learning model according in training sample；The training sample further includes having neither part nor lot in Machine learning model training does not enter moding amount；

10. a kind of data selection equipment, the equipment include memory, processor and storage on a memory and can be in processor The computer program of upper operation, the processor perform the steps of when executing described program

The sample identification that the first data side is sent is received, and sample matches are carried out according to the sample identification and are obtained for joining With the second data of regression fit；