CN109299161A - Data selection method and device - Google Patents

Data selection method and device

Info

Publication number
CN109299161A
Authority
CN
China
Prior art keywords
data
sample
data side
training
evaluation index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811286327.4A
Other languages
Chinese (zh)
Other versions
CN109299161B (en)
Inventor
方文静
王力
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811286327.4A
Publication of CN109299161A
Application granted
Publication of CN109299161B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Abstract

Embodiments of this specification provide a data selection method and device. The method may include: training a machine learning model according to the model-input variables and labels in training samples, where the training samples further include non-model-input variables; inputting the model-input variables of test samples into the machine learning model to obtain predicted values, where the test samples further include labels; obtaining the residuals corresponding to the test samples according to the labels and predicted values of the test samples; sending the residuals to each of at least two second data parties, so that each second data party uses the second data it holds to fit the residuals by regression and obtains a regression evaluation index; and receiving the regression evaluation indexes returned by the at least two second data parties, and selecting some of the second data parties by comparing their regression evaluation indexes.

Description

Data selection method and device
Technical field
This disclosure relates to the field of big data technology, and in particular to a data selection method and device.
Background
With the rapid development of Internet technology, society as a whole has been pushed into the "big data" era. Whether or not people are willing, their personal data is passively collected and used by enterprises and individuals, often without their noticing. The networking and transparency of personal data has become an irresistible trend. At the same time, user data is a dangerous "Pandora's box": once data leaks, user privacy is violated. In recent years many user privacy leaks have occurred, and the protection of citizens' private data faces severe challenges. The sweeping changes brought by big data make it hard for an individual user to resist the risk of full exposure of personal privacy. In the face of frequent privacy leaks, the privacy protection problem needs an effective solution.
In practical business, the following scenario may arise: variable data from a third-party channel is needed to improve an existing model, and the third-party data is purchased only if it actually helps the modeling. The validity of the third-party data therefore needs to be judged in advance, without obtaining that data and without leaking the private data of our own users in the process.
Summary of the invention
In view of this, one or more embodiments of this specification provide a data selection method and device that protect the privacy of internal data while selecting part of the data from multiple external data sources.
Specifically, one or more embodiments of this specification are implemented through the following technical solutions:
In a first aspect, a data selection method is provided. The method is used to select some second data parties from at least two second data parties that provide data. The method is executed by a first data party, and the first data held by the first data party includes a training set and a test set of a machine learning model; the training set includes multiple training samples and the test set includes multiple test samples. The method includes:
training the machine learning model according to the model-input variables and labels in the training samples, where the training samples further include non-model-input variables that do not take part in training the machine learning model;
inputting the model-input variables of the test samples into the machine learning model to obtain predicted values, where the test samples further include labels, and a label represents the expected predicted value when the model-input variables of the test sample are input into the machine learning model;
obtaining the residuals corresponding to the test samples according to the labels of the test samples and the predicted values;
sending the residuals to each of the at least two second data parties, so that each second data party uses the second data it holds to fit the residuals by regression and obtains a regression evaluation index; and
receiving the regression evaluation indexes returned by the at least two second data parties, and selecting some of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
In a second aspect, a method for verifying data validity is provided. The method is executed by a second data party and includes:
receiving residuals sent by a first data party, where the residuals are obtained by the first data party from the labels of test samples and the predicted values produced by inputting the model-input variables of the test samples into a machine learning model; the data held by the first data party includes a training set and a test set, the training set includes multiple training samples, and the test set includes multiple test samples; the machine learning model is trained on the model-input variables and labels in the training samples; and the training samples further include non-model-input variables;
receiving sample identifiers sent by the first data party, and performing sample matching according to the sample identifiers to obtain second data for the regression fit;
fitting the residuals by regression on the second data to obtain a regression evaluation index; and
returning the regression evaluation index to the first data party, so that the first data party selects some of the second data parties by comparing the regression evaluation indexes of at least two candidate second data parties.
In a third aspect, a device for verifying data validity is provided. The device is used to select some second data parties from at least two second data parties that provide data. The device is applied to a first data party, and the first data held by the first data party includes a training set and a test set of a machine learning model; the training set includes multiple training samples and the test set includes multiple test samples. The device includes:
a model training module, configured to train the machine learning model according to the model-input variables and labels in the training samples, where the training samples further include non-model-input variables that do not take part in training the machine learning model;
a model prediction module, configured to input the model-input variables of the test samples into the machine learning model to obtain predicted values, where the test samples further include labels, and a label represents the expected predicted value when the model-input variables of the test sample are input into the machine learning model;
a residual calculation module, configured to obtain the residuals corresponding to the test samples according to the labels of the test samples and the predicted values;
a data sending module, configured to send the residuals to each of the at least two second data parties, so that each second data party uses the second data it holds to fit the residuals by regression and obtains a regression evaluation index; and
a verification processing module, configured to receive the regression evaluation indexes returned by the at least two second data parties, so as to select some of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
In a fourth aspect, a device for verifying data validity is provided. The device is applied to a second data party and includes:
a residual receiving module, configured to receive residuals sent by a first data party, where the residuals are obtained by the first data party from the labels of test samples and the predicted values produced by inputting the model-input variables of the test samples into a machine learning model; the data held by the first data party includes a training set and a test set, the training set includes multiple training samples, and the test set includes multiple test samples; the machine learning model is trained on the model-input variables and labels in the training samples; and the training samples further include non-model-input variables;
a data matching module, configured to receive sample identifiers sent by the first data party, and to perform sample matching according to the sample identifiers to obtain second data for the regression fit;
a regression processing module, configured to fit the residuals by regression on the second data to obtain a regression evaluation index; and
an index feedback module, configured to return the regression evaluation index to the first data party, so that the first data party selects some of the second data parties by comparing the regression evaluation indexes of at least two candidate second data parties.
In a fifth aspect, a device for verifying data validity is provided. The device includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor. When executing the program, the processor implements the following steps:
training a machine learning model according to the model-input variables and labels in training samples, where the training samples further include non-model-input variables that do not take part in training the machine learning model;
inputting the model-input variables of test samples into the machine learning model to obtain predicted values, where the test samples further include labels, and a label represents the expected predicted value when the model-input variables of the test sample are input into the machine learning model;
obtaining the residuals corresponding to the test samples according to the labels of the test samples and the predicted values;
sending the residuals to each of at least two second data parties, so that each second data party uses the second data it holds to fit the residuals by regression and obtains a regression evaluation index; and
receiving the regression evaluation indexes returned by the at least two second data parties, and selecting some of the second data parties by comparing the regression evaluation indexes of the at least two second data parties.
In a sixth aspect, a device for verifying data validity is provided. The device includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor. When executing the program, the processor implements the following steps:
receiving residuals sent by a first data party, where the residuals are obtained by the first data party from the labels of test samples and the predicted values produced by inputting the model-input variables of the test samples into a machine learning model; the data held by the first data party includes a training set and a test set, the training set includes multiple training samples, and the test set includes multiple test samples; the machine learning model is trained on the model-input variables and labels in the training samples; and the training samples further include non-model-input variables;
receiving sample identifiers sent by the first data party, and performing sample matching according to the sample identifiers to obtain second data for the regression fit;
fitting the residuals by regression on the second data to obtain a regression evaluation index; and
returning the regression evaluation index to the first data party, so that the first data party selects some of the second data parties by comparing the regression evaluation indexes of at least two candidate second data parties.
In the data selection method and device of one or more embodiments of this specification, the data parties exchange only modeling residuals and regression evaluation indexes, not users' private data, so no private user data is leaked during the interaction. In addition, some of the data parties can be selected from multiple data parties according to the regression evaluation indexes they return, so that part of the data is selected from multiple external data sources while the privacy of internal data is protected.
Brief description of the drawings
To describe the technical solutions in one or more embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in one or more embodiments of this specification; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a data set provided by one or more embodiments of this specification;
Fig. 2 shows a data selection method provided by one or more embodiments of this specification;
Fig. 3 shows a data selection device provided by one or more embodiments of this specification;
Fig. 4 shows another data selection device provided by one or more embodiments of this specification.
Detailed description
To help those skilled in the art better understand the technical solutions in one or more embodiments of this specification, those technical solutions are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments, not all of them. All other embodiments obtained by a person of ordinary skill in the art from one or more embodiments of this specification without creative effort shall fall within the protection scope of this application.
In practical business, the following scenario may arise: data party A holds its own data and wants to evaluate whether the data of data party B could improve its own modeling. For example, suppose data party A has trained a machine learning model M on its own data, but model testing shows that the model's predictions are unsatisfactory and deviate noticeably from the expected predicted values. If using the data of data party B in training and optimizing model M would improve model M, data party A can choose to purchase the data of data party B to assist the modeling.
This scenario raises a question: how to determine whether the data of data party B is valid. If the data of data party B helps the modeling of model M and improves its performance, the data of data party B is confirmed to be valid. How to verify the validity of data party B's data is described in at least one embodiment of this specification. The verification method also ensures that data party A does not obtain the data of data party B and that data party A does not leak the data it holds.
The method for verifying data validity is described below using data party A and data party B as an example; the method verifies whether the data of data party B is valid.
For example, data party A may be called the first data party and data party B the second data party.
First, as shown in Fig. 1, the data held by the first data party may be called the first data. The first data may include a training set and a test set of a machine learning model.
The training set is used to train the machine learning model. For example, in a training sample D_A(X_A, Y_A) of the training set, X_A is the variables and Y_A is the label. The label Y_A represents the expected predicted value when the variables X_A are passed through the machine learning model; in other words, the model is a supervised model.
The test set is used for prediction with the machine learning model. For example, a test sample D_B(X_B, Y_B) in the test set likewise includes variables and a label.
The variables of the training samples and test samples may include "model-input variables" and "non-model-input variables". The model-input variables in the training samples take part in model training, and the model-input variables in the test samples take part in model prediction; non-model-input variables take part in neither training nor prediction.
For example, to judge whether a user is a high-quality or low-quality user, the user may be represented by multiple variables such as age, address, length of service and annual income. Suppose a user is represented by eight variables: U{f1, f2, f3, f4, ..., f8} denotes a user U with the eight variables f1 to f8. When training the model, the first five variables f1 to f5 may be used first, while f6 to f8 temporarily do not take part in training.
The training set may then include multiple user samples, for example user U1, user U2 and user U3. Each user sample is a D_A(X_A, Y_A) consisting of variables and a label. The variables X_A may include the five variables f1 to f5 described above; every user sample has these same five variables, though their values may differ. The label Y_A indicates whether the user is a high-quality or low-quality user; for example, a high-quality user is denoted by 11 and a low-quality user by 00.
A test sample D_B(X_B, Y_B) used for prediction likewise includes variables and a label. During model prediction, the variables used from D_B are the user's five variables f1 to f5, and f6 to f8 do not take part; the label indicates whether the user is a high-quality or low-quality user. In prediction, the model-input variables of the test samples in the test set are input into the trained model, and the model output is compared with the label to judge whether they are consistent.
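The split between model-input and non-model-input variables can be illustrated with a small data structure. This is only a sketch: the class name `UserSample`, the constant names, the variable values, and the encoding of the label as 1 or 0 are assumptions made for illustration and do not come from the specification.

```python
from dataclasses import dataclass

# Variables that take part in training and prediction (f1 to f5).
MODEL_INPUT = ["f1", "f2", "f3", "f4", "f5"]
# Variables that are held but kept out of the model (f6 to f8).
NON_MODEL_INPUT = ["f6", "f7", "f8"]


@dataclass
class UserSample:
    user_id: str
    variables: dict  # variable name -> value, covering f1 to f8
    label: int       # assumed encoding: 1 = high-quality, 0 = low-quality

    def model_inputs(self):
        """Return only the values that are fed to the model."""
        return [self.variables[name] for name in MODEL_INPUT]


# Hypothetical sample with made-up values for f1 to f8.
u1 = UserSample("U1", {f"f{k}": float(k) for k in range(1, 9)}, label=1)
inputs = u1.model_inputs()
```

Keeping the full variable dictionary on the sample while exposing only `model_inputs()` to the model mirrors the description: f6 to f8 exist in the first data but never reach training or prediction.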
Table 1 below illustrates the training samples, the test samples, and the model-input and non-model-input variables they contain. As shown in Table 1, the samples U1, U2 and U3 take part in model training and may be called the training set. During training, only the variables f1 to f5 take part; they are called model-input variables. The variables f6 to f8 temporarily do not take part in training and are called non-model-input variables. Y_A is the label. Likewise, the samples U7 and U8 in the test set are used for prediction: the model-input variables of these test samples are input into the trained model to obtain the model output. When U7 and U8 are input into the model, again only f1 to f5 take part, and f6 to f8 do not. Table 1 is only an example; actual implementations are not limited to it, and the variables included in each sample may change.
Table 1: First data D_A(X_A, Y_A)
In the above example, when the model is tested with the test samples in Table 1, the model performance is found to be unsatisfactory. Suppose at least two second data parties, such as data party B, can provide data; these data parties B may hold different variables, or different values of the same variables. Some of the data parties B can be selected to assist in optimizing the model. For example, the single best data party B may be selected from three data parties B, or two or more data parties B may be selected, depending on actual business needs. The data selection method of at least one embodiment of this specification can be used to assess which of the three data parties B is better.
Fig. 2 illustrates the data selection method provided by at least one embodiment of this specification. The method may include the following processing; a specific implementation does not limit the execution order of the steps:
In step 200, the machine learning model is trained according to the training samples.
This step may train the model with the model-input variables and labels in the training samples. For example, the model may be trained on the data of U1, U2 and U3 in Table 1, where U1, U2 and U3 are user samples; each user sample may include eight variables, of which the five variables f1 to f5 are used in training.
In step 202, the model-input variables of the test samples are input into the machine learning model to obtain predicted values.
For example, the test samples U7 and U8 in Table 1 do not take part in model training but can be used to test the model. The five variables f1 to f5 of a test sample are input into the model trained in step 200, and the model output, that is, the predicted value, is obtained. The label of a test sample represents the expected predicted value when the model-input variables of the test sample are input into the machine learning model.
In step 204, the residuals corresponding to the test samples are obtained according to the predicted values and the labels of the test samples. For example, the labels of U7 and U8 are Y_A7 and Y_A8 in Table 1, and the residual may be the difference between the predicted value and the label. The residual represents the gap between the actual output and the desired output of the model, and can therefore be used to measure the prediction performance of the model.
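The residual calculation of step 204 can be sketched as follows. The function name `compute_residuals`, the sign convention (label minus predicted value) and the sample numbers are assumptions; the specification only says that the residual may be the difference between the predicted value and the label.

```python
def compute_residuals(labels, predictions):
    """Return one residual per test sample, here label minus predicted value."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return [y - p for y, p in zip(labels, predictions)]


# Hypothetical labels and model outputs for the test samples U7 and U8.
labels = [1.0, 0.0]
predictions = [0.75, 0.25]
residuals = compute_residuals(labels, predictions)
```

Whichever sign convention is chosen, it only needs to be applied consistently, since the second data parties fit the residuals as given.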
In step 206, data party A sends the residuals to each of the at least two candidate second data parties. In this step, data party A may send the residuals corresponding to its test samples to data party B, together with the sample identifiers corresponding to the training samples and test samples. For example, the sample identifiers may include the user IDs of U1 to U3.
For example, the user IDs may be encrypted with an algorithm such as MD5 to avoid leaking user information. The transmitted residual only measures the gap from the original label, which also helps protect user privacy.
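A minimal sketch of how the identifiers and residuals might be packaged before sending is given here, using MD5 from Python's standard `hashlib` module. The function names and the payload layout are hypothetical. Note that MD5 is strictly a hash function rather than an encryption algorithm; it is used here only because the text names it.

```python
import hashlib


def hash_user_id(user_id):
    """Return the MD5 hex digest of a user ID (32 hexadecimal characters)."""
    return hashlib.md5(user_id.encode("utf-8")).hexdigest()


def build_payload(user_ids, residuals):
    """Pair each hashed sample identifier with its residual (assumed layout)."""
    return [
        {"sample_id": hash_user_id(uid), "residual": r}
        for uid, r in zip(user_ids, residuals)
    ]


# Hypothetical test samples U7 and U8 with their residuals.
payload = build_payload(["U7", "U8"], [0.25, -0.25])
```

For matching to work, every second data party would have to apply the same hashing to its own user IDs.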
Note that steps 206 to 214 below are described with data party A interacting with two data parties B as an example; an actual implementation may involve more data parties B. In Fig. 2, sending the sample identifiers and residuals to the two data parties B, and the sample matching and regression fitting that each of the two data parties B performs, use the same step numbers (for example, both are step 206); it should be understood that these are operations that each of the two data parties B performs separately.
In step 208, data party B performs sample matching according to the sample identifiers to obtain the second data.
For example, data party B may perform sample matching according to the user IDs of U1 and U3 to obtain the second data for the subsequent regression fit. For example, as in Table 2, the data of U1 and U3 held by data party B is obtained, including the variables f9 to f11.
The second data may include data for the sample IDs corresponding to the training samples and test samples of data party A. The user samples corresponding to the sample identifiers of data party A's training samples may also be called training samples in data party B, and the user samples corresponding to the sample identifiers of data party A's test samples may be called test samples in data party B.
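The sample matching of step 208 amounts to a lookup of the received identifiers in the records that data party B holds. The sketch below assumes that data party B stores its records in a dictionary keyed by the same (hashed) identifier; the function name, the placeholder identifiers and the values of f9 to f11 are made up for illustration.

```python
def match_samples(sample_ids, party_b_records):
    """Return the records of data party B whose identifier was requested."""
    matched = {}
    for sid in sample_ids:
        record = party_b_records.get(sid)
        if record is not None:
            matched[sid] = record
    return matched


# Hypothetical records held by data party B (variables f9 to f11).
party_b_records = {
    "id_u1": {"f9": 3.0, "f10": 1.0, "f11": 0.0},
    "id_u3": {"f9": 5.0, "f10": 2.0, "f11": 1.0},
    "id_u9": {"f9": 4.0, "f10": 0.0, "f11": 1.0},
}
requested = ["id_u1", "id_u2", "id_u3"]
second_data = match_samples(requested, party_b_records)
```

Only the matched records become the second data; identifiers that data party B does not hold, such as `id_u2` here, are simply skipped.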
In step 210, data party B fits the residuals by regression on the variables in the second data and obtains a regression evaluation index.
For example, the test samples include multiple user samples; each sample corresponds to one residual, so multiple samples yield multiple residuals. The variables in data party B's training samples can be used to fit these residuals by regression. The goal of fitting is to obtain from the training samples a polynomial function that fits the residuals well.
For example, suppose the residuals are y_1, y_2, …, y_n, where n is a natural number.
The variables in each training sample may include x_1, x_2, …, x_i, where i is a natural number.
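$$
\begin{aligned}
y_1 &= a_1 x_{11} + a_2 x_{12} + \dots + a_i x_{1i} && (1)\\
y_2 &= a_1 x_{21} + a_2 x_{22} + \dots + a_i x_{2i} && (2)\\
&\;\;\vdots\\
y_n &= a_1 x_{n1} + a_2 x_{n2} + \dots + a_i x_{ni} && (n)
\end{aligned}
$$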
Here the residuals y_1 to y_n are known, and the values of the variables in each training sample are also known. For example, {x_11, x_12, …, x_1i} in formula (1) are the values of the variables in one training sample, and {x_21, x_22, …, x_2i} in formula (2) are the values of the variables in another training sample. From formulas (1) to (n), the values of the coefficients a_1, a_2, …, a_i can be obtained, which finally gives the regression equation y = a_1*x_1 + a_2*x_2 + … + a_i*x_i.
The regression equation yields the variable importance weight of each variable: the values of a_1, a_2, …, a_i are the importance weights of the corresponding variables.
Note that the example above uses linear regression, but the method is not limited to it. Other forms of regression, such as polynomial regression, may also be used.
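One way to obtain the coefficients a_1 to a_i from formulas (1) to (n) is ordinary least squares. The sketch below solves the normal equations with a small Gaussian elimination routine in plain Python; the choice of least squares, the absence of an intercept term, the function names and the sample numbers are all assumptions made for illustration, not the method prescribed by the specification.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; variables may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_residuals(rows, residuals):
    """Least-squares coefficients for y = a_1*x_1 + ... + a_i*x_i (no intercept)."""
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * y for r, y in zip(rows, residuals)) for p in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical variable values held by data party B, one row per sample.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
# Residuals constructed so that the exact coefficients are 0.5 and -0.25.
residuals = [0.5, -0.25, 0.25, 0.75]
coefficients = fit_residuals(rows, residuals)
```

The returned list plays the role of a_1, a_2, …, a_i, so its entries can be read as the variable importance weights described above.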
The regression evaluation index of this regression can also be calculated. Several regression evaluation indexes are possible, for example the mean squared error, the root mean squared error (RMSE), or the mean absolute error. The regression evaluation index measures how well the regression fits.
For example, taking the mean squared error as the regression evaluation index:
$$
\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 \qquad (5)
$$
In formula (5), m is the number of test samples, y_i is the true value and \hat{y}_i is the predicted value; the difference between the true value and the predicted value is squared, summed over the samples and averaged. Each test sample corresponds to one residual. For a given test sample, the residual of that sample is the true value, and the residual value obtained by substituting the variable values of that test sample into the regression equation above is the predicted value. Applying formula (5) to the true and predicted values of all test samples gives the mean squared error as the regression evaluation index.
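The three evaluation indexes named above have standard definitions, sketched here in plain Python. The function names and the sample numbers are illustrative assumptions; the "true" values are the residuals received from data party A and the "predicted" values are the residuals reproduced by the regression equation.

```python
import math


def mean_squared_error(true_values, predicted_values):
    """Average of the squared differences between true and predicted values."""
    m = len(true_values)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(true_values, predicted_values)) / m


def root_mean_squared_error(true_values, predicted_values):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(true_values, predicted_values))


def mean_absolute_error(true_values, predicted_values):
    """Average of the absolute differences between true and predicted values."""
    m = len(true_values)
    return sum(abs(y - y_hat) for y, y_hat in zip(true_values, predicted_values)) / m


# Hypothetical residuals from data party A and values fitted by data party B.
true_residuals = [0.5, -0.5, 1.0, -1.0]
fitted_residuals = [0.0, 0.0, 1.0, -1.0]
mse = mean_squared_error(true_residuals, fitted_residuals)
rmse = root_mean_squared_error(true_residuals, fitted_residuals)
mae = mean_absolute_error(true_residuals, fitted_residuals)
```

For all three indexes a smaller value means the second data explains more of the residual, which is what makes the index comparable across data parties.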
In step 212, the second data party returns the regression evaluation index to the first data party. In this step, each of the at least two data parties B returns the regression evaluation index it has calculated to data party A.
Data party B may also obtain at least one of the following parameters: the sample match rate and the variable missing rate of the second data. The sample match rate is the proportion of the data requested by data party A that data party B can find. For example, if data party A sends eight sample identifiers to data party B, it is asking for the user samples of eight users; if data party B has only six of them, the sample match rate is 6/8*100% = 75%. The variable missing rate covers the case where data party B has a variable requested by data party A but some of its values are missing. For example, data party B has data for 10 user samples, all of which include variable f10, but the value of f10 is empty for two of the users; the variable is missing for those users, and the variable missing rate is 20%.
Besides returning the regression evaluation index to data party A, data party B may also return at least one of the sample match rate and the variable missing rate, so that the first data party can combine the regression evaluation index, the sample match rate and the variable missing rate when selecting data parties.
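The two auxiliary parameters can be computed with simple counting. The sketch below reproduces the two numeric examples from the text (6 of 8 identifiers matched, 2 of 10 values of f10 missing); the function names, the placeholder identifiers and the use of `None` to mark a missing value are assumptions.

```python
def sample_match_rate(requested_ids, available_ids):
    """Fraction of the requested sample identifiers that data party B holds."""
    requested = list(requested_ids)
    found = sum(1 for sid in requested if sid in available_ids)
    return found / len(requested)


def variable_missing_rate(samples, variable):
    """Fraction of samples whose value for the given variable is missing."""
    missing = sum(1 for sample in samples if sample.get(variable) is None)
    return missing / len(samples)


# Eight requested identifiers, of which data party B holds six.
requested_ids = [f"id_{k}" for k in range(1, 9)]
available_ids = {f"id_{k}" for k in range(1, 7)}
match_rate = sample_match_rate(requested_ids, available_ids)

# Ten user samples, two of which have an empty value for f10.
samples = [{"f10": 1.0} for _ in range(8)] + [{"f10": None}, {"f10": None}]
missing_rate = variable_missing_rate(samples, "f10")
```

Both values are returned as fractions; multiplying by 100 gives the percentages used in the text.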
In step 214, the first data party determines which second data parties to select by comparing the regression evaluation indexes of the multiple second data parties.
In this step, data party A may rely on the comparison of regression evaluation indexes alone. For example, the regression evaluation indexes returned by two data parties B are compared, and the data party B with the better index is selected. Multiple data parties B with better regression evaluation indexes may of course also be selected.
Alternatively, the sample match rate, the variable missing rate and the regression evaluation index may be considered together. For example, the second data parties whose sample match rate is above a preset threshold are selected first, and those with a lower match rate are discarded. Among the second data parties whose match rate exceeds the threshold, the data parties B with better indexes are then selected; for example, the regression evaluation indexes may be sorted and the top-ranked data parties B chosen. In other examples, further indexes such as the variable missing rate may also be taken into account. For example, thresholds may be set for the sample match rate and the variable missing rate, and a second data party that fails a threshold is not selected regardless of its regression evaluation index.
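The combined selection rule described above (filter by thresholds, then rank by the regression evaluation index) can be sketched as follows. The function name, the report layout, the threshold values and the assumption that the index is a mean squared error where smaller is better are all hypothetical.

```python
def select_parties(reports, min_match_rate=0.7, max_missing_rate=0.3, top_k=1):
    """Filter second data parties by thresholds, then rank by ascending MSE."""
    eligible = [
        r for r in reports
        if r["match_rate"] >= min_match_rate and r["missing_rate"] <= max_missing_rate
    ]
    ranked = sorted(eligible, key=lambda r: r["mse"])
    return [r["party"] for r in ranked[:top_k]]


# Hypothetical values returned by three second data parties.
reports = [
    {"party": "B1", "match_rate": 0.9, "missing_rate": 0.1, "mse": 0.20},
    {"party": "B2", "match_rate": 0.5, "missing_rate": 0.0, "mse": 0.05},
    {"party": "B3", "match_rate": 0.8, "missing_rate": 0.2, "mse": 0.10},
]
best = select_parties(reports)
best_two = select_parties(reports, top_k=2)
```

In this made-up data, B2 has the lowest error but is excluded by its low match rate, which illustrates why the thresholds are applied before ranking.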
Various regression algorithms may be used for the above regression, but it must be ensured that every data side B uses the same regression algorithm and the same regression evaluation index, so that differing choices among the data sides B do not undermine the fairness of the subsequent comparison.
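One way to enforce this uniformity is for data side A to fix the algorithm and index up front and reject any report produced under a different setting. The sketch below assumes a hypothetical `RegressionSpec` record that each data side B echoes back with its result; nothing in the specification mandates this mechanism.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RegressionSpec:
    algorithm: str  # hypothetical label, e.g. "linear_regression"
    metric: str     # hypothetical label, e.g. "r2" or "rmse"


def comparable(reported_specs: Iterable[RegressionSpec], agreed: RegressionSpec) -> bool:
    """True only if every data side B used the agreed algorithm and index."""
    return all(spec == agreed for spec in reported_specs)


agreed = RegressionSpec(algorithm="linear_regression", metric="r2")
same = comparable([agreed, RegressionSpec("linear_regression", "r2")], agreed)
mixed = comparable([agreed, RegressionSpec("linear_regression", "rmse")], agreed)
```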
In addition, the judgement of the data validity of this step, can be computer automatic execution, it is also possible to manually perform, For example, data side B by sample matches rate, sample miss rate and is being returned after evaluation index returns to data side A, by data side A Administrative staff judged according to these indexs returned, to carry out the selection of data side B.
The residual error of modeling is only sent to by the data selecting method of this specification one or more embodiment, data side A Multiple data side B, multiple data side B also will only return evaluation index and return to data side A, and the interaction of data side is that modeling is residual Difference and recurrence evaluation index, and the private data of non-user, therefore any of user can not be revealed in both sides' interactive process Private data.Also, it can also be according to the recurrence evaluation index that multiple data side B are returned by selected section in multiple data side B Data protect internal data privacy while realizing the selected section data in by multiple external datas.
Fig. 3 shows a data selection apparatus provided by at least one embodiment of this specification. The apparatus is used to select some of the second data sides from at least two second data sides that provide data. The apparatus is applied to the first data side, and the first data possessed by the first data side includes a training set and a test set of a machine learning model; the training set includes multiple training samples, and the test set includes multiple test samples. As shown in Fig. 3, the apparatus may include: a model training module 31, a model prediction module 32, a residual computation module 33, a data sending module 34 and a verification processing module 35.
Model training module 31, configured to train the machine learning model according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables that do not take part in training the machine learning model.
Model prediction module 32, configured to input the modeled variables of the test samples into the machine learning model to obtain predicted values; the test samples further include labels, and a label represents the expected predicted value when the modeled variables of the test sample are input into the machine learning model.
Residual computation module 33, configured to obtain the residuals corresponding to the test samples according to the labels of the test samples and the predicted values.
Data sending module 34, configured to send the residuals to each of the at least two second data sides, so that each second data side performs a regression fit of the residuals using the second data it possesses and obtains a regression evaluation index.
Verification processing module 35, configured to receive the regression evaluation indexes returned by the at least two second data sides, and to select some of the second data sides by comparing the regression evaluation indexes of the at least two second data sides.
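The first-data-side flow of modules 31 to 34 can be sketched as follows. The one-variable least-squares model, the residual convention (label minus prediction), the sample identifiers and the candidate names `B1`/`B2` are all assumptions for illustration; the specification allows any machine learning model.

```python
from statistics import mean
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept (stand-in for the model)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Module 31: train on the modeled variable and the labels of the training set.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(train_x, train_y)

# Modules 32 and 33: predict on the test set and take label minus prediction.
test_samples = [
    {"id": "t1", "x": 4.0, "label": 9.5},
    {"id": "t2", "x": 5.0, "label": 10.0},
]
residuals: Dict[str, float] = {
    s["id"]: s["label"] - (slope * s["x"] + intercept) for s in test_samples
}

# Module 34: every candidate receives the same residuals keyed by sample identifier;
# no modeled variables or labels leave the first data side.
candidates = ["B1", "B2"]
outbox = {party: dict(residuals) for party in candidates}
```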
In one example, verification processing module 35 is further configured to receive the sample match rates returned by the second data sides and, among the second data sides whose sample match rate is higher than a preset threshold, to select, according to the regression evaluation index, the second data possessed by some of the at least two second data sides.
Fig. 4 shows another data selection apparatus provided by at least one embodiment of this specification. The apparatus is applied to a second data side. As shown in Fig. 4, the apparatus may include: a residual receiving module 41, a data matching module 42, a regression processing module 43 and an index feedback module 44.
Residual receiving module 41, configured to receive the residuals sent by the first data side, where the residuals are obtained by the first data side according to the labels and the predicted values produced by inputting the modeled variables of the test samples into a machine learning model; the data possessed by the first data side includes a training set and a test set, the training set includes multiple training samples, and the test set includes multiple test samples; the machine learning model is trained according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables.
Data matching module 42, configured to receive the sample identifiers corresponding to the training samples, and to perform sample matching according to the sample identifiers to obtain the second data that takes part in the regression fit.
Regression processing module 43, configured to perform a regression fit of the residuals based on the second data to obtain a regression evaluation index.
Index feedback module 44, configured to return the regression evaluation index to the first data side, so that the first data side selects some of the second data sides by comparing the regression evaluation indexes of at least two candidate second data sides.
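The second-data-side flow of modules 41 to 44 can be sketched in the same spirit. The choice of simple linear regression and of R² as the regression evaluation index, as well as the variable values and identifiers, are assumptions for illustration only.

```python
from statistics import mean
from typing import Dict, List


def r_squared(xs: List[float], ys: List[float]) -> float:
    """Fit ys on xs by least squares and return the coefficient of determination."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("need variation in both the variable and the residuals")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot


# Module 41: residuals received from the first data side, keyed by sample identifier.
received: Dict[str, float] = {"t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 5.0}

# Module 42: match on identifiers against this side's own variable (hypothetical values).
own_variable: Dict[str, float] = {"t1": 1.0, "t2": 2.0, "t3": 4.0, "t9": 7.0}
matched_ids = sorted(set(received) & set(own_variable))
match_rate = len(matched_ids) / len(received)

# Module 43: regress the residuals on the matched second data.
index = r_squared(
    [own_variable[i] for i in matched_ids],
    [received[i] for i in matched_ids],
)

# Module 44: only aggregate figures go back to the first data side.
reply = {"regression_evaluation_index": index, "sample_match_rate": match_rate}
```

A high index suggests that this side's variable explains part of what the first data side's model failed to capture, which is the property the selection relies on.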
An embodiment of this specification further provides a data validity verification device. The device is applied to the first data side and includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor implements the following steps:
training a machine learning model according to the modeled variables and labels in the training samples, where the training samples further include non-modeled variables that do not take part in training the machine learning model;
inputting the modeled variables of the test samples into the machine learning model to obtain predicted values, where the test samples further include labels, and a label represents the expected predicted value when the modeled variables of the test sample are input into the machine learning model;
obtaining the residuals corresponding to the test samples according to the labels of the test samples and the predicted values;
sending the residuals to each of the at least two second data sides, so that each second data side performs a regression fit of the residuals using the second data it possesses and obtains a regression evaluation index;
receiving the regression evaluation indexes returned by the at least two second data sides, and selecting some of the second data sides by comparing the regression evaluation indexes of the at least two second data sides.
An embodiment of this specification further provides a data validity verification device. The device is applied to a second data side and includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor implements the following steps:
receiving the residuals sent by the first data side, where the residuals are obtained by the first data side according to the labels of the test samples and the predicted values produced by inputting the modeled variables of the test samples into a machine learning model; the data possessed by the first data side includes a training set and a test set, the training set includes multiple training samples, and the test set includes multiple test samples; the machine learning model is trained according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables;
receiving the sample identifiers corresponding to the training samples, and performing sample matching according to the sample identifiers to obtain the second data that takes part in the regression fit;
performing a regression fit of the residuals based on the second data to obtain a regression evaluation index;
returning the regression evaluation index to the first data side, so that the first data side selects some of the second data sides by comparing the regression evaluation indexes of at least two candidate second data sides.
The steps in the processes shown in the above method embodiments are not limited to the order given in the flowcharts. In addition, each step may be implemented in software, hardware, or a combination of the two. For example, a person skilled in the art may implement a step as software code, that is, as computer-executable instructions that realize the logic function corresponding to the step. When implemented in software, the executable instructions may be stored in a memory and executed by a processor of the device.
The apparatuses or modules described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail sending and receiving device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described as being divided into various modules by function. Of course, when one or more embodiments of this specification are implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware.
Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical memory) that contain computer-usable program code.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
It should also be noted that the terms "include", "comprise" and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity or device that includes the element.
One or more embodiments of this specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. One or more embodiments of this specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the data acquisition device or data processing device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The foregoing is merely the preferred embodiments of one or more embodiments of this specification and is not intended to limit this disclosure. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of this disclosure shall fall within its scope of protection.

Claims (10)

1. A data selection method, used to select some second data sides from at least two second data sides that provide data; the method is executed by a first data side, and the first data possessed by the first data side includes: a training set and a test set of a machine learning model; the training set includes multiple training samples, and the test set includes multiple test samples;
the method includes:
training the machine learning model according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables that do not take part in training the machine learning model;
inputting the modeled variables of the test samples into the machine learning model to obtain predicted values; the test samples further include labels, and a label represents the expected predicted value when the modeled variables of the test sample are input into the machine learning model;
obtaining the residuals corresponding to the test samples according to the labels of the test samples and the predicted values;
sending the residuals to each of the at least two second data sides, so that each second data side performs a regression fit of the residuals using the second data it possesses and obtains a regression evaluation index;
receiving the regression evaluation indexes returned by the at least two second data sides, and selecting some of the second data sides by comparing the regression evaluation indexes of the at least two second data sides.
2. The method according to claim 1, further including:
sending the sample identifiers of the training samples and the test samples to the second data side, so that the second data side performs sample matching according to the sample identifiers to obtain the second data.
3. The method according to claim 1, further including:
receiving the sample match rate returned by the second data side;
among the second data sides whose sample match rate is higher than a preset threshold, selecting, according to the regression evaluation index, the second data possessed by some of the at least two second data sides.
4. A data selection method, executed by a second data side, including:
receiving the residuals sent by a first data side, where the residuals are obtained by the first data side according to the labels of the test samples and the predicted values produced by inputting the modeled variables of the test samples into a machine learning model; the data possessed by the first data side includes: a training set and a test set, the training set includes multiple training samples, and the test set includes multiple test samples; the machine learning model is trained according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables;
receiving the sample identifiers sent by the first data side, and performing sample matching according to the sample identifiers to obtain the second data that takes part in the regression fit;
performing a regression fit of the residuals based on the second data to obtain a regression evaluation index;
returning the regression evaluation index to the first data side, so that the first data side selects some of the second data sides by comparing the regression evaluation indexes of at least two candidate second data sides.
5. The method according to claim 4, further including:
obtaining at least one of the following parameters of the second data: a sample match rate and a variable missing rate;
returning the parameter to the first data side, so that the first data side selects some of the second data sides by combining the parameter and the regression evaluation index.
6. A data selection apparatus, used to select some second data sides from at least two second data sides that provide data; the apparatus is applied to a first data side, and the first data possessed by the first data side includes: a training set and a test set of a machine learning model; the training set includes multiple training samples, and the test set includes multiple test samples; the apparatus includes:
a model training module, configured to train the machine learning model according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables that do not take part in training the machine learning model;
a model prediction module, configured to input the modeled variables of the test samples into the machine learning model to obtain predicted values; the test samples further include labels, and a label represents the expected predicted value when the modeled variables of the test sample are input into the machine learning model;
a residual computation module, configured to obtain the residuals corresponding to the test samples according to the labels of the test samples and the predicted values;
a data sending module, configured to send the residuals to each of the at least two second data sides, so that each second data side performs a regression fit of the residuals using the second data it possesses and obtains a regression evaluation index;
a verification processing module, configured to receive the regression evaluation indexes returned by the at least two second data sides, and to select some of the second data sides by comparing the regression evaluation indexes of the at least two second data sides.
7. The apparatus according to claim 6, wherein
the verification processing module is further configured to receive the sample match rate returned by the second data side and, among the second data sides whose sample match rate is higher than a preset threshold, to select, according to the regression evaluation index, the second data possessed by some of the at least two second data sides.
8. A data selection apparatus, applied to a second data side, the apparatus including:
a residual receiving module, configured to receive the residuals sent by a first data side, where the residuals are obtained by the first data side according to the labels and the predicted values produced by inputting the modeled variables of the test samples into a machine learning model; the data possessed by the first data side includes: a training set and a test set, the training set includes multiple training samples, and the test set includes multiple test samples; the machine learning model is trained according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables;
a data matching module, configured to receive the sample identifiers sent by the first data side, and to perform sample matching according to the sample identifiers to obtain the second data that takes part in the regression fit;
a regression processing module, configured to perform a regression fit of the residuals based on the second data to obtain a regression evaluation index;
an index feedback module, configured to return the regression evaluation index to the first data side, so that the first data side selects some of the second data sides by comparing the regression evaluation indexes of at least two candidate second data sides.
9. A data selection device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the program:
training a machine learning model according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables that do not take part in training the machine learning model;
inputting the modeled variables of the test samples into the machine learning model to obtain predicted values; the test samples further include labels, and a label represents the expected predicted value when the modeled variables of the test sample are input into the machine learning model;
obtaining the residuals corresponding to the test samples according to the labels of the test samples and the predicted values;
sending the residuals to each of the at least two second data sides, so that each second data side performs a regression fit of the residuals using the second data it possesses and obtains a regression evaluation index;
receiving the regression evaluation indexes returned by the at least two second data sides, and selecting some of the second data sides by comparing the regression evaluation indexes of the at least two second data sides.
10. A data selection device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the program:
receiving the residuals sent by a first data side, where the residuals are obtained by the first data side according to the labels of the test samples and the predicted values produced by inputting the modeled variables of the test samples into a machine learning model; the data possessed by the first data side includes: a training set and a test set, the training set includes multiple training samples, and the test set includes multiple test samples; the machine learning model is trained according to the modeled variables and labels in the training samples; the training samples further include non-modeled variables;
receiving the sample identifiers sent by the first data side, and performing sample matching according to the sample identifiers to obtain the second data that takes part in the regression fit;
performing a regression fit of the residuals based on the second data to obtain a regression evaluation index;
returning the regression evaluation index to the first data side, so that the first data side selects some of the second data sides by comparing the regression evaluation indexes of at least two candidate second data sides.
CN201811286327.4A 2018-10-31 2018-10-31 Data selection method and device Active CN109299161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811286327.4A CN109299161B (en) 2018-10-31 2018-10-31 Data selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811286327.4A CN109299161B (en) 2018-10-31 2018-10-31 Data selection method and device

Publications (2)

Publication Number Publication Date
CN109299161A true CN109299161A (en) 2019-02-01
CN109299161B CN109299161B (en) 2022-01-28

Family

ID=65145327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811286327.4A Active CN109299161B (en) 2018-10-31 2018-10-31 Data selection method and device

Country Status (1)

Country Link
CN (1) CN109299161B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719147A (en) * 2009-11-23 2010-06-02 合肥兆尹信息科技有限责任公司 Rochester model-naive Bayesian model-based data classification system
US9501749B1 (en) * 2012-03-14 2016-11-22 The Mathworks, Inc. Classification and non-parametric regression framework with reduction of trained models
CN108280462A (en) * 2017-12-11 2018-07-13 北京三快在线科技有限公司 A kind of model training method and device, electronic equipment
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning
CN108596757A (en) * 2018-04-23 2018-09-28 大连火眼征信管理有限公司 A kind of personal credit file method and system of intelligences combination

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程求江 等: ""基于用户行为模型的客流量分析与预测"", 《计算机系统应用》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111612167A (en) * 2019-02-26 2020-09-01 京东数字科技控股有限公司 Joint training method, device, equipment and storage medium of machine learning model
CN111612167B (en) * 2019-02-26 2024-04-16 京东科技控股股份有限公司 Combined training method, device, equipment and storage medium of machine learning model
CN109961098A (en) * 2019-03-22 2019-07-02 中国科学技术大学 A kind of training data selection method of machine learning
CN109961098B (en) * 2019-03-22 2022-03-01 中国科学技术大学 Training data selection method for machine learning
CN112149834A (en) * 2019-06-28 2020-12-29 北京百度网讯科技有限公司 Model training method, device, equipment and medium
CN112149834B (en) * 2019-06-28 2023-11-07 北京百度网讯科技有限公司 Model training method, device, equipment and medium
CN112183757A (en) * 2019-07-04 2021-01-05 创新先进技术有限公司 Model training method, device and system
CN112183757B (en) * 2019-07-04 2023-10-27 创新先进技术有限公司 Model training method, device and system
US11295242B2 (en) 2019-11-13 2022-04-05 International Business Machines Corporation Automated data and label creation for supervised machine learning regression testing
CN110968886A (en) * 2019-12-20 2020-04-07 支付宝(杭州)信息技术有限公司 Method and system for screening training samples of machine learning model
CN111401483A (en) * 2020-05-15 2020-07-10 支付宝(杭州)信息技术有限公司 Sample data processing method and device and multi-party model training system

Also Published As

Publication number Publication date
CN109299161B (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN109299161A (en) A kind of data selecting method and device
Jabbari et al. What is DevOps? A systematic mapping study on definitions and practices
Luna et al. Quantitative reasoning about cloud security using service level agreements
Babaioff et al. Combinatorial agency
Ling et al. Selection of model in developing information security criteria on smart grid security system
EP2201491A1 (en) Apparatus for reconfiguration of a technical system based on security analysis and a corresponding technical decision support system and computer program product
CN109508558A (en) A kind of verification method and device of data validity
US20220058266A1 (en) Methods and systems of a cybersecurity scoring model
Wei et al. Two plane camera calibration: a unified model
Maghrabi et al. Improved software vulnerability patching techniques using CVSS and game theory
Molka et al. Conformance checking for BPMN-based process models
Lin et al. Project reliability interval for a stochastic project network subject to time and budget constraints
Zalazar et al. Analyzing requirements engineering for cloud computing
Alturkistani et al. A review of security risk assessment methods in cloud computing
Jansen Research directions in security metrics
Ramireddy et al. Privacy and Security Practices in the Arena of Cloud Computing-A Research in Progress
Lauster et al. Literature review linking blockchain and business process management
Khrisna Risk management framework with COBIT 5 and risk management framework for cloud computing integration
Younis et al. Towards the Impact of Security Vunnerabilities in Software Design: A Complex Network-Based Approach
Baars et al. Analysing the Security Risks of Cloud Adoption Using the SeCA Model: A Case Study.
Kang et al. Process Mining-based Understanding and Analysis of Volvo IT's Incident and Problem Management Processes.
Hidayat et al. Process model extension using heuristics miner:(Case study: Incident management of Volvo IT Belgium)
Eftekhar et al. Towards the development of a widely accepted cloud trust model
Chorppath et al. Risk management for it security: When theory meets practice
CN109657482B (en) Data validity verification method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40003742

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant