CN110533489A - Sample acquisition method, device, equipment, and storage medium applied to model training


Info

Publication number
CN110533489A
Authority
CN
China
Prior art keywords
sample
training
rate
sample set
data source
Prior art date
Legal status
Granted
Application number
CN201910851779.0A
Other languages
Chinese (zh)
Other versions
CN110533489B (en)
Inventor
王星雅
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910851779.0A
Publication of CN110533489A
Application granted
Publication of CN110533489B
Legal status: Active
Anticipated expiration

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/70: Game security or game management aspects
    • A63F13/79: Game security or game management aspects involving player-related data, e.g. identities, accounts, preferences or play histories
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241: Advertisements
    • G06Q30/0277: Online advertisement

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Multimedia (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of this application disclose a sample acquisition method and device applied to model training. The method comprises: performing random sampling on a data source to obtain a training sample set; training a model according to the training sample set to obtain a prediction error rate of the training sample set; redetermining a sampling rate of the data source according to the prediction error rate, and sampling the data source according to the redetermined sampling rate to obtain an updated training sample set; iteratively executing the steps of training the model according to the updated training sample set, redetermining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redetermined sampling rate; and taking the updated training sample set as a target sample set. The technical solution of the embodiments of this application can select from the data source a target sample set whose feature distribution is close to that of the prediction sample set, the target sample set being used to train the model.

Description

Sample acquisition method, device, equipment, and storage medium applied to model training
Technical field
This application relates to the technical field of data processing, and in particular to a sample acquisition method, device, equipment, and computer-readable storage medium applied to model training.
Background technique
With the rapid development of Internet information technology, Internet applications are becoming increasingly numerous. In order to widen the audience of an Internet application and enable it to bring higher revenue to an enterprise, the application needs to be promoted to more users.
At the initial promotion stage of an Internet application such as a game application, in the absence of initial players, promotion modeling must rely either on feedback data from the first batch of players acquired through promotion, used as samples, or on player data from game applications of a similar type, used as samples. Since promotion modeling for a game application usually requires a large amount of player data, the former approach yields a well-performing model, but the cost of acquiring the samples for promotion modeling is high. The latter approach can easily obtain a large number of samples from game applications of a similar type, but because the feature distribution of samples from those applications is not fully identical to that of samples from the current game application, the promotion model is prone to deviations in actual prediction, resulting in poor prediction performance.
Therefore, at the initial promotion stage of an Internet application, how to guarantee the prediction performance of the promotion model while reducing the sample acquisition cost of promotion modeling is an urgent problem to be solved in the prior art.
Summary of the invention
In order to solve the above technical problem, embodiments of this application provide a sample acquisition method, device, equipment, and computer-readable storage medium applied to model training. The cost of acquiring samples according to the embodiments of this application is relatively low, and a model obtained by promotion modeling with the acquired samples has good prediction performance.
The technical solutions adopted by this application are as follows:
A sample acquisition method applied to model training, comprising: performing random sampling on a data source to obtain a training sample set; training a model according to the training sample set to obtain a prediction error rate of the training sample set; redetermining a sampling rate of the data source according to the prediction error rate, and sampling the data source according to the redetermined sampling rate to obtain an updated training sample set; iteratively executing the steps of training the model according to the updated training sample set, redetermining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redetermined sampling rate; and taking the updated training sample set as a target sample set, the target sample set being used for subsequent training of the model.
A sample acquisition device applied to model training, comprising: a data source sampling module for performing random sampling on a data source to obtain a training sample set; a model training module for training a model according to the training sample set to obtain a prediction error rate of the training sample set; a training sample update module for redetermining a sampling rate of the data source according to the prediction error rate, and sampling the data source according to the redetermined sampling rate to obtain an updated training sample set; an iteration execution module for iteratively executing the steps of training the model according to the updated training sample set, redetermining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redetermined sampling rate; and a target sample obtaining module for taking the updated training sample set as a target sample set, the target sample set being used for subsequent training of the model.
A sample acquisition device applied to model training, comprising a processor and a memory, the memory storing computer-readable instructions which, when executed by the processor, implement the sample acquisition method applied to model training as described above.
A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor of a computer, cause the computer to execute the sample acquisition method applied to model training as described above.
The technical solutions adopted by this application have at least the following beneficial effects:
At the initial promotion stage of an Internet application, even if only a small number of samples of the current application are available, a large number of samples from Internet applications of a similar type can be combined with the small number of samples of the current application to form a data source, and a target sample set is then obtained from the data source by the above technical solution. Because the training sample set is continually updated according to the prediction error rate during the acquisition of the target sample set, this update process selects from the data source, as target samples, those samples whose feature distribution is highly similar to that of the current application. When the promotion model is subsequently trained with the target sample set, whose feature distribution is consistent with that of the current application, the prediction deviation of the trained model is small and its prediction performance is good.
In addition, the sample acquisition performed according to the above technical solution does not require a large number of samples to be obtained from the current Internet application, thereby reducing the cost of acquiring the samples required for promotion modeling of the application.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
Detailed description of the invention
The accompanying drawings, which are incorporated into and form part of this specification, show embodiments consistent with this application and, together with the specification, serve to explain the principles of this application. Evidently, the drawings described below are only some embodiments of this application; those of ordinary skill in the art may obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a flowchart of a sample acquisition method applied to model training according to an exemplary embodiment;
Fig. 2 is a flowchart of an embodiment of step 130 in the embodiment shown in Fig. 1;
Fig. 3 is a flowchart of another embodiment of step 130 in the embodiment shown in Fig. 1;
Fig. 4 is a schematic diagram of the updating of a training sample set according to an exemplary embodiment;
Fig. 5 is a flowchart of another embodiment of step 150 in the embodiment shown in Fig. 1;
Fig. 6 is a block diagram of a sample acquisition device applied to model training according to an exemplary embodiment;
Fig. 7 is a schematic diagram of the hardware structure of a sample acquisition device applied to model training according to an exemplary embodiment.
Specific embodiment
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings indicate the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with this application; rather, they are merely examples of devices and methods consistent with some aspects of this application as detailed in the appended claims.
With the development of Internet information technology, the promotion of Internet applications is becoming increasingly intelligent. For example, a machine learning model can predict a promotion coefficient for each user to be promoted to: the higher the predicted promotion coefficient, the more receptive the corresponding user is to the Internet application. Promoting the application to users with higher promotion coefficients improves both the promotion efficiency and the promotion effect of the application.
A machine learning model usually requires a large amount of user data as training samples. At the initial promotion stage of an Internet application, however, user data of the current application is scarce, and a small number of samples alone cannot train the model. A large amount of user data is therefore commonly extracted from other Internet applications of a type similar to the current application to train the model. But because the feature distribution of user data in similar applications is not fully identical to that of user data in the current application, the trained model is prone to deviations in actual prediction, resulting in poor prediction performance.
If, instead, a large number of samples were to be extracted from the current Internet application, the application would first have to be promoted on a large scale to accumulate user data, from which the training samples would then be extracted. This makes the acquisition cost of the samples high and increases the burden on the enterprise.
In order to solve the above technical problem, one aspect of this application provides a sample acquisition method applied to model training, and another aspect provides a sample acquisition device applied to model training.
Referring to Fig. 1, Fig. 1 is a flowchart of a sample acquisition method applied to model training according to an exemplary embodiment. As shown in Fig. 1, in an exemplary embodiment, the sample acquisition method applied to model training may comprise the following steps:
Step 110: perform random sampling on a data source to obtain a training sample set.
The data source is a candidate sample set provided in advance, in which first-class samples, whose feature distribution is identical to that of the prediction samples, are mixed with second-class samples, whose feature distribution is similar to that of the prediction samples.
The candidate sample set contains a large number of candidate samples, which can be used for model training. Prediction samples are the samples on which the trained model makes predictions in actual use.
In order for the model to perform well on the prediction samples in actual use, training samples whose feature distribution is identical or highly similar to that of the prediction samples should be selected for training. Therefore, qualifying candidate samples need to be chosen from the data source as target samples for model training.
To obtain qualifying target samples, the data source is first randomly sampled to obtain a training sample set. Since the training sample set is obtained by random sampling of the data source, the sampling rate of every training sample in the set is identical; in other words, every training sample has the same weight.
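As a minimal sketch of step 110 (the function name and the toy data source below are illustrative, not from the patent):

```python
import random

def random_sample(data_source, k, seed=0):
    """Draw the initial training sample set by uniform random sampling:
    every candidate sample has the same sampling rate, so every drawn
    training sample starts with the same weight."""
    rng = random.Random(seed)
    return rng.sample(data_source, k)

# Hypothetical data source mixing a few first-class samples with many
# second-class samples.
source = [("first", i) for i in range(10)] + [("second", i) for i in range(90)]
training_set = random_sample(source, 20)
```

Because `random.sample` draws without replacement and uniformly, no candidate is preferred over any other at this stage.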
In one embodiment, the data source is stored in the nodes of a blockchain network and is acquired from the blockchain network.
A blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain may comprise an underlying blockchain platform, a platform product service layer, and an application service layer.
The underlying blockchain platform may comprise processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for managing the identity information of all blockchain participants, including maintaining public/private key generation (account management), key management, and the correspondence between users' real identities and blockchain addresses (authority management), and, where authorized, supervising and auditing the transactions of certain real identities and providing risk control rule configuration (risk audit). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and record valid requests in storage after consensus: for a new service request, the basic service first performs interface adaptation parsing and authentication (interface adaptation), then encrypts the business information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records it in storage. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution: a developer can define contract logic in a programming language and publish it on the blockchain (contract registration); according to the logic of the contract terms, a key or another triggering event is invoked to execute the contract and complete its logic; the module also provides the function of revoking or upgrading contracts. The operation monitoring module is mainly responsible for deployment during product release, configuration modification, contract settings, cloud adaptation, and the visual output of the real-time status of the running product, such as alarms, monitoring of network conditions, and monitoring of node device health.
Step 120: train a model according to the training sample set to obtain a prediction error rate of the training sample set.
It should first be noted that the model described in this embodiment is a machine learning model, for example one of a logistic regression model, a decision tree model, a support vector machine model, or a neural network model, without limitation here.
The process of training the model with the training sample set is essentially a process in which the model performs label prediction for every training sample in the set. By counting the model's prediction results over the training samples, the prediction error rate of the training sample set is obtained.
In one embodiment, each training sample in the set is input to the model to obtain the model's predicted label for it. Comparing the predicted label with the sample's true label shows whether the model's prediction for that sample is correct, and the ratio of the number of mispredicted training samples to the total number of samples in the set gives the prediction error rate of the training sample set.
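The misprediction ratio described above can be sketched as follows (a toy example; labels are assumed binary):

```python
def prediction_error_rate(true_labels, predicted_labels):
    """Ratio of mispredicted training samples to the total number of
    samples in the training sample set."""
    wrong = sum(1 for y, p in zip(true_labels, predicted_labels) if y != p)
    return wrong / len(true_labels)

rate = prediction_error_rate([1, 0, 1, 1], [1, 1, 1, 0])  # 2 of 4 wrong
```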
In another embodiment, the prediction error rate of the training sample set is calculated according to the weight of each training sample. The formula for calculating the prediction error rate is as follows:

∈_t = [ Σ_{i=1}^{t} w_i^t · |h_t(x_i) − c(x_i)| ] / [ Σ_{i=1}^{t} w_i^t ]

where t indicates the number of training samples in the training sample set, i indexes the training samples, n indicates the number of first-class samples in the set, m indicates the number of second-class samples (t = n + m), h_t(x_i) indicates the predicted label of each training sample, c(x_i) indicates the true label of each training sample, and w_i^t indicates the weight of each training sample.
As mentioned above, the weight of each training sample corresponds to its sampling rate in the data source; therefore, in this step the weight of every training sample is identical, for example all equal to 1.
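The weighted error rate can be sketched as follows; with all weights equal to 1 it reduces to the plain misprediction ratio (function name and toy values are illustrative):

```python
def weighted_error_rate(weights, true_labels, predicted_labels):
    """Weighted prediction error rate of the training sample set:
    sum of w_i * |h_t(x_i) - c(x_i)| normalised by the total weight."""
    total = sum(weights)
    err = sum(w * abs(c - h)
              for w, c, h in zip(weights, true_labels, predicted_labels))
    return err / total

# With unit weights this equals the misprediction ratio (2 of 4 wrong).
eps = weighted_error_rate([1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 0])
```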
Step 130: redetermine the sampling rate of the data source according to the prediction error rate, and sample the data source according to the redetermined sampling rate to obtain an updated training sample set.
It is readily understood that, during model training, the feature distribution of the training samples affects the model's learning of its prediction parameters, and hence the model's prediction results on the training samples. The prediction error rate of the training sample set therefore reflects the differences in feature distribution among the training samples in the set.
In other words, when the prediction error rate of the training sample set, or the prediction error rate corresponding to individual training samples, exceeds a threshold, the feature distribution of the second-class samples in the set differs, possibly greatly, from that of the first-class samples; equivalently, the feature distribution of the second-class samples contained in the training sample set differs to some extent, or greatly, from that of the prediction samples.
Therefore, to obtain a model with good prediction results, training samples whose feature distribution is more similar to that of the prediction samples must be further selected from the data source; that is, the training sample set must be updated, and the model trained according to the updated training sample set.
In this embodiment, the sampling rate of the data source is redetermined according to the prediction error rate, and the data source is sampled according to the redetermined rate. In the newly obtained training sample set, candidate samples in the data source whose feature distribution differs greatly from that of the prediction samples have been filtered out, so that the feature distribution of the training samples in the updated set is closer to that of the prediction samples.
Training the model with the updated training sample set then allows the model to learn and update its prediction parameters in a targeted way, guaranteeing the model's prediction performance in actual use.
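Resampling the data source under the redetermined per-sample rates can be sketched with weighted drawing, under the assumption that the redetermined sampling rates act as relative drawing weights:

```python
import random

def resample(data_source, sampling_rates, k, seed=0):
    """Redraw the training sample set with per-candidate sampling rates:
    candidates whose rate was increased are drawn more often, candidates
    whose rate was decreased tend to be filtered out."""
    rng = random.Random(seed)
    return rng.choices(data_source, weights=sampling_rates, k=k)

source = ["a", "b", "c"]
updated_set = resample(source, [4.0, 1.0, 0.1], k=10)
```

`random.choices` draws with replacement, so a candidate with a very low rate may simply never appear in the updated set.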
Step 140: iteratively execute the steps of training the model according to the updated training sample set, redetermining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redetermined sampling rate.
This step is the iterative execution of steps 120 and 130: the model is trained according to the updated training sample set, the prediction error rate of the updated set is correspondingly obtained, the sampling rate of the data source is redetermined according to that error rate, and a further updated training sample set is obtained according to the redetermined sampling rate, further filtering out candidate samples in the data source whose feature distribution differs greatly from that of the prediction samples.
Step 150: take the updated training sample set as the target sample set.
As mentioned above, the iterative process of step 140 continually updates the training sample set, and the overall similarity between the feature distribution of the updated set and that of the prediction samples increases continually; therefore, after the iteration finishes, the updated training sample set can be taken as the target sample set.
Subsequently, the model is trained according to the target sample set until it stabilizes. Since the feature distribution of the target sample set is highly similar to that of the prediction samples, the trained model performs well on the prediction samples.
Moreover, even if the data source contains only a small number of first-class samples, whose feature distribution is identical to that of the prediction samples, together with a large number of second-class samples, whose feature distribution is similar to that of the prediction samples, the method of this embodiment can still select from the data source a target sample set whose feature distribution is closer to the prediction samples for subsequent training of the model, thereby guaranteeing the model's prediction results.
In a practical application scenario, such as the initial promotion stage of an Internet application, user data of the application to be promoted is scarce, so only a small amount of user data can be extracted from it for model training; this extracted user data constitutes the first-class samples. Since model training usually requires a large number of training samples, a large amount of user data is also extracted from Internet applications of a type similar to the application to be promoted; this extracted user data constitutes the second-class samples. The first-class and second-class samples are mixed to form the data source. The prediction samples are the user data of the users to whom the application is to be promoted.
Taking a game application as the Internet application to be promoted as an example, game applications of a similar type may resemble the game to be promoted in conditions such as game category, gameplay, and publisher.
In this application scenario, the model may be a machine learning model that predicts the promotion coefficient of a user to be promoted with respect to the application to be promoted; users with higher predicted promotion coefficients are selected as target users for promotion, achieving a better promotion effect.
In the data source, the feature distribution of the second-class samples may differ greatly from that of the prediction samples. For example, the age of players in the game to be promoted is generally lower than the age of players in similar games, so the second-class samples deviate from the prediction samples in the distribution of the age feature. In order to guarantee the model's prediction results on the prediction samples, a target sample set whose feature distribution is similar to the prediction samples needs to be selected from the data source for model training.
Based on the method provided in this embodiment, the data source is randomly sampled to obtain a training sample set; the model is trained according to the training sample set to obtain its prediction error rate; the sampling rate of the data source is redetermined according to the prediction error rate, and the data source is sampled according to the redetermined rate to obtain an updated training sample set, from which training samples differing greatly from the feature distribution of the users to be promoted have been filtered out; the steps of training the model according to the updated training sample set, redetermining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redetermined sampling rate are then executed iteratively, and the updated training sample set is taken as the target sample set.
During the iterative updating of the training sample set, the overall similarity between the feature distribution of the training sample set and that of the users to be promoted increases continually, and the feature distribution of the resulting target sample set is consistent with the application to be promoted.
Therefore, when the model undergoes subsequent training according to the target sample set, the deviation of its predictions of the promotion coefficients of users to be promoted is small and its prediction performance is good; and there is no need to obtain a large amount of user data from the application to be promoted as training samples, which reduces the acquisition cost of the samples required for promotion modeling of the application.
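Steps 110 through 150 can be put together in one compact sketch under assumed details: the helper names, the class tags, the dummy model, and the default b = -1 are illustrative, not from the patent, and `train_and_predict` stands in for whatever model step 120 uses.

```python
import random

def acquire_target_samples(data_source, labels, train_and_predict,
                           rounds=3, k=30, b=-1.0, seed=0):
    """TrAdaBoost-style sketch: sample, train, compute the error rate,
    redetermine per-candidate sampling rates, resample, iterate; the
    final updated training sample set is the target sample set."""
    rng = random.Random(seed)
    rates = [1.0] * len(data_source)               # equal initial rates
    idx = rng.sample(range(len(data_source)), k)   # step 110
    for _ in range(rounds):
        y = [labels[i] for i in idx]
        preds = train_and_predict([data_source[i] for i in idx], y)  # step 120
        diffs = [abs(t - p) for t, p in zip(y, preds)]
        eps = min(max(sum(diffs) / len(diffs), 1e-6), 1 - 1e-6)
        beta = (eps / (1 - eps)) ** b              # sampling factor; b < 0
                                                   # keeps beta > 1 for eps < 0.5
        for i, d in zip(idx, diffs):               # step 130
            if data_source[i][0] == "first":
                rates[i] *= beta ** d              # raise mispredicted first-class
            else:
                rates[i] *= beta ** -d             # lower mispredicted second-class
        idx = rng.choices(range(len(data_source)), weights=rates, k=k)  # step 140
    return [data_source[i] for i in idx]           # step 150

# Toy run with a dummy model that always predicts label 0.
source = ([("first", i % 2) for i in range(20)]
          + [("second", (i + 1) % 2) for i in range(80)])
labels = [s[1] for s in source]
target_set = acquire_target_samples(source, labels, lambda X, y: [0] * len(X))
```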
Referring to Fig. 2, Fig. 2 is a flowchart of an embodiment of step 130 in the embodiment shown in Fig. 1. As shown in Fig. 2, in an exemplary embodiment, the data source comprises a candidate sample set, and the process of redetermining the sampling rate of the data source according to the prediction error rate in step 130 may comprise the following steps:
Step 131: calculate, according to the prediction error rate, a sampling factor for sampling each candidate sample in the candidate sample set;
Step 132: increase, according to the sampling factor, the sampling rate of mispredicted first-class samples, and decrease the sampling rate of mispredicted second-class samples.
The sampling factor is used to adjust the sampling rate of each candidate sample in the candidate sample set, and is calculated as follows:

β = (∈_t / (1 − ∈_t))^b

where β indicates the sampling factor, ∈_t indicates the prediction error rate calculated in step 120, and b is a hyperparameter that adjusts the magnitude of the sampling factor. It should be noted that the value of β is greater than 1.
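The sampling factor can be computed as below. The default b = -1 is an assumption chosen so that β > 1 holds for error rates below 0.5, as the text requires; the patent does not fix a value for b.

```python
def sampling_factor(error_rate, b=-1.0):
    """Sampling factor beta = (eps / (1 - eps)) ** b, where b is a
    hyperparameter adjusting its magnitude. With b negative (assumed
    here), beta > 1 whenever the error rate is below 0.5."""
    return (error_rate / (1.0 - error_rate)) ** b

beta = sampling_factor(0.2)  # 0.2 / 0.8 = 0.25; 0.25 ** -1 = 4.0
```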
Due to first kind sample and forecast sample feature distribution having the same, model is pre- for the label of first kind sample Survey should be correctly, but in the training of model, however it remains the first kind sample of prediction error, then illustrate model for The prediction of first kind sample needs that more consideration is given to need to increase power of the first kind sample of prediction error in model training Weight.That is, needing to increase in candidate samples set for the sample rate of the first kind sample of prediction error.
In one embodiment, the step of increasing the sampling rate of mispredicted first-class samples according to the sampling factor is as follows: for a mispredicted first-class sample, perform a power operation with the sampling factor as the base and the absolute value of the difference between the sample's true label and predicted label as the exponent, obtaining the sampling rate of the first-class sample.
The sampling rate of a first-class sample is calculated as P_a = β^|y − ypred|, where P_a denotes the sampling rate of the first-class sample, y denotes its true label, and ypred denotes the model's predicted label for it.
Illustratively, for a mispredicted first-class sample, |y − ypred| equals 1; since β is greater than 1, P_a is also greater than 1, which increases the sampling rate of mispredicted first-class samples. For a correctly predicted first-class sample, |y − ypred| is 0, so the sampling rate of correctly predicted first-class samples remains unchanged.
Since second-class samples have a feature distribution similar to that of the prediction samples, and the similarity between each second-class sample's feature distribution and that of the prediction samples varies, a second-class sample that the model mispredicts indicates that its feature distribution differs greatly from that of the prediction samples, so the sampling rate of mispredicted second-class samples in the candidate sample set needs to be reduced.
In one embodiment, the step of reducing the sampling rate of mispredicted second-class samples according to the sampling factor is as follows: for a mispredicted second-class sample, perform a power operation with the sampling factor as the base and the negative of the absolute value of the difference between the sample's true label and predicted label as the exponent, obtaining the sampling rate of the second-class sample.
The sampling rate of a second-class sample is calculated as P_b = β^(−|y − ypred|), where P_b denotes the sampling rate of the second-class sample, y denotes its true label, and ypred denotes the model's predicted label for it.
Similarly, for a mispredicted second-class sample, |y − ypred| equals 1 and P_b is less than 1, which reduces the sampling rate of mispredicted second-class samples. For a correctly predicted second-class sample, |y − ypred| is 0, so the sampling rate of correctly predicted second-class samples remains unchanged.
Through the above process, therefore, the sampling rates of the first-class and second-class samples in the candidate sample set can be redefined.
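The two power rules above (P_a = β^|y − ypred| for first-class samples, P_b = β^(−|y − ypred|) for second-class samples) can be sketched in one hypothetical helper; the function signature is an assumption, not part of the patent:

```python
def updated_sampling_rate(beta: float, y_true: int, y_pred: int,
                          first_class: bool) -> float:
    """Apply the two power rules above.

    First-class sample:  P_a = beta ** |y - y_pred|  (a misprediction raises
    the rate above 1 when beta > 1; a correct prediction leaves it at 1).
    Second-class sample: P_b = beta ** -|y - y_pred| (a misprediction lowers
    the rate below 1; a correct prediction leaves it at 1).
    """
    diff = abs(y_true - y_pred)
    return beta ** (diff if first_class else -diff)
```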
Fig. 3 is a flowchart of another embodiment of step 130 in the embodiment shown in Fig. 1. As shown in Fig. 3, in an exemplary embodiment, the process of sampling the data source according to the redefined sampling rate in step 130 to obtain an updated training sample set may include the following steps:
Step 135: according to the redefined sampling rates, filtering out candidate samples in the data source whose sampling rate is lower than a preset sampling-rate threshold;
Step 136: randomly sampling the candidate sample set obtained by the filtering, obtaining the updated training sample set.
As mentioned above, the redefined sampling rate of a first-class sample that the model mispredicted is greater than 1, while the sampling rate of a correctly predicted first-class sample remains 1; the sampling rates of first-class samples in the data source are therefore all at least 1.
The redefined sampling rate of a second-class sample that the model mispredicted is less than 1, because the feature distributions of these second-class samples differ greatly from that of the prediction samples, while the sampling rate of a correctly predicted second-class sample remains 1.
Thus, to obtain a target sample set whose feature distribution is closer to that of the prediction samples, it suffices to filter out of the data source those second-class samples whose feature distributions differ greatly from that of the prediction samples.
Accordingly, a preset sampling-rate threshold is set, corresponding to the allowable degree of feature-distribution difference between a candidate sample and the prediction samples. According to the redefined sampling rates, candidate samples in the data source whose sampling rate is lower than the preset threshold are filtered out, and the candidate sample set obtained by the filtering is randomly sampled; the resulting training sample set has thus had the second-class samples whose feature distributions differ greatly from the prediction samples removed. Note that the preset sampling-rate threshold is less than 1.
In another embodiment, the data source may instead be sampled directly according to the redefined sampling rates, with second-class samples whose sampling rate is below the preset threshold filtered out during the sampling itself.
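Steps 135 and 136 can be sketched as follows, under stated assumptions: the helper name, the use of rate-weighted random choice, and the fixed seed are all illustrative, since the patent only requires threshold filtering followed by random sampling:

```python
import random

def filter_and_sample(candidates, rates, threshold, k, seed=0):
    """Step 135: drop candidates whose redefined sampling rate falls below
    the preset threshold (a value less than 1); step 136: randomly draw k
    samples from the remainder, weighted by the surviving rates."""
    kept = [(c, r) for c, r in zip(candidates, rates) if r >= threshold]
    pool, weights = zip(*kept)
    return random.Random(seed).choices(pool, weights=weights, k=k)
```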
By the method of this embodiment, the training sample set used for model training can thus be updated. When the next round of iterative updating is performed on the updated training sample set, the weights of the first-class samples mispredicted in the previous round have increased, so these first-class samples have a relatively high influence on model prediction in the current round of training; and the weights of the second-class samples mispredicted in the previous round have decreased, so these second-class samples have a relatively low influence on model prediction in the current round. The sampling rates of the candidate samples in the data source are then redefined according to the newly obtained prediction error rate, and the updated training sample set determined from the redefined rates further filters out the second-class samples whose feature distributions differ greatly from the prediction samples.
After several iterations of updating the training sample set, candidate samples in the data source whose feature distribution is identical or highly similar to that of the prediction samples have a higher sampling rate (that is, a higher weight in model training), while the sampling rates of candidate samples whose feature distributions differ greatly from the prediction samples are reduced.
As shown in Fig. 4, let Ta denote the first-class samples in the training sample set obtained by sampling the data source, Tb denote the second-class samples in that training sample set, and S denote the prediction sample set. During the iterative updating of the training sample set, because the sampling rate of mispredicted first-class samples increases accordingly, the number of first-class samples in each updated training sample set also increases; and because the sampling rate of mispredicted second-class samples decreases accordingly, the number of second-class samples in each updated training sample set decreases accordingly. With each update, the overall feature distribution of the training sample set also moves closer to that of the prediction sample set.
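The iterative update described above can be sketched end to end; the dictionary layout of the candidates and the caller-supplied training callback are assumptions standing in for the model training step, which the patent leaves abstract:

```python
def iterate_training_rates(candidates, train_and_evaluate, n_rounds, b=1.0):
    """Run the iterative sampling-rate update loop described above.

    `candidates` is a list of dicts with keys 'y' (true label) and
    'first_class' (True for first-class samples); features are omitted for
    brevity. `train_and_evaluate` must train the model on the candidates
    with their current rates and return (prediction_error_rate,
    predicted_labels).
    """
    rates = [1.0] * len(candidates)
    for _ in range(n_rounds):
        error_rate, preds = train_and_evaluate(candidates, rates)
        beta = (error_rate / (1.0 - error_rate)) ** b
        for i, (cand, y_hat) in enumerate(zip(candidates, preds)):
            diff = abs(cand['y'] - y_hat)
            # Raise the rate for mispredicted first-class samples,
            # lower it for mispredicted second-class samples.
            rates[i] *= beta ** (diff if cand['first_class'] else -diff)
    return rates
```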
In another exemplary embodiment, the process of taking the updated training sample set as the target sample set may include the following steps:
After the step of iteratively training the model on the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate has been performed a preset number of times, the most recently updated training sample set is taken as the target sample set.
In other words, this embodiment sets a preset number of iterations for the iterative updating of the training sample set. Once that number of iterations is reached, the overall feature distribution of the most recently updated training sample set is considered sufficiently close to that of the prediction sample set, so the most recently updated training sample set is taken as the target sample set. The model is then further trained on the target sample set until it stabilizes, which allows the model to achieve good prediction performance in the actual prediction of the prediction samples.
In another exemplary embodiment, as shown in Fig. 5, the process of taking the updated training sample set as the target sample set may include the following steps:
Step 151: after each execution of the step of training the model on the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate, obtaining the feature-distribution similarity between the updated training sample set and the prediction samples.
Step 152: when the feature-distribution similarity reaches a preset similarity threshold, taking the most recently updated training sample set as the target sample set.
Here, to bring the feature distribution of the obtained target sample set still closer to that of the prediction sample set, the feature-distribution similarity between the updated sample set and the prediction sample set can be calculated after each round of updating the training sample set.
If this feature-distribution similarity is below the preset similarity threshold, the currently updated training sample set is not yet satisfactory, and the next round of updating must continue, until the feature-distribution similarity between the updated training sample set and the prediction sample set reaches the similarity threshold.
Compared with the embodiment that obtains the target sample set after a set number of iterations, this embodiment precisely quantifies, via the feature-distribution similarity, how close the training sample set is to the prediction samples, making the obtained target sample set more accurate.
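The patent does not say how the feature-distribution similarity is computed; one plausible choice, shown purely as an assumption, is the overlap of normalized histograms over a single feature (1.0 means identical binned distributions, 0.0 means disjoint ones):

```python
def histogram_similarity(a, b, bins=10):
    """Overlap of two normalized 1-D histograms: sum over bins of the
    per-bin minimum of the two relative frequencies."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / bins or 1.0  # guard against all-equal inputs

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [c / len(xs) for c in counts]

    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))
```

The update loop would then stop once this score reaches the preset similarity threshold.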
Fig. 6 is a block diagram of a sample acquisition apparatus applied to model training according to an exemplary embodiment. As shown in Fig. 6, in an exemplary embodiment, the sample acquisition apparatus applied to model training includes a data source sampling module 210, a model training module 220, a training sample updating module 230, an iteration execution module 240 and a target sample obtaining module 250.
The data source sampling module 210 is configured to randomly sample a data source to obtain a training sample set.
The model training module 220 is configured to train a model according to the training sample set to obtain the prediction error rate of the training sample set.
The training sample updating module 230 is configured to redefine the sampling rate of the data source according to the prediction error rate, and to sample the data source according to the redefined sampling rate to obtain an updated training sample set.
The iteration execution module 240 is configured to iteratively execute the step of training the model on the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate.
The target sample obtaining module 250 is configured to take the updated training sample set as the target sample set.
In another exemplary embodiment, the model training module 220 includes a prediction label obtaining unit and a prediction rate calculation unit.
The prediction label obtaining unit is configured to input each training sample in the training sample set into the model to obtain the prediction label that the model predicts for each training sample.
The prediction rate calculation unit is configured to calculate the prediction error rate of the training sample set according to the true label, weight and prediction label of each training sample, where the weight corresponds to the sampling rate of the data source.
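The exact weighted error formula is not given in the text; a plausible sketch, under the assumption of a weight-normalized 0-1 loss, is:

```python
def weighted_error_rate(y_true, y_pred, weights):
    """Weight-normalized 0-1 prediction error rate: the summed weight of
    mispredicted training samples divided by the total weight, so samples
    with a higher sampling rate contribute more to the error."""
    total = sum(weights)
    wrong = sum(w for y, p, w in zip(y_true, y_pred, weights) if y != p)
    return wrong / total
```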
In another exemplary embodiment, the data source includes a candidate sample set, and the candidate sample set includes first-class samples whose feature distribution is identical to that of the prediction samples, and second-class samples whose feature distribution is similar to that of the prediction samples; the training sample updating module 230 includes a sampling factor calculation unit and a sampling-rate adjustment unit.
The sampling factor calculation unit is configured to calculate, according to the prediction error rate, a sampling factor for sampling each candidate sample in the candidate sample set.
The sampling-rate adjustment unit is configured to increase, according to the sampling factor, the sampling rate of mispredicted first-class samples, and to reduce the sampling rate of mispredicted second-class samples.
In another exemplary embodiment, the sampling-rate adjustment unit includes a sampling-rate increase subunit configured, for a mispredicted first-class sample, to perform a power operation with the sampling factor as the base and the absolute value of the difference between the sample's true label and predicted label as the exponent, obtaining the sampling rate of the first-class sample.
In another exemplary embodiment, the sampling-rate adjustment unit further includes a sampling-rate reduction subunit configured, for a mispredicted second-class sample, to perform a power operation with the sampling factor as the base and the negative of the absolute value of the difference between the sample's true label and predicted label as the exponent, obtaining the sampling rate of the second-class sample.
In another exemplary embodiment, the number of first-class samples is smaller than the number of second-class samples.
In another exemplary embodiment, the training sample updating module 230 further includes a sample filtering unit and a filtered-sampling unit.
The sample filtering unit is configured to filter out, according to the redefined sampling rates, candidate samples in the data source whose sampling rate is lower than the preset sampling-rate threshold.
The filtered-sampling unit is configured to randomly sample the candidate sample set obtained by the filtering, to obtain the updated training sample set.
In another exemplary embodiment, the target sample obtaining module 250 is configured to take the most recently updated training sample set as the target sample set after the step of iteratively training the model on the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate has been performed a preset number of times.
In another exemplary embodiment, the target sample obtaining module 250 includes a similarity calculation unit and a threshold comparison unit.
The similarity calculation unit is configured to obtain the feature-distribution similarity between the updated training sample set and the prediction sample set after each execution of the step of training the model on the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate.
The threshold comparison unit is configured to take the most recently updated training sample set as the target sample set when the feature-distribution similarity reaches the preset similarity threshold.
In another exemplary embodiment, the sample acquisition apparatus applied to model training further includes a data source obtaining module configured to obtain the data source from a blockchain network, the data source being stored in nodes of the blockchain network.
It should be noted that the apparatus provided by the above embodiments and the methods provided by the foregoing embodiments belong to the same concept; the specific manner in which the modules and units perform their operations has been described in detail in the method embodiments and is not repeated here.
In another aspect, the present application further provides a sample acquisition device applied to model training, including a processor and a memory, wherein computer-readable instructions are stored in the memory, and when executed by the processor, the computer-readable instructions implement the sample acquisition method applied to model training as described above.
Referring to Fig. 7, Fig. 7 is a hardware structural diagram of a sample acquisition device applied to model training according to an exemplary embodiment.
It should be noted that this device is merely one example suited to the present application and must not be regarded as imposing any limitation on the application's scope of use. Nor can the device be construed as needing to rely on, or needing to include, one or more of the components of the exemplary sample acquisition device shown in Fig. 7.
The hardware configuration of the device may vary considerably depending on configuration or performance. As shown in Fig. 7, the device includes: a power supply 610, an interface 630, at least one memory 650 and at least one central processing unit (CPU) 670.
The power supply 610 is used to provide an operating voltage for each hardware device in the device.
The interface 630 includes at least one wired or wireless network interface 631, at least one serial-to-parallel conversion interface 633, at least one input/output interface 635 and at least one USB interface 637, and is used to communicate with external devices.
The memory 650, as a carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk or an optical disc; the resources stored on it include an operating system 651, application programs 653 and data 655, and the storage manner may be transient storage or permanent storage. The operating system 651 is used to manage and control the hardware devices of the device and the application programs 653, so that the central processing unit 670 can compute and process the massive data 655; it may be Windows Server™, Mac OS X™, Unix™, Linux™, etc. The application programs 653 are computer programs that complete at least one specific task on the basis of the operating system 651; each may include at least one module, and each module may contain a series of computer-readable instructions for the device.
The central processing unit 670 may include one or more processors, is arranged to communicate with the memory 650 through a bus, and is used to compute and process the massive data 655 in the memory 650.
As described in detail above, a sample acquisition device to which the present application applies completes the sample acquisition method applied to model training described above by having the central processing unit 670 read the series of computer-readable instructions stored in the memory 650.
In addition, the present application can equally be implemented by a hardware circuit, or by a hardware circuit combined with software instructions; accordingly, implementing the present application is not limited to any specific hardware circuit, software, or combination of the two.
In another aspect, the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the sample acquisition method applied to model training as described above. The computer-readable storage medium may be included in the sample acquisition device applied to model training described in the above embodiments, or may exist separately without being assembled into that device.
The above is merely a preferred exemplary embodiment of the present application and is not intended to limit the embodiments of the application. A person of ordinary skill in the art can easily make corresponding adaptations or modifications according to the main idea and spirit of the application; the protection scope of the application shall therefore be determined by the protection scope of the claims.

Claims (11)

1. A sample acquisition method applied to model training, characterized by comprising:
randomly sampling a data source to obtain a training sample set;
training a model according to the training sample set to obtain a prediction error rate of the training sample set;
redefining a sampling rate of the data source according to the prediction error rate, and sampling the data source according to the redefined sampling rate to obtain an updated training sample set;
iteratively executing the step of training the model according to the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate; and
taking the updated training sample set as a target sample set, the target sample set being used for subsequent training of the model.
2. The method according to claim 1, characterized in that the training a model according to the training sample set to obtain a prediction error rate of the training sample set comprises:
inputting each training sample in the training sample set into the model to obtain a prediction label that the model predicts for each training sample; and
calculating the prediction error rate of the training sample set according to a true label, a weight and the prediction label of each training sample, the weight corresponding to the sampling rate of the data source.
3. The method according to claim 1, characterized in that the data source comprises a candidate sample set, the candidate sample set comprising first-class samples whose feature distribution is identical to that of prediction samples and second-class samples whose feature distribution is similar to that of the prediction samples; and the redefining a sampling rate of the data source according to the prediction error rate comprises:
calculating, according to the prediction error rate, a sampling factor for sampling each candidate sample in the candidate sample set; and
increasing, according to the sampling factor, the sampling rate of mispredicted first-class samples, and reducing the sampling rate of mispredicted second-class samples.
4. The method according to claim 3, characterized in that the increasing, according to the sampling factor, the sampling rate of mispredicted first-class samples comprises:
for a mispredicted first-class sample, performing a power operation with the sampling factor as the base and the absolute value of the difference between the true label and the prediction label of the first-class sample as the exponent, to obtain the sampling rate of the first-class sample.
5. The method according to claim 3, characterized in that the reducing, according to the sampling factor, the sampling rate of mispredicted second-class samples comprises:
for a mispredicted second-class sample, performing a power operation with the sampling factor as the base and the negative of the absolute value of the difference between the true label and the prediction label of the second-class sample as the exponent, to obtain the sampling rate of the second-class sample.
6. The method according to any one of claims 3 to 5, characterized in that the number of the first-class samples is smaller than the number of the second-class samples.
7. The method according to claim 3, characterized in that the sampling the data source according to the redefined sampling rate to obtain an updated training sample set comprises:
filtering out, according to the redefined sampling rate, candidate samples in the data source whose sampling rate is lower than a preset sampling-rate threshold; and
randomly sampling the candidate sample set obtained by the filtering, to obtain the updated training sample set.
8. The method according to claim 1, characterized in that the taking the updated training sample set as a target sample set comprises:
after the step of iteratively training the model according to the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate has been performed a preset number of times, taking the most recently updated training sample set as the target sample set.
9. The method according to claim 1, characterized in that the taking the updated training sample set as a target sample set comprises:
after each execution of the step of training the model according to the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate, obtaining a feature-distribution similarity between the updated training sample set and a prediction sample set; and
when the feature-distribution similarity reaches a preset similarity threshold, taking the most recently updated training sample set as the target sample set.
10. The method according to claim 1, characterized in that the method further comprises:
obtaining the data source from a blockchain network, the data source being stored in a node of the blockchain network.
11. A sample acquisition apparatus applied to model training, characterized by comprising:
a data source sampling module, configured to randomly sample a data source to obtain a training sample set;
a model training module, configured to train a model according to the training sample set to obtain a prediction error rate of the training sample set;
a training sample updating module, configured to redefine a sampling rate of the data source according to the prediction error rate, and to sample the data source according to the redefined sampling rate to obtain an updated training sample set;
an iteration execution module, configured to iteratively execute the step of training the model according to the updated training sample set, redefining the sampling rate of the data source according to the obtained prediction error rate, and obtaining an updated training sample set according to the redefined sampling rate; and
a target sample obtaining module, configured to take the updated training sample set as a target sample set, the target sample set being used for subsequent training of the model.
CN201910851779.0A 2019-09-05 2019-09-05 Sample obtaining method and device applied to model training, equipment and storage medium Active CN110533489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910851779.0A CN110533489B (en) 2019-09-05 2019-09-05 Sample obtaining method and device applied to model training, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110533489A true CN110533489A (en) 2019-12-03
CN110533489B CN110533489B (en) 2021-11-05

Family

ID=68668011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910851779.0A Active CN110533489B (en) 2019-09-05 2019-09-05 Sample obtaining method and device applied to model training, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110533489B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177238A (en) * 2019-12-13 2020-05-19 北京航天云路有限公司 Data set generation method based on user definition
CN111291416A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for preprocessing data of business model based on privacy protection
CN111797942A (en) * 2020-07-23 2020-10-20 深圳壹账通智能科技有限公司 User information classification method and device, computer equipment and storage medium
CN112395401A (en) * 2020-11-17 2021-02-23 中国平安人寿保险股份有限公司 Adaptive negative sample pair sampling method and device, electronic equipment and storage medium
CN112734086A (en) * 2020-12-24 2021-04-30 贝壳技术有限公司 Method and device for updating neural network prediction model
CN113191824A (en) * 2021-05-24 2021-07-30 北京大米科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN115938353A (en) * 2022-11-24 2023-04-07 北京数美时代科技有限公司 Voice sample distributed sampling method, system, storage medium and electronic equipment
CN112395401B (en) * 2020-11-17 2024-06-04 中国平安人寿保险股份有限公司 Self-adaptive negative sample pair sampling method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779094A (en) * 2017-01-13 2017-05-31 湖南文理学院 A kind of limitation Boltzmann machine learning method and device based on random feedback
CN108091397A (en) * 2018-01-24 2018-05-29 浙江大学 A kind of bleeding episode Forecasting Methodology for the Ischemic Heart Disease analyzed based on promotion-resampling and feature association
CN108875776A (en) * 2018-05-02 2018-11-23 北京三快在线科技有限公司 Model training method and device, business recommended method and apparatus, electronic equipment
CN109598281A (en) * 2018-10-11 2019-04-09 阿里巴巴集团控股有限公司 A kind of business risk preventing control method, device and equipment
US20190213447A1 (en) * 2017-02-08 2019-07-11 Nanjing University Of Aeronautics And Astronautics Sample selection method and apparatus and server


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177238A (en) * 2019-12-13 2020-05-19 北京航天云路有限公司 Data set generation method based on user definition
CN111177238B (en) * 2019-12-13 2023-12-08 北京航天云路有限公司 User-defined data set generation method
CN111291416A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for preprocessing data of business model based on privacy protection
CN111291416B (en) * 2020-05-09 2020-07-31 支付宝(杭州)信息技术有限公司 Method and device for preprocessing data of business model based on privacy protection
CN111797942A (en) * 2020-07-23 2020-10-20 深圳壹账通智能科技有限公司 User information classification method and device, computer equipment and storage medium
CN112395401A (en) * 2020-11-17 2021-02-23 中国平安人寿保险股份有限公司 Adaptive negative sample pair sampling method and device, electronic equipment and storage medium
CN112395401B (en) * 2020-11-17 2024-06-04 中国平安人寿保险股份有限公司 Adaptive negative sample pair sampling method and device, electronic equipment and storage medium
CN112734086A (en) * 2020-12-24 2021-04-30 贝壳技术有限公司 Method and device for updating neural network prediction model
CN113191824A (en) * 2021-05-24 2021-07-30 北京大米科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN115938353A (en) * 2022-11-24 2023-04-07 北京数美时代科技有限公司 Voice sample distributed sampling method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110533489B (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN110533489A (en) Sample acquiring method and device, equipment, storage medium applied to model training
Abdelmoniem et al. Refl: Resource-efficient federated learning
CN101859254B (en) System and method for automatically recommending upgrade approach
US10409699B1 (en) Live data center test framework
US9444717B1 (en) Test generation service
CN109086139B (en) Dynamic fragmentation method, device and computer storage medium
CN110798467B (en) Target object identification method and device, computer equipment and storage medium
US20080244690A1 (en) Deriving remediations from security compliance rules
US9396160B1 (en) Automated test generation service
CN110287111A (en) Test case generation method and device for a user interface
CN113486584B (en) Method and device for predicting equipment failure, computer equipment and computer readable storage medium
CN110188910A (en) Method and system for providing online prediction services using a machine learning model
CN110719320B (en) Method and equipment for generating public cloud configuration adjustment information
CN114430826A (en) Time series analysis for predicting computational workload
CN110991789B (en) Method and device for determining confidence interval, storage medium and electronic device
US11677770B2 (en) Data retrieval for anomaly detection
US20220335297A1 (en) Anticipatory Learning Method and System Oriented Towards Short-Term Time Series Prediction
CN110855648A (en) Early warning control method and device for network attack
CN110427371A (en) Server FRU field management method, device, equipment, and readable storage medium
CN109345373A (en) Write-off risk early-warning method, device, electronic equipment, and computer-readable medium
CN109588054A (en) Accurate and detailed modeling of systems with large complex data sets using a distributed simulation engine
CN107844859B (en) Large medical equipment energy consumption prediction method based on artificial intelligence and terminal equipment
US20190227530A1 (en) Managing activities on industrial products according to compliance with reference policies
CN115660073B (en) Intrusion detection method and system based on harmony whale optimization algorithm
CN107645388A (en) Method, client, and server for implementing networking control of telecommunication equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant