CN115689130A - Reward strategy configuration method, device, electronic equipment, storage medium and product - Google Patents

Reward strategy configuration method, device, electronic equipment, storage medium and product

Info

Publication number
CN115689130A
CN115689130A · Application CN202110785290.5A
Authority
CN
China
Prior art keywords
strategy
user
average
list
order
Prior art date
Legal status
Pending
Application number
CN202110785290.5A
Other languages
Chinese (zh)
Inventor
南征
谷浩然
张斌
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202110785290.5A
Publication of CN115689130A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a reward strategy configuration method, apparatus, electronic device, storage medium and product. The method includes: predicting the per-order average reward strategy obtained by each user and the completed order quantity under that per-order average strategy; fitting the per-order average strategy obtained by each user against the estimated completed order quantity under it, to obtain each user's estimated completed order quantity under high per-order average strategies and each user's total reward amount at the estimated completed order quantity; under a budget limit, determining a target per-order average strategy for each user and the completed order quantity under the target per-order average strategy; selecting a full-order reward strategy for each user from a preset candidate reward strategy set according to each user's total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under the target per-order average strategy; and taking the full-order reward strategy selected for each user as the reward strategy configured for the corresponding user. The invention assigns personalized reward strategies to drivers with different order-taking capacities, improving both driver order-taking efficiency and the platform's order volume.

Description

Reward strategy configuration method, device, electronic equipment, storage medium and product
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a reward policy configuration method and apparatus, an electronic device, a storage medium, and a product.
Background
With the rapid development of the online ride-hailing industry, ride-hailing brings great convenience to people.
In the related art, to encourage drivers to take more orders, the operation platform sets a uniform reward strategy, such as [0, 0, 3, 0, 7, 0, 0, 10], which indicates that extra rewards are attached to the 3rd, 5th and 8th orders a driver completes: the driver receives an extra reward after completing the 3rd order, another after completing the 5th order, another after completing the 8th order, and so on. The reward attached to each order can differ, and typically the reward amount grows as more orders are completed. As other examples, the reward policy may be set to [3, 4, 0], [0, 7], and so on.
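Such a positional reward vector can be read as a per-order lookup. A minimal sketch of this interpretation in Python, where the helper name and amounts are illustrative rather than taken from any platform:

```python
def cumulative_reward(policy, completed_orders):
    """Total extra reward earned after completing `completed_orders` orders.

    policy[i] is the extra reward (in yuan) paid when the (i+1)-th order
    is completed; 0 means no reward is attached to that order.
    """
    return sum(policy[:completed_orders])

# Illustrative policy: rewards attached to the 3rd, 5th and 8th completed orders.
policy = [0, 0, 3, 0, 7, 0, 0, 10]
print(cumulative_reward(policy, 5))  # 3 + 7 = 10 yuan after 5 completed orders
print(cumulative_reward(policy, 8))  # 3 + 7 + 10 = 20 yuan after 8 completed orders
```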
However, in the related art the reward policies set for drivers are static: a single, uniform reward policy is applied to all drivers. Because each driver's order-taking capacity differs, a uniform reward policy cannot motivate some drivers to take more orders. The configuration of reward strategies is therefore one-size-fits-all and unbalanced, certain drivers are not incentivized to take more orders, and platform order volume drops.
Therefore, how to configure personalized reward strategies for different drivers, so as to improve driver order-taking efficiency and platform order volume, is a technical problem to be solved.
Disclosure of Invention
The invention provides a reward strategy configuration method, apparatus, electronic device, storage medium and product, to at least solve the technical problems in the related art that, because personalized reward strategies cannot be configured for driver users, reward strategy configuration is uniform and unbalanced, order-taking efficiency is low, and platform order volume decreases. The technical solution of the invention is as follows.
According to a first aspect of the embodiments of the present invention, there is provided a reward policy configuration method, including:
predicting the per-order average reward strategy obtained by each user and the completed order quantity under that per-order average strategy;
fitting the per-order average strategy obtained by each user against the estimated completed order quantity under it, to obtain each user's estimated completed order quantity under high per-order average strategies and each user's total reward amount at the estimated completed order quantity;
under a budget limit, determining a target per-order average strategy for each user and the completed order quantity under the target per-order average strategy;
selecting a full-order reward strategy for each user from a preset candidate reward strategy set according to each user's total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under the target per-order average strategy;
and distributing the selected full-order reward strategy to the corresponding user.
Optionally, predicting the per-order average strategy obtained by each user and the completed order quantity under it includes:
under a full-order reward, obtaining the per-order average strategy obtained by each user, the historical order completions and strategies, and historical data within a preset time period;
sampling the obtained per-order average strategy of each user to obtain per-order average sampling data;
and inputting the per-order average sampling data, the historical order completions and strategies, and the historical data within the preset time period into a trained prediction model for prediction, to obtain the per-order average strategy obtained by each user and the completed order quantity under it.
Optionally, the method further includes:
acquiring variable data of each user;
and carrying out prediction training on the prediction model based on the variable data to obtain a trained prediction model.
Optionally, the variable data includes: the user's historical order completions and strategies, the historical order completions and strategies of the user's group, and the per-order average strategy actually obtained by the user's group on the same day;
performing prediction training on the prediction model based on the variable data to obtain the trained prediction model includes:
inputting the historical order completions and strategies, the historical order completions and strategies of the user's group, and the per-order average strategy actually obtained by the user's group on the same day into a first-stage XGB model for prediction training, to obtain the per-order average strategy expected to be obtained by each user on the same day;
and inputting the per-order average strategy obtained by each user, the historical order completions and strategies, and the historical data within the preset time period into a second-stage XGB model for prediction training, to obtain the completed order quantity under the per-order average strategy, the second-stage XGB model serving as the trained prediction model.
Optionally, fitting the predicted per-order average strategy obtained by each user against the estimated completed order quantity under it, to obtain each user's estimated completed order quantity under high per-order average strategies and each user's total reward amount at the estimated completed order quantity, includes:
performing curve fitting on the per-order average strategy obtained by each user and the completed order quantity under it, to obtain a fitted curve between each user's per-order average strategy and the corresponding maximum completed order quantity under high per-order average strategies;
estimating, from the fitted curve, each user's estimated completed order quantity under a high per-order average strategy;
and calculating the product of the high per-order average strategy and each user's estimated completed order quantity under it, to obtain each user's total reward amount at the estimated completed order quantity.
Optionally, each user's estimated completed order quantity under a high per-order average strategy is estimated through the following fitted curve formula:
r(k)    [formula image BDA0003158487290000031]
where r(k) is the estimated completed order quantity under a high per-order average; k is the user's per-order average strategy; and a, b and c are constants, calculated from the per-order average strategies actually obtained by different users and the completed order quantities under those strategies.
Optionally, determining, under the budget limit, the optimal target per-order average strategy obtained by each user and the completed order quantity under the target per-order average strategy includes:
under the budget limit, determining the optimal target per-order average strategy obtained by each user and the completed order quantity under it according to a per-order average optimal allocation algorithm, which is implemented by the following formulas:
max Σ_i Σ_j x_{i,j} · r_{i,j}
s.t. Σ_j x_{i,j} = 1 for every user i
Σ_i Σ_j x_{i,j} · k_j · r_{i,j} = K_0
x_{i,j} ∈ {0, 1}
where max denotes maximizing the total completed order quantity over all users; x_{i,j} is a decision variable indicating whether the j-th target per-order average strategy is assigned to user i; r_{i,j} denotes the completed order quantity of user i under the j-th target per-order average strategy, read off the fitted curve; k_j denotes the per-order average value of the j-th per-order average strategy; K_0 denotes the total reward budget; and s.t. denotes the constraints on the user population, including: each user can be assigned only one per-order average strategy, and the predicted total reward of all users equals the per-order average budget.
Optionally, selecting a full-order reward strategy for each user from the preset candidate reward strategy set according to each user's total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under the target per-order average strategy includes:
calculating the product of the target per-order average strategy obtained by each user and the completed order quantity under it, to obtain a corresponding product result;
comparing each user's total reward amount at the estimated completed order quantity with the corresponding product result, to obtain a corresponding difference;
and selecting, from the preset candidate reward strategy set, the candidate reward strategy with the smallest difference as the corresponding user's full-order reward strategy.
According to a second aspect of the embodiments of the present invention, there is provided a reward policy configuration apparatus, including:
a prediction module, configured to predict the per-order average strategy obtained by each user and the completed order quantity under it;
a fitting processing module, configured to fit the per-order average strategy obtained by each user against the estimated completed order quantity under it, to obtain each user's estimated completed order quantity under high per-order average strategies and each user's total reward amount at the estimated completed order quantity;
a determining module, configured to determine, under the budget limit, a target per-order average strategy for each user and the completed order quantity under it;
a selection module, configured to select a full-order reward strategy for each user from preset candidate reward strategies according to each user's total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under it;
and a distribution module, configured to distribute the full-order reward strategy selected by the selection module to the corresponding user.
Optionally, the prediction module includes:
a first obtaining module, configured to obtain, under a full-order reward, the per-order average strategy obtained by each user, the historical order completions and strategies, and historical data within a preset time period;
a sampling module, configured to sample the per-order average strategy of each user obtained by the first obtaining module, to obtain per-order average sampling data;
and a prediction processing module, configured to input the per-order average sampling data from the sampling module, together with the historical order completions and strategies and the historical data within the preset time period obtained by the first obtaining module, into a trained prediction model for prediction, to obtain the per-order average strategy obtained by each user and the completed order quantity under it.
Optionally, the apparatus further includes:
a second obtaining module, configured to obtain variable data of each user;
and a prediction training module, configured to perform prediction training on the prediction model based on the variable data, to obtain the trained prediction model.
Optionally, the variable data obtained by the second obtaining module includes: the user's historical order completions and strategies, the historical order completions and strategies of the user's group, and the per-order average strategy actually obtained by the user's group on the same day;
the prediction training module includes:
a first prediction training module, configured to input the historical order completions and strategies, the historical order completions and strategies of the user's group, and the per-order average strategy actually obtained by the user's group on the same day into a first-stage XGB model for prediction training, to obtain the per-order average strategy expected to be obtained by each user on the same day;
and a second prediction training module, configured to input the per-order average strategy obtained by each user, the historical order completions and strategies, and the historical data within the preset time period into a second-stage XGB model for prediction training, to obtain the completed order quantity under the per-order average strategy, the second-stage XGB model serving as the trained prediction model.
Optionally, the fitting processing module includes:
a curve fitting module, configured to perform curve fitting on the per-order average strategy obtained by each user and the completed order quantity under it, to obtain a fitted curve between each user's per-order average strategy and the corresponding maximum completed order quantity under high per-order average strategies;
an order quantity estimation module, configured to estimate, from the fitted curve, each user's estimated completed order quantity under a high per-order average strategy;
and a first calculation module, configured to calculate the product of the high per-order average strategy and each user's estimated completed order quantity under it, to obtain each user's total reward amount at the estimated completed order quantity.
Optionally, the order quantity estimation module estimates each user's estimated completed order quantity under a high per-order average strategy through the following fitted curve formula:
r(k)    [formula image BDA0003158487290000061]
where r(k) is the estimated completed order quantity under a high per-order average; k is the user's per-order average strategy; and a, b and c are constants, calculated from the per-order average strategies actually obtained by different users and the completed order quantities under those strategies.
Optionally, the determining module is specifically configured to determine, under the budget limit, the optimal target per-order average strategy obtained by each user and the completed order quantity under it according to a per-order average optimal allocation algorithm, which is implemented by the following formulas:
max Σ_i Σ_j x_{i,j} · r_{i,j}
s.t. Σ_j x_{i,j} = 1 for every user i
Σ_i Σ_j x_{i,j} · k_j · r_{i,j} = K_0
x_{i,j} ∈ {0, 1}
where max denotes maximizing the total completed order quantity over all users; x_{i,j} is a decision variable indicating whether the j-th target per-order average strategy is assigned to user i; r_{i,j} denotes the completed order quantity of user i under the j-th target per-order average strategy, read off the fitted curve; k_j denotes the per-order average value of the j-th per-order average strategy; K_0 denotes the total reward budget; and s.t. denotes the constraints on the user population, including: each user can be assigned only one per-order average strategy, and the predicted total reward of all users equals the per-order average budget.
Optionally, the selection module includes:
a second calculation module, configured to calculate the product of the target per-order average strategy obtained by each user and the completed order quantity under it, to obtain a corresponding product result;
a comparison module, configured to compare each user's total reward amount at the estimated completed order quantity with the corresponding product result, to obtain a corresponding difference;
and a strategy selection module, configured to select, from the preset candidate reward strategies, the candidate reward strategy with the smallest difference as the corresponding user's full-order reward strategy.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to perform any of the reward policy configuration methods described above.
According to a fourth aspect of embodiments of the present invention, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, cause the electronic device to perform any one of the above-mentioned reward policy configuration methods.
According to a fifth aspect of embodiments of the present invention, there is provided a computer program product, wherein instructions of the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform any one of the above-mentioned reward policy configuration methods.
The technical scheme provided by the embodiment of the invention at least has the following beneficial effects:
In the embodiment of the invention, the per-order average strategy obtained by each user and the completed order quantity under it are predicted; the per-order average strategy obtained by each user is fitted against the estimated completed order quantity under it, giving each user's estimated completed order quantity under high per-order average strategies and each user's total reward amount at the estimated completed order quantity; under a budget limit, a target per-order average strategy and the completed order quantity under it are determined for each user; a full-order reward strategy is selected for each user from a preset candidate reward strategy set according to each user's total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under it; and the selected full-order reward strategy is distributed to the corresponding user. That is, in the embodiment of the invention, reward strategies can be dynamically allocated for activities of any form, without relying on historical data of a specific activity form. The approach can be applied to the daily allocation and cold start of various reward activities, based on the expected per-order average strategy and the completed order quantity under it, combined with reward strategy configuration under a full-order reward. In particular, personalized reward strategies can be allocated to drivers with different order-taking capacities, which improves driver order-taking efficiency and also increases platform order volume.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention and are not intended to limit the invention.
Fig. 1 is a flowchart of a reward policy configuration method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a training process of a prediction model according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of prediction of a prediction model provided by an embodiment of the invention.
FIG. 4 is a schematic diagram of a single-volume curve for a single-averaging strategy according to an embodiment of the present invention.
Fig. 5 is a block diagram of a reward policy configuration apparatus according to an embodiment of the present invention.
Fig. 6 is a block diagram of a prediction module according to an embodiment of the present invention.
Fig. 7 is another block diagram of a reward policy configuration apparatus according to an embodiment of the present invention.
Fig. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
FIG. 9 is a block diagram illustrating an apparatus having a reward policy configuration, according to an example embodiment.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a reward policy configuration method according to an exemplary embodiment. As shown in fig. 1, the reward policy configuration method is used in a server or a terminal and includes the following steps:
in step 101, predicting the per-order average strategy obtained by each user and the completed order quantity under it;
in step 102, fitting the per-order average strategy obtained by each user against the estimated completed order quantity under it, to obtain each user's estimated completed order quantity under high per-order average strategies and each user's total reward amount at the estimated completed order quantity;
in step 103, under a budget limit, determining a target per-order average strategy for each user and the completed order quantity under it;
in step 104, selecting a full-order reward strategy for each user from a preset candidate reward strategy set according to each user's total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under it;
in step 105, distributing the selected full-order reward strategy to the corresponding user.
The reward strategy configuration method can be applied, without limitation, to terminals, servers and the like; the terminal device may be an electronic device such as a smartphone, a notebook computer or a tablet computer, without limitation.
The following describes, with reference to fig. 1, specific implementation steps of a reward policy configuration method provided in an embodiment of the present invention in detail.
First, step 101 is executed to predict the per-order average strategy obtained by each user and the completed order quantity under it.
Each user in this embodiment may be a driver or another type of user, which is not limited here. Specifically, under a full-order reward, the per-order average strategy obtained by each user, the historical order completions and strategies, and historical data within a preset time period are obtained; the obtained per-order average strategy of each user is sampled to obtain per-order average sampling data; and the per-order average sampling data, the historical order completions and strategies, and the historical data within the preset time period are input into a trained prediction model for prediction, giving the per-order average strategy obtained by each user and the completed order quantity under it.
In this embodiment, under the full-order reward, the per-order average strategy obtained by each user on the same day and the completed order quantity under it can be obtained through a prediction model, which may also be called an "expected completed order quantity" model. The prediction model first needs to be trained, and the trained model is then used to predict a driver's completed order quantity under a per-order average strategy. The training and prediction processes of the model are as follows:
1) The training process of the prediction model:
In this embodiment, variable data of each user is first obtained. The variable data may include: the user's historical order completions and strategies, the historical order completions and strategies of the user's group, the per-order average strategy actually obtained by the user's group on the same day, and so on; other data may be included as needed in practice, and this embodiment is not limited in this respect.
Then, prediction training is performed on the prediction model based on the variable data, to obtain the trained prediction model.
Performing prediction training on the prediction model based on the variable data specifically includes: inputting the historical order completions and strategies, the historical order completions and strategies of the user's group, and the per-order average strategy actually obtained by the user's group on the same day into a first-stage XGB model for prediction training, to obtain the per-order average strategy expected to be obtained by each user on the same day; and then inputting each user's expected per-order average strategy, the user's historical order completions and strategies, and the historical data within the preset time period into a second-stage XGB model for prediction training, to obtain each user's completed order quantity under the per-order average strategy, the second-stage XGB model serving as the trained prediction model. Fig. 2 is a schematic diagram of the training process of the prediction model provided by an embodiment of the present invention.
As shown in fig. 2, under the full-order reward form, the per-order average strategy actually obtained by a user (such as a driver) is coupled to the completed order quantity. Taking a reward policy a2 = [0, 7] as an example: under a2, if the driver completes 5 orders, the per-order average reward obtained is 0 yuan; if the driver completes 7 orders, the per-order average reward obtained is 7 yuan. The independent variable (the per-order average strategy) and the dependent variable (the completed order quantity) are thus mutually causal, which interferes with learning the relationship between them.
To solve this problem, this embodiment introduces instrumental variables, which may also be referred to as variable data. The instrumental variables may include: the user's (e.g., the driver's) historical order completions and strategies, the user group's actual per-order average strategy on the same day, and so on. The instrumental variables are used to predict the user's expected per-order average strategy for the day, which replaces the user's (e.g., the driver's) actual per-order average strategy. The completed order quantity is then computed from the per-order average strategy, i.e., the relationship between the per-order average strategy and the completed order quantity is obtained. The prediction through instrumental variables proceeds as follows: each sample consists of X = {driver's historical order completions and strategies, driver group's actual per-order average strategy on the day} and Y = {driver's actual per-order average strategy on the day}; a first-stage XGB model is trained to predict Y from X, the instrumental variables are input into the first-stage XGB model for training, and the first-stage XGB model outputs the user's expected per-order average strategy for the day.
After the user's expected per-order average strategy for the day is obtained, it is input, together with the historical order completions and strategies among the instrumental variables and the historical data within the preset time period, into a second-stage XGB model for prediction training, fitting the relationship between the driver's expected per-order average strategy for the day and the completed order quantity. Each sample consists of X = {driver history features, market history features, driver's predicted per-order average strategy} and Y = {completed order quantity during the activity}. The second-stage XGB model predicts Y from X, giving the user's completed order quantity under the per-order average strategy. The second-stage XGB model serves as the trained prediction model.
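As a rough illustration of the two-stage training described above, the sketch below uses the xgboost and pandas libraries. The data source, feature column names and default hyperparameters are placeholder assumptions; the patent does not specify them.

```python
import pandas as pd
import xgboost as xgb

# Hypothetical training frame: one row per driver per activity day.
df = pd.read_csv("driver_activity_samples.csv")  # placeholder data source

# Instrumental variables: driver history, group history, and the group's
# actual per-order average reward on the day (column names are assumptions).
iv_cols = ["hist_orders", "hist_reward", "group_hist_orders",
           "group_hist_reward", "group_actual_per_order_reward_today"]
hist_cols = ["hist_orders", "hist_reward", "market_hist_orders", "market_hist_reward"]

# Stage 1: predict the driver's expected per-order average reward for the day
# from the instrumental variables, replacing the endogenous actual value.
stage1 = xgb.XGBRegressor()
stage1.fit(df[iv_cols], df["actual_per_order_reward_today"])
df["expected_per_order_reward"] = stage1.predict(df[iv_cols])

# Stage 2: predict the completed order quantity during the activity from
# history features plus the stage-1 expected per-order average reward.
stage2_features = hist_cols + ["expected_per_order_reward"]
stage2 = xgb.XGBRegressor()
stage2.fit(df[stage2_features], df["completed_orders_during_activity"])
```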
2) The prediction process of the prediction model. Fig. 3 is a schematic diagram of prediction with the prediction model provided by an embodiment of the present invention:
After the prediction model (i.e., the second-stage XGB model) is trained, the trained prediction model is obtained, as shown in fig. 3. Under the full-order reward, the per-order average strategy obtained by each user, the historical order completions and strategies, and historical data within a preset time period are obtained; the obtained per-order average strategy of each user is then sampled, to obtain per-order average sampling data; the sampling order of the per-order average strategy may be from low to high, from high to low, or otherwise, and this embodiment is not limited. Finally, the per-order average sampling data, the historical order completions and strategies, and the historical data within the preset time period are input into the trained prediction model (the second-stage XGB model) for prediction, giving each user's completed order quantity under the per-order average strategy on the same day.
Next, step 102 is executed: the per-order average strategy obtained by each user is fitted against the estimated completed order quantity under it, to obtain each user's estimated completed order quantity under high per-order average strategies and each user's total reward amount at the estimated completed order quantity.
In this step, curve fitting may be performed on each user's per-order average strategy and the completed order quantity under it, to obtain a fitted curve between the per-order average strategy and the corresponding maximum completed order quantity under high per-order average strategies; each user's total reward amount at the maximum completed order quantity under the per-order average strategy is then estimated from this fitted curve.
That is, a budget allocation is to be performed by solving a linear programming problem: with the maximum total completed order quantity as the objective and under the total reward budget constraint, the optimal per-order average strategy of each driver and the completed order quantity under it are calculated, and the total reward amount is then calculated from the optimal per-order average strategy and the completed order quantity under it.
Specifically, the maximum completed order quantity corresponding to each user under the per-order average strategy is obtained from the fitted curve between each user's per-order average strategy and the corresponding maximum completed order quantity under high per-order average strategies; the product of each user's per-order average strategy and the corresponding maximum completed order quantity is then calculated, giving each user's total reward amount at the maximum completed order quantity for the day.
In other words, after the training of the prediction model is completed, a per-driver expected per-order-average-to-order-quantity model is obtained. Then, keeping each driver's other features unchanged, the per-order average strategy is sampled from low to high and the sampled values are input into the second-stage XGB model for curve fitting, yielding a two-dimensional scatter plot (per-order average strategy, completed order quantity) for each driver, from which a fitted per-order-average-to-order-quantity curve is obtained. Fig. 4 is a schematic diagram of such a per-order-average-to-order-quantity curve provided by an embodiment of the present invention.
As can be seen from fig. 4, when the per-order average exceeds 5 yuan, the driver's completed order quantity no longer changes as the per-order average increases, because most historical data has a per-order average of 5 yuan or less, so the existing model cannot determine the completed order quantity at higher per-order averages. To solve this cold-start problem under reward amounts not seen before, the embodiment of the invention performs curve fitting on each driver's two-dimensional (per-order average strategy, completed order quantity) points separately. The specific steps are as follows:
if a univariate strategy-to-univariate curve is defined to be fitted, the following equation (1) is given:
Figure BDA0003158487290000121
in the formula (1), k is a single-average strategy, a, b and c are constants, and the calculation is carried out according to the single-average strategies actually obtained by different users and the amount of the single-average strategies.
That is, a, b, c are determined by fitting a homography-to-monogram curve over a homography range of the historical data (e.g., the homography is ≦ 5). In the embodiment of the invention, even if the list-average strategy exceeds the range of the historical list-average strategy, the list-making quantity under the list-average strategy can be approximately obtained according to the list-making quantity curve chart, so that the optimal distribution of the subsequent list-average strategy becomes possible, and the optimal list-average strategy can be obtained.
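The exact form of formula (1) is published as an image; purely for illustration, the sketch below assumes a simple saturating exponential r(k) = a − b·exp(−c·k), which levels off at high per-order averages as fig. 4 describes, and fits it with scipy:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_curve(k, a, b, c):
    """Assumed saturating form standing in for the patent's formula (1):
    completed orders level off as the per-order average reward k grows."""
    return a - b * np.exp(-c * k)

# (per-order average, predicted completed orders) points from the sampled sweep,
# restricted to the historically observed range (e.g. per-order average <= 5 yuan).
k_points = np.array([k for k, _ in curve_points if k <= 5.0])
r_points = np.array([r for k, r in curve_points if k <= 5.0])

(a, b, c), _ = curve_fit(saturating_curve, k_points, r_points,
                         p0=[r_points.max(), 1.0, 0.5], maxfev=10000)

# The fitted curve extrapolates the completed order quantity to per-order
# averages above the historical range (cold start); the total reward amount
# at a high per-order average is then simply k * r(k).
high_k = 8.0
estimated_orders = saturating_curve(high_k, a, b, c)
total_reward = high_k * estimated_orders
```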
Step 103 is then executed: under the budget limit, the optimal target per-order average strategy obtained by each user and the completed order quantity under it are determined.
In this embodiment, the optimal per-order average strategy of each driver (i.e., the target per-order average strategy) and the estimated completed order quantity under it can be obtained through the following formula (2).
In this step, under the budget limit, the optimal target per-order average strategy obtained by each user and the completed order quantity under it may be determined according to a per-order average optimal allocation algorithm, implemented by the following formula (2):
max Σ_i Σ_j x_{i,j} · r_{i,j}
s.t. Σ_j x_{i,j} = 1 for every user i
Σ_i Σ_j x_{i,j} · k_j · r_{i,j} = K_0
x_{i,j} ∈ {0, 1}
In formula (2), max denotes maximizing the total completed order quantity over all users; x_{i,j} is a decision variable indicating whether the j-th target per-order average strategy is assigned to user i; r_{i,j} denotes the completed order quantity of user i under the j-th target per-order average strategy, read off the fitted curve; k_j denotes the per-order average value of the j-th per-order average strategy; K_0 denotes the total reward budget; and the constraints on the user population include: each user can be assigned only one per-order average strategy, and the predicted total reward of all users equals the per-order average budget.
The model of the per-order average optimal allocation algorithm is as follows:
Objective function: maximize the total completed order quantity over all drivers.
Decision variables: x_{i,j}, indicating whether the j-th per-order average strategy is assigned to driver i.
Constraints: 1) each driver can be assigned only one per-order average strategy; 2) the predicted total reward of all drivers equals the per-order average budget.
Substituting these settings into formula (2) and solving yields each driver's optimal per-order average strategy and the predicted completed order quantity under it.
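Formula (2) is a 0-1 assignment program, and a small instance can be solved directly with an off-the-shelf modelling library. The sketch below uses PuLP; order_curve[i][j] is assumed to already hold the fitted completed order quantity r_{i,j} of driver i at candidate per-order average k_j, and the budget is treated as an upper bound so the toy instance stays feasible (the patent states equality):

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, value

def assign_per_order_rewards(order_curve, k_levels, budget):
    """order_curve[i][j]: fitted completed orders r_ij of driver i at level j.
    k_levels[j]: candidate per-order average reward k_j.
    Returns the chosen level index for each driver."""
    n_drivers, n_levels = len(order_curve), len(k_levels)
    prob = LpProblem("per_order_reward_allocation", LpMaximize)
    x = [[LpVariable(f"x_{i}_{j}", cat=LpBinary) for j in range(n_levels)]
         for i in range(n_drivers)]

    # Objective: maximize the total completed orders over all drivers.
    prob += lpSum(x[i][j] * order_curve[i][j]
                  for i in range(n_drivers) for j in range(n_levels))
    # Each driver is assigned exactly one per-order average level.
    for i in range(n_drivers):
        prob += lpSum(x[i][j] for j in range(n_levels)) == 1
    # Total predicted reward kept within the budget.
    prob += lpSum(x[i][j] * k_levels[j] * order_curve[i][j]
                  for i in range(n_drivers) for j in range(n_levels)) <= budget

    prob.solve()
    return [next(j for j in range(n_levels) if value(x[i][j]) > 0.5)
            for i in range(n_drivers)]

# Toy usage: two drivers, candidate per-order averages of 0, 2 and 4 yuan.
chosen = assign_per_order_rewards([[3, 5, 6], [2, 4, 7]], [0, 2, 4], budget=30)
```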
Step 104 is then executed: a full-order reward strategy is selected for each user from preset candidate reward strategies according to each user's total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under it.
Specifically, in this step, the product of the target per-order average strategy obtained by each user and the completed order quantity under it is first calculated, to obtain a corresponding product result; each user's total reward amount at the estimated completed order quantity is then compared with the corresponding product results, to obtain corresponding differences; finally, the candidate reward strategy with the smallest difference is selected from the preset candidate reward strategy set as the corresponding user's full-order reward strategy.
In this step, the reward strategy whose total payout at the user's estimated (i.e., maximum) completed order quantity is closest to the product of the target per-order average strategy and the completed order quantity under it is selected as that user's full-order reward strategy.
Optionally, the embodiment of the invention further provides a way of selecting the full-order reward strategy, whose goal is: after the driver's optimal per-order average strategy and predicted completed order quantity are obtained, an optimal full-order reward scheme is selected for the driver, so that the driver is most likely to reach the optimal per-order average strategy.
For example: assume the candidate reward strategy set includes a2 and a3, where a2 = [0, 7] and a3 = [2, 0, 10]. If driver A's estimated completed order quantity is 2 and the optimal per-order average strategy is 2 yuan, then the total reward should be 2 × 2 = 4 yuan. From the candidate reward strategy set, the difference between 4 yuan and the product of the target per-order average strategy and the completed order quantity under candidate a3 is the smallest, namely 0; therefore, assigning candidate a3 to driver A makes driver A most likely to approach the total reward of 4 yuan.
Specifically, the full-order reward strategy configuration can be realized through formula (3).
min_t | s_t(r_i) − k_i · r_i |
s.t. Σ_t x_{i,t} = 1,  x_{i,t} ∈ {0, 1}
In formula (3), k_i is the optimal per-order average strategy assigned to driver i, r_i is driver i's estimated completed order quantity when assigned k_i, s_t(r_i) is driver i's estimated total reward when assigned the t-th full-order reward scheme, and x_{i,t} indicates whether full-order reward scheme t is assigned to driver i.
Specifically, the candidate reward schemes are recalled for each driver, and the full-order reward strategy whose estimated total reward at the predicted completed order quantity is closest to the product of the driver's target per-order average strategy and the predicted completed order quantity is selected as the reward strategy configured for that driver; step 105 is then executed.
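The selection in formula (3) reduces to an argmin over the candidate set. A minimal sketch, assuming the total payout s_t of each candidate scheme at the driver's predicted order count has already been estimated:

```python
def pick_full_order_scheme(candidate_totals, target_per_order_reward, predicted_orders):
    """candidate_totals[t]: estimated total payout s_t of candidate scheme t
    at the driver's predicted completed order count.
    Picks the scheme whose payout is closest to k_i * r_i."""
    target_total = target_per_order_reward * predicted_orders
    return min(range(len(candidate_totals)),
               key=lambda t: abs(candidate_totals[t] - target_total))

# Example from the text: driver A, 2 predicted orders at 2 yuan per order gives a
# target total of 4 yuan, so the candidate paying 4 yuan is chosen (difference 0).
best = pick_full_order_scheme([0, 4, 7], target_per_order_reward=2, predicted_orders=2)
```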
Finally, step 105 is executed: the full-order reward strategy selected for each user is taken as the reward strategy configured for the corresponding user.
In the embodiment of the invention, the per-order average strategy obtained by each user and the completed order quantity under it are predicted; the per-order average strategy is fitted against the estimated completed order quantity to obtain each user's estimated completed order quantity under high per-order average strategies and the total reward amount at that quantity; under a budget limit, a target per-order average strategy and the completed order quantity under it are determined for each user; a full-order reward strategy is selected for each user from a preset candidate reward strategy set according to the total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under it; and the selected full-order reward strategy is distributed to the corresponding user. That is, in the embodiment of the invention, the reward strategy can be dynamically configured for any form of activity, without relying on historical data of a specific activity form. The approach can be applied to the daily allocation and cold start of various reward activities, based on the expected per-order average strategy and the completed order quantity under it, combined with reward strategy configuration under a full-order reward. In particular, personalized reward strategies can be allocated to drivers with different order-taking capacities, improving driver order-taking efficiency and increasing platform order volume.
It is noted that while for simplicity of explanation, the method embodiments are shown as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may occur in other orders or concurrently with other steps in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are presently preferred and that no particular act is required to implement the invention.
Fig. 5 is a block diagram of a reward policy configuration apparatus according to an embodiment of the present invention. Referring to fig. 5, the apparatus includes a prediction module 501, a fitting processing module 502, a determining module 503, a selection module 504 and a distribution module 505, wherein
the prediction module 501 is configured to predict the per-order average strategy obtained by each user and the completed order quantity under it;
the fitting processing module 502 is configured to fit the per-order average strategy obtained by each user against the estimated completed order quantity under it, to obtain each user's estimated completed order quantity under high per-order average strategies and each user's total reward amount at the estimated completed order quantity;
the determining module 503 is configured to determine, under the budget limit, a target per-order average strategy for each user and the completed order quantity under it;
the selection module 504 is configured to select a full-order reward strategy for each user from preset candidate reward strategies according to each user's total reward amount at the estimated completed order quantity, the target per-order average strategy, and the completed order quantity under it;
the distribution module 505 is configured to take the full-order reward strategy selected for each user as the reward strategy configured for the corresponding user.
Optionally, in another embodiment, on the basis of the foregoing embodiment, the prediction module 501 includes a first obtaining module 601, a sampling module 602 and a prediction processing module 603, as shown schematically in fig. 6, wherein
the first obtaining module 601 is configured to obtain, under a full-order reward, the per-order average strategy obtained by each user, the historical order completions and strategies, and historical data within a preset time period;
the sampling module 602 is configured to sample the per-order average strategy of each user obtained by the first obtaining module 601, to obtain per-order average sampling data;
the prediction processing module 603 is configured to input the per-order average sampling data from the sampling module 602, together with the historical order completions and strategies and the historical data within the preset time period obtained by the first obtaining module, into a trained prediction model for prediction, to obtain the per-order average strategy obtained by each user and the completed order quantity under it.
Optionally, in another embodiment, on the basis of the above embodiment, the apparatus further includes a second obtaining module 701 and a prediction training module 702, as shown schematically in fig. 7, wherein
the second obtaining module 701 is configured to obtain variable data of each user;
the prediction training module 702 is configured to perform prediction training on the prediction model based on the variable data, to obtain the trained prediction model.
Optionally, in another embodiment, on the basis of the foregoing embodiment, the variable data obtained by the second obtaining module includes: the user's historical order completions and strategies, the historical order completions and strategies of the user's group, and the per-order average strategy actually obtained by the user's group on the same day;
the prediction training module includes a first prediction training module and a second prediction training module, wherein
the first prediction training module is configured to input the historical order completions and strategies, the historical order completions and strategies of the user's group, and the per-order average strategy actually obtained by the user's group on the same day into a first-stage XGB model for prediction training, to obtain the per-order average strategy expected to be obtained by each user on the same day;
the second prediction training module is configured to input the per-order average strategy obtained by each user, the historical order completions and strategies, and the historical data within the preset time period into a second-stage XGB model for prediction training, to obtain the completed order quantity under the per-order average strategy, the second-stage XGB model serving as the trained prediction model.
Optionally, in another embodiment, on the basis of the foregoing embodiment, the fitting processing module includes a curve fitting module, an order quantity estimation module and a first calculation module, wherein
the curve fitting module is configured to perform curve fitting on the per-order average strategy obtained by each user and the completed order quantity under it, to obtain a fitted curve between each user's per-order average strategy and the corresponding maximum completed order quantity under high per-order average strategies;
the order quantity estimation module is configured to estimate, from the fitted curve, each user's estimated completed order quantity under a high per-order average strategy;
the first calculation module is configured to calculate the product of the high per-order average strategy and each user's estimated completed order quantity under it, to obtain each user's total reward amount at the estimated completed order quantity.
Optionally, in another embodiment, on the basis of the above embodiment, the order quantity estimation module estimates each user's estimated completed order quantity under a high per-order average strategy through the following fitted curve formula:
r(k)    [formula image BDA0003158487290000171]
where r(k) is the estimated completed order quantity under a high per-order average; k is the user's per-order average strategy; and a, b and c are constants, calculated from the per-order average strategies actually obtained by different users and the completed order quantities under those strategies.
Optionally, in another embodiment, on the basis of the above embodiment, the determining module is specifically configured to determine, under the budget limit, the optimal target per-order average strategy obtained by each user and the completed order quantity under it according to a per-order average optimal allocation algorithm, implemented by the following formulas:
max Σ_i Σ_j x_{i,j} · r_{i,j}
s.t. Σ_j x_{i,j} = 1 for every user i
Σ_i Σ_j x_{i,j} · k_j · r_{i,j} = K_0
x_{i,j} ∈ {0, 1}
where max denotes maximizing the total completed order quantity over all users; x_{i,j} is a decision variable indicating whether the j-th target per-order average strategy is assigned to user i; r_{i,j} denotes the completed order quantity of user i under the j-th target per-order average strategy, read off the fitted curve; k_j denotes the per-order average value of the j-th per-order average strategy; K_0 denotes the total reward budget; and the constraints on the user population include: each user can be assigned only one per-order average strategy, and the predicted total reward of all users equals the per-order average budget.
Optionally, in another embodiment, on the basis of the above embodiment, the selection module includes a second calculation module, a comparison module and a strategy selection module, wherein
the second calculation module is configured to calculate the product of the target per-order average strategy obtained by each user and the completed order quantity under it, to obtain a corresponding product result;
the comparison module is configured to compare each user's total reward amount at the estimated completed order quantity with the corresponding product result, to obtain a corresponding difference;
the strategy selection module is configured to select, from the preset candidate reward strategies, the candidate reward strategy with the smallest difference as the corresponding user's full-order reward strategy.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment related to the method, and reference may be made to part of the description of the embodiment of the method for the relevant points, and the detailed description will not be made here.
Optionally, an electronic device in an embodiment of the present invention includes:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the reward policy configuration method as described above.
Optionally, an embodiment of the present invention further provides a computer-readable storage medium, and when executed by a processor of an electronic device, the instructions in the computer-readable storage medium enable the electronic device to execute the reward policy configuration method described above.
Optionally, an embodiment of the present invention further provides a storage medium including instructions, for example, a memory including instructions, which are executable by a processor of an apparatus to perform the method described above. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Optionally, an embodiment of the present invention further provides a computer program product, which includes a computer program or instructions, and when executed by a processor, the computer program or instructions implement the reward policy configuration method described above.
Fig. 8 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment. For example, the electronic device 800 may be a mobile terminal or a server, and in the embodiment of the present invention, the electronic device is taken as a mobile terminal as an example for description. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing state assessment of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the reward policy configuration method shown above.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the reward policy configuration method illustrated above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, in which instructions, when executed by the processor 820 of the electronic device 800, cause the electronic device 800 to perform the reward policy configuration method illustrated above.
Fig. 9 is a block diagram illustrating an apparatus 900 for reward policy configuration, according to an example embodiment. For example, the apparatus 900 may be provided as a server. Referring to fig. 9, the apparatus 900 includes a processing component 922, which further includes one or more processors, and memory resources, represented by memory 932, for storing instructions, such as applications, that may be executed by the processing component 922. The application programs stored in the memory 932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 922 is configured to execute the instructions to perform the reward policy configuration method described above.
The apparatus 900 may also include a power component 926 configured to perform power management of the apparatus 900, a wired or wireless network interface 950 configured to connect the apparatus 900 to a network, and an input/output (I/O) interface 958. The apparatus 900 may operate based on an operating system stored in the memory 932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes can be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (19)

1. A reward policy configuration method, comprising:
predicting an order-average strategy obtained by each user and an order quantity under the order-average strategy;
fitting the order-average strategy obtained by each user and the predicted order quantity under the order-average strategy, to obtain an estimated order quantity of each user under a high order-average strategy and a total strategy amount of each user at the estimated order quantity;
under a budget limit, determining a target order-average strategy of each user and an order quantity under the target order-average strategy;
selecting a full-order reward strategy for each user from a preset candidate reward strategy set according to the total strategy amount of each user at the estimated order quantity, the target order-average strategy and the order quantity under the target order-average strategy;
and taking the full-order reward strategy selected for each user as the reward strategy configured for the corresponding user.
2. The reward strategy configuration method of claim 1, wherein the predicting of the order-average strategy obtained by each user and the order quantity under the order-average strategy comprises:
under the condition of a full-order reward, acquiring the order-average strategy obtained by each user, a historical order strategy and historical data within a preset time period;
sampling the acquired order-average strategy of each user to obtain order-average sampling data;
and inputting the order-average sampling data, the historical order strategy and the historical data within the preset time period into a trained prediction model for prediction processing, to obtain the order-average strategy obtained by each user and the order quantity under the order-average strategy.
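By way of a non-limiting illustration, the sampling and prediction step of claim 2 might be implemented as in the following minimal Python sketch. The feature layout, the synthetic training data and the use of a scikit-learn gradient-boosted regressor as a stand-in for the trained prediction model are all assumptions made for this sketch and are not taken from the original filing.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic history used only so the sketch has a fitted stand-in model;
# columns are [order-average value, historical completed orders, historical reward].
X_hist = rng.uniform(0, 50, size=(500, 3))
y_hist = 2.0 * np.log1p(X_hist[:, 0]) + 0.1 * X_hist[:, 1] + rng.normal(0, 0.2, 500)
model = GradientBoostingRegressor().fit(X_hist, y_hist)

def sample_and_predict(user_order_average, hist_orders, hist_reward, n_samples=20):
    # Sample candidate order-average values around the user's observed value
    # (the "order-average sampling data" of claim 2) ...
    sampled_k = rng.uniform(0.5 * user_order_average, 1.5 * user_order_average, n_samples)
    # ... and predict the order quantity for each sample together with the
    # user's historical features.
    features = np.column_stack([
        sampled_k,
        np.full(n_samples, hist_orders),
        np.full(n_samples, hist_reward),
    ])
    return sampled_k, model.predict(features)

k_samples, r_predicted = sample_and_predict(user_order_average=8.0, hist_orders=25, hist_reward=120.0)
print(list(zip(k_samples.round(2), r_predicted.round(2)))[:3])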
3. A reward strategy configuration method according to claim 2, characterized in that the method further comprises:
acquiring variable data of each user;
and carrying out prediction training on the prediction model based on the variable data to obtain a trained prediction model.
4. A reward strategy configuration method according to claim 3, characterized in that the variable data comprise: historical completed orders and strategies of the user, historical completed orders and strategies of the user group, and the order-average strategy actually obtained by the user group on the same day;
the performing prediction training on the prediction model based on the variable data to obtain the trained prediction model comprises:
inputting the historical completed orders and strategies, the historical completed orders and strategies of the user group, and the order-average strategy actually obtained by the user group on the same day into a first-stage XGB model for prediction training processing, to obtain the order-average strategy obtained by each user on the same day;
and inputting the order-average strategy obtained by each user, the historical order strategy and the historical data within the preset time period into a second-stage XGB model for prediction training processing, to obtain the order quantity under the order-average strategy, wherein the second-stage XGB model is used as the trained prediction model.
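A minimal sketch of the two-stage training described in claim 4, using the xgboost package. The synthetic data, the feature ordering and the hyperparameters are assumptions for illustration only; the filing specifies only the inputs of each stage and that both stages are XGB models.

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-ins for the variables named in claim 4 (assumed shapes and semantics).
hist_orders      = rng.uniform(5, 40, n)     # user's historical completed orders
hist_reward      = rng.uniform(50, 300, n)   # user's historical reward strategy
group_orders     = rng.uniform(10, 30, n)    # user group's historical completed orders
group_reward     = rng.uniform(80, 250, n)   # user group's historical reward strategy
group_same_day_k = rng.uniform(3, 12, n)     # order-average actually obtained by the group that day
same_day_k       = group_same_day_k + rng.normal(0, 0.5, n)          # label for stage 1
order_quantity   = 3 * np.log1p(same_day_k) + 0.2 * hist_orders      # label for stage 2

# Stage 1: predict the order-average strategy each user obtains on the day.
X1 = np.column_stack([hist_orders, hist_reward, group_orders, group_reward, group_same_day_k])
stage1 = XGBRegressor(n_estimators=200, max_depth=4).fit(X1, same_day_k)

# Stage 2: predict the order quantity under that order-average strategy.
k_hat = stage1.predict(X1)
X2 = np.column_stack([k_hat, hist_orders, hist_reward])
stage2 = XGBRegressor(n_estimators=200, max_depth=4).fit(X2, order_quantity)

print("predicted order quantity for first user:", stage2.predict(X2[:1])[0])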
5. The reward strategy configuration method according to claim 1, wherein the fitting of the order-average strategy obtained by each user and the predicted order quantity under the order-average strategy, to obtain the estimated order quantity of each user under the high order-average strategy and the total strategy amount of each user at the estimated order quantity, comprises:
performing curve fitting on the order-average strategy obtained by each user and the order quantity under the order-average strategy, to obtain a fitting curve of each user's order-average strategy and the corresponding maximum order quantity under the high order-average strategy;
estimating, according to the fitting curve, the estimated order quantity of each user under the high order-average strategy;
and calculating the product of the high order-average strategy and the estimated order quantity of each user under the high order-average strategy, to obtain the total strategy amount of each user at the estimated order quantity.
6. The reward strategy configuration method of claim 5, wherein the estimated order quantity of each user under the high order-average strategy is predicted by the following fitting curve formula:
[fitting curve formula r(k), published as image FDA0003158487280000021 in the original document]
wherein r(k) is the estimated order quantity under the high order-average strategy; k is the order-average strategy of each user; a, b and c are constants, calculated from the order-average strategies actually obtained by different users and the order quantities under those order-average strategies.
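The fitting curve formula itself appears only as an image in the published text, so its exact functional form is not reproduced here. The sketch below assumes, purely for illustration, a saturating form r(k) = a - b*exp(-c*k) and fits the constants a, b and c with scipy.optimize.curve_fit from observed (order-average, order quantity) pairs; the true formula in the filing may differ.

import numpy as np
from scipy.optimize import curve_fit

def response(k, a, b, c):
    # Assumed saturating shape: order quantity rises with the order-average
    # strategy k and levels off. The actual formula in the filing is not
    # reproduced here (it is published only as an image).
    return a - b * np.exp(-c * k)

# Observed (order-average strategy, order quantity) pairs for one user (illustrative).
k_obs = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
r_obs = np.array([3.1, 5.0, 6.2, 6.9, 7.3])

(a, b, c), _ = curve_fit(response, k_obs, r_obs, p0=(8.0, 6.0, 0.2), maxfev=10000)

k_high = 14.0                        # a "high" order-average strategy to extrapolate to
r_high = response(k_high, a, b, c)   # estimated order quantity under the high strategy
total_strategy = k_high * r_high     # total strategy amount at the estimated order quantity (claim 5)
print(round(r_high, 2), round(total_strategy, 2))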
7. The reward strategy configuration method of claim 1, wherein the determining, under the budget limit, of the target order-average strategy of each user and the order quantity under the target order-average strategy comprises:
under the budget limit, determining the optimal target order-average strategy obtained by each user and the order quantity under the target order-average strategy according to an order-average optimal allocation algorithm, wherein the order-average optimal allocation algorithm is implemented by the following formulas:
[order-average optimal allocation formulas (objective and constraints), published as images FDA0003158487280000031 to FDA0003158487280000034 in the original document]
where max represents maximizing the total order quantity of all users; x_{i,j} is a decision variable representing whether the jth target order-average strategy is allocated to user i; r_{i,j} represents the order quantity of user i under the jth target order-average strategy, found through the fitting curve; k_j represents the order-average value of the jth order-average strategy; k_0 represents the order-average budget; s.t. represents the constraints on the user population, including: each user can only be assigned one order-average strategy, and the predicted overall order-average of all users is equal to the order-average budget.
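The allocation formulas of claim 7 are likewise published as images, so only the variable descriptions above are available. As one possible reading, the sketch below casts the problem as a 0-1 program with PuLP: each user receives exactly one order-average strategy, the total predicted order quantity is maximized, and the average spend per completed order is kept within an order-average budget k_0. Writing the budget condition as an inequality (rather than the equality described above) and the small numeric values are assumptions made for this sketch.

import pulp

# r[i][j]: predicted order quantity of user i under candidate order-average k[j]
# (in practice these come from the fitted curves; small made-up values here).
k = [4.0, 6.0, 8.0]
r = [[5.0, 6.5, 7.2],
     [3.0, 4.8, 5.5],
     [6.1, 7.0, 7.4]]
k0 = 6.5  # assumed order-average budget (average reward spend per completed order)

users, strategies = range(len(r)), range(len(k))
prob = pulp.LpProblem("order_average_allocation", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", [(i, j) for i in users for j in strategies], cat="Binary")

# Objective: maximize the total predicted order quantity over all users.
prob += pulp.lpSum(x[(i, j)] * r[i][j] for i in users for j in strategies)

# Each user is assigned exactly one target order-average strategy.
for i in users:
    prob += pulp.lpSum(x[(i, j)] for j in strategies) == 1

# Average spend per completed order stays within the order-average budget k0
# (linearized form of sum(k_j * r_ij * x_ij) / sum(r_ij * x_ij) <= k0).
prob += (pulp.lpSum(x[(i, j)] * k[j] * r[i][j] for i in users for j in strategies)
         <= pulp.lpSum(x[(i, j)] * r[i][j] for i in users for j in strategies) * k0)

prob.solve(pulp.PULP_CBC_CMD(msg=0))
assignment = {i: j for i in users for j in strategies if x[(i, j)].value() > 0.5}
print(assignment)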
8. The reward strategy configuration method according to claim 1, wherein the selecting of the full-order reward strategy for each user from the preset candidate reward strategy set according to the total strategy amount of each user at the estimated order quantity, the target order-average strategy and the order quantity under the target order-average strategy comprises:
calculating the product of the target order-average strategy obtained by each user and the order quantity under the target order-average strategy, to obtain a corresponding product result;
comparing the total strategy amount of each user at the estimated order quantity with the corresponding product result, to obtain a corresponding difference value;
and selecting the candidate reward strategy with the minimum difference value from the preset candidate reward strategy set as the full-order reward strategy of the corresponding user.
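A minimal sketch of the selection step in claim 8, under one possible reading: each preset candidate full-order reward strategy implies a total strategy amount at the user's estimated order quantity, and the candidate whose amount is closest to the product of the target order-average strategy and its order quantity is configured for the user. The candidate names and numeric values below are purely illustrative, not taken from the filing.

# Target order-average strategy and predicted order quantity for one user
# (these would come from the budget-constrained allocation step).
k_target, r_target = 8.0, 7.2
product_result = k_target * r_target   # product of the target order-average strategy and its order quantity

# Preset candidate full-order reward strategies, summarized here by an implied
# total strategy amount at the user's estimated order quantity (illustrative).
candidate_totals = {"reward_A": 50.0, "reward_B": 58.0, "reward_C": 65.0}

# The candidate whose total is closest to the product result is configured
# as the user's full-order reward strategy.
differences = {name: abs(total - product_result) for name, total in candidate_totals.items()}
chosen = min(differences, key=differences.get)
print(chosen, differences[chosen])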
9. A reward policy configuration apparatus, comprising:
the prediction module is used for predicting the order-average strategy obtained by each user and the order quantity under the order-average strategy;
the fitting processing module is used for fitting the order-average strategy obtained by each user and the predicted order quantity under the order-average strategy, to obtain the estimated order quantity of each user under the high order-average strategy and the total strategy amount of each user at the estimated order quantity;
the determining module is used for determining, under the budget limit, the target order-average strategy of each user and the order quantity under the target order-average strategy;
the selection module is used for selecting a full-order reward strategy for each user from the preset candidate reward strategy set according to the total strategy amount of each user at the estimated order quantity, the target order-average strategy and the order quantity under the target order-average strategy;
and the reward strategy determination module is used for taking the full-order reward strategy selected for each user as the reward strategy configured for the corresponding user.
10. The reward strategy configuration device of claim 9, wherein the prediction module comprises:
the first acquisition module is used for acquiring, under the condition of a full-order reward, the order-average strategy obtained by each user, the historical order strategy and the historical data within the preset time period;
the sampling module is used for sampling the order-average strategy of each user acquired by the first acquisition module, to obtain order-average sampling data;
and the prediction processing module is used for inputting the order-average sampling data obtained by the sampling module, and the historical order strategy and the historical data within the preset time period acquired by the first acquisition module, into the trained prediction model for prediction processing, to obtain the order-average strategy obtained by each user and the order quantity under the order-average strategy.
11. The reward policy configuration device of claim 10, wherein the device further comprises:
the second acquisition module is used for acquiring variable data of each user;
and the prediction training module is used for performing prediction training on the prediction model based on the variable data to obtain a trained prediction model.
12. The reward policy configuration device according to claim 11, wherein the variable data acquired by the second acquisition module include: historical completed orders and strategies of the user, historical completed orders and strategies of the user group, and the order-average strategy actually obtained by the user group on the same day;
the predictive training module includes:
the first prediction training module is used for inputting the historical completed orders and strategies, the historical completed orders and strategies of the user group, and the order-average strategy actually obtained by the user group on the same day into the first-stage XGB model for prediction training processing, to obtain the order-average strategy obtained by each user on the same day;
and the second prediction training module is used for inputting the order-average strategy obtained by each user, the historical order strategy and the historical data within the preset time period into the second-stage XGB model for prediction training processing, to obtain the order quantity under the order-average strategy, wherein the second-stage XGB model is used as the trained prediction model.
13. The reward policy configuration device of claim 9, wherein the fitting processing module comprises:
the curve fitting module is used for performing curve fitting on the order-average strategy obtained by each user and the order quantity under the order-average strategy, to obtain a fitting curve of each user's order-average strategy and the corresponding maximum order quantity under the high order-average strategy;
the order quantity estimation module is used for estimating, according to the fitting curve, the estimated order quantity of each user under the high order-average strategy;
and the first calculation module is used for calculating the product of the high order-average strategy and the estimated order quantity of each user under the high order-average strategy, to obtain the total strategy amount of each user at the estimated order quantity.
14. The reward strategy configuration device of claim 13, wherein the order quantity estimation module estimates the estimated order quantity of each user under the high order-average strategy by using the following fitting curve formula:
[fitting curve formula r(k), published as image FDA0003158487280000051 in the original document]
wherein r(k) is the estimated order quantity under the high order-average strategy; k is the order-average strategy of each user; a, b and c are constants, calculated from the order-average strategies actually obtained by different users and the order quantities under those order-average strategies.
15. The reward policy configuration device of claim 9,
the determining module is specifically configured to determine, under the budget limit, the optimal target order-average strategy obtained by each user and the order quantity under the target order-average strategy according to an order-average optimal allocation algorithm, wherein the order-average optimal allocation algorithm is implemented by the following formulas:
[order-average optimal allocation formulas (objective and constraints), published as images FDA0003158487280000061 to FDA0003158487280000064 in the original document]
where max represents maximizing the total order quantity of all users; x_{i,j} is a decision variable representing whether the jth target order-average strategy is allocated to user i; r_{i,j} represents the order quantity of user i under the jth target order-average strategy, found through the fitting curve; k_j represents the order-average value of the jth order-average strategy; k_0 represents the order-average budget; s.t. represents the constraints on the user population, including: each user can only be assigned one order-average strategy, and the predicted overall order-average of all users is equal to the order-average budget.
16. The reward policy configuration device of claim 9, wherein the selection module comprises:
the second calculation module is used for calculating the product of the target order-average strategy obtained by each user and the order quantity under the target order-average strategy, to obtain a corresponding product result;
the comparison module is used for comparing the total strategy amount of each user at the estimated order quantity with the corresponding product result, to obtain a corresponding difference value;
and the strategy selection module is used for selecting the candidate reward strategy with the minimum difference value from the preset candidate reward strategy set as the full-order reward strategy of the corresponding user.
17. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a reward policy configuration method according to any of claims 1 to 8.
18. A computer readable storage medium, wherein instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform a reward policy configuration method of any of claims 1 to 8.
19. A computer program product comprising a computer program or instructions which, when executed by a processor, implements a reward policy configuration method as claimed in any one of claims 1 to 8.
CN202110785290.5A 2021-07-12 2021-07-12 Reward strategy configuration method, device, electronic equipment, storage medium and product Pending CN115689130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785290.5A CN115689130A (en) 2021-07-12 2021-07-12 Reward strategy configuration method, device, electronic equipment, storage medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110785290.5A CN115689130A (en) 2021-07-12 2021-07-12 Reward strategy configuration method, device, electronic equipment, storage medium and product

Publications (1)

Publication Number Publication Date
CN115689130A true CN115689130A (en) 2023-02-03

Family

ID=85044861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785290.5A Pending CN115689130A (en) 2021-07-12 2021-07-12 Reward strategy configuration method, device, electronic equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN115689130A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109349A (en) * 2023-04-10 2023-05-12 北京白驹易行科技有限公司 Network vehicle operation force excitation method, device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination