CN111027676A - Target user selection method and device - Google Patents

Target user selection method and device

Info

Publication number
CN111027676A
CN111027676A
Authority
CN
China
Prior art keywords
user
target
sample
users
curve
Prior art date
Legal status
Granted
Application number
CN201911194019.3A
Other languages
Chinese (zh)
Other versions
CN111027676B (en)
Inventor
李晨晨
阎翔
乔俊龙
屈超
熊君武
宋乐
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911194019.3A priority Critical patent/CN111027676B/en
Publication of CN111027676A publication Critical patent/CN111027676A/en
Application granted granted Critical
Publication of CN111027676B publication Critical patent/CN111027676B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0207 Discounts or incentives, e.g. coupons or rebates

Abstract

The embodiments of this specification provide a target user selection method and device. The method includes: for each user in a user group to be selected, inputting the user characteristics of the user into a pre-trained policy decision network to obtain an operation reward value, predicted and output by the policy decision network, corresponding to a target business operation, where the operation reward value represents the predicted net lift response after the target business operation is performed on the user; and selecting, according to the operation reward values of the users in the user group to be selected, users whose operation reward values meet a screening condition as the target users.

Description

Target user selection method and device
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method and an apparatus for selecting a target user.
Background
The following situation is often encountered in marketing: a marketing department conducts extensive research and, believing it has grasped the users' characteristics, selects users as marketing targets based on those characteristics. Yet the results after the campaign are disappointing: there is no significant difference in net lift response between the treatment group (users who received the marketing) and the control group (users who did not). This happens because no distinction was made between users who can be influenced by marketing and users who cannot. The net lift response should be maximized by identifying and marketing to the users who can be influenced by marketing, i.e., the users whose response under the marketing condition differs significantly from their response under the no-marketing condition.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a method and apparatus for selecting a target user.
Specifically, one or more embodiments of the present disclosure are implemented by the following technical solutions:
In a first aspect, a method for selecting a target user is provided, where the method is used to select some users from a group of users to be selected as target users so as to perform a target business operation on them; the method includes:
for each user in the user group to be selected, performing the following processing: inputting the user characteristics of the user into a pre-trained policy decision network to obtain an operation reward value, predicted and output by the policy decision network, corresponding to the target business operation, where the operation reward value represents the predicted net lift response after the target business operation is performed on the user;
and selecting, according to the operation reward values of the users in the user group to be selected, users whose operation reward values meet a screening condition as the target users.
In a second aspect, a target user selection apparatus is provided, where the apparatus is used to select some users from a group of users to be selected as target users so as to perform a target business operation on them; the apparatus includes:
a prediction output module, configured to perform the following processing for each user in the user group to be selected: inputting the user characteristics of the user into a pre-trained policy decision network to obtain an operation reward value, predicted and output by the policy decision network, corresponding to the target business operation, where the operation reward value represents the predicted net lift response after the target business operation is performed on the user;
and a user selection module, configured to select, according to the operation reward values of the users in the user group to be selected, users whose operation reward values meet a screening condition as the target users.
In a third aspect, an electronic device is provided that includes a memory and a processor, the memory storing computer instructions executable on the processor; the processor is configured to implement the target user selection method according to any embodiment of this specification when executing the computer instructions.
In the target user selection method and device of one or more embodiments of this specification, a policy decision network is used to predict the net lift response obtained when a business operation is performed on a user, so that the final target users can be selected from a user group according to the operation reward values output by the network, yielding a better response effect. In addition, the method has good generalization capability and extensibility, and the policy decision network is also applicable when selecting target users from user groups other than the training sample set.
Drawings
In order to more clearly illustrate the technical solutions in one or more embodiments of this specification or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments described in one or more embodiments of this specification, and other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a reinforcement learning network training scheme provided in at least one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a cumulative gain difference provided in at least one embodiment of the present description;
FIG. 3 is a schematic diagram of a deep neural network training provided in at least one embodiment of the present disclosure;
FIG. 4 is a training process of a deep neural network provided in at least one embodiment of the present description;
FIG. 5 is a flow diagram of a target user selection provided in at least one embodiment of the present description;
FIG. 6 illustrates a configuration of a target user selection device according to at least one embodiment of the present disclosure;
FIG. 7 is a structure of a target user selection device according to at least one embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in one or more embodiments of the present disclosure, the technical solutions in one or more embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in one or more embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all embodiments. All other embodiments that can be derived by one of ordinary skill in the art from one or more embodiments of the disclosure without making any creative effort shall fall within the scope of the disclosure.
The embodiments of this specification provide a method for selecting target users, where target user selection means selecting some users from a group of users to be selected as target users so that a certain business operation can be performed on them.
Taking a marketing scenario as an example, the group of users to be selected may contain 300 people, from whom 120 are to be selected for marketing promotion; for example, 120 prizes (say, a specific gift delivered when product P is purchased) or coupons may be issued to encourage users to purchase product P. These 120 people may be referred to as target users, to whom the prize or coupon is issued, i.e., on whom the business operation is performed (the benefit is applied). Moreover, the 120 people selected should be "marketing-sensitive" users, i.e., those most likely to be significantly driven and influenced by the marketing campaign.
The method of the embodiments of this specification describes how to select target users from the group of users to be selected such that a better effect is obtained after the target business operation is performed on them. The method trains an uplift model (also called a net lift model) by means of reinforcement learning; the net lift model can be deployed in a reinforcement learning Agent, and the Agent outputs, according to the net lift model, the basis for selecting target users (for example, "marketing-sensitive" users).
As shown in FIG. 1, which illustrates the network training mode of reinforcement learning, an Agent selects an action to execute on an environment; after receiving the action, the environment changes its state and returns a reward to the Agent. The Agent generates the next action according to the reward and the current state; the reward guides the Agent's choice of action, and the Agent continually adjusts its action selection so that the rewards produced by the environment keep increasing as the selected actions act on the environment. After many iterations, the Agent learns an action policy that yields higher environment rewards.
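The interaction loop of FIG. 1 can be summarized in a minimal Python sketch; the `Agent` and `Environment` objects and their method names are hypothetical placeholders used only to show the control flow, not part of the disclosed method:

```python
# Minimal sketch of the reinforcement learning loop of FIG. 1.
# `agent` and `env` are hypothetical objects; only the control flow matters here.
def reinforcement_learning_loop(agent, env, num_rounds):
    state = env.reset()                      # initial state (e.g. user characteristics)
    for _ in range(num_rounds):
        action = agent.select_action(state)  # Agent picks an action to apply to the environment
        state, reward = env.step(action)     # environment updates its state and returns a reward
        agent.update(reward, state)          # Agent adjusts its action-selection policy
```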
With reference to FIG. 1, and specifically for the target user application scenario of the embodiments of this specification, taking the selection of marketing users as an example, the elements of reinforcement learning may be defined as follows:
Illustratively, still taking the above example of selecting 120 people as target users from the 300 users in the group to be selected: the selected 120 people are issued prizes (a specific gift sent upon purchasing product P) to encourage them to purchase product P, and no prize is issued to the remaining 180 people. Accordingly:
Action: two actions may be included, action-1 and action-2, where action-1 is "issue a prize" and action-2 is "do not issue a prize".
State: the user characteristics of an individual user. For example, for any one of the 300 people, the characteristics may include the user's gender, age, geographic location information (e.g., the user's city), merchandise purchase history data, and so on. The characteristics may be chosen freely, as long as they may affect the prediction of the user's purchasing behavior; this embodiment does not limit which characteristics are used.
Agent: receives the state and predicts and outputs a net lift response prediction value after the target business operation is performed on the user. In an actual implementation, the net lift response prediction value need not be the true net lift response value; it may be a representative value that reflects the magnitude of the net lift response. For example, it may be represented by a probability value, where a higher probability indicates a larger net lift response. For instance, when the user characteristics of User1 (one of the 300 people) are input into the Agent, the Agent may predict a 90% probability of "issue a prize" for that user, while the "issue a prize" probability for another user, User2, is 70%; since 90% is greater than 70%, issuing a prize to User1 is expected to bring a larger net lift response than issuing a prize to User2.
It should be noted that the Agent may predict the probability of each action for each user: for example, the user characteristics of User1 are input into the Agent to predict the probabilities of action-1 and action-2 for User1, and the user characteristics of User2 are input into the Agent to predict the probabilities of action-1 and action-2 for User2.
As above, each action corresponds to a business operation; the probabilities of action-1 and action-2 are the operation reward values of the respective business operations and serve as the basis for the subsequent selection of target users. In this embodiment, the two business operations "issue a prize" and "do not issue a prize" are used as examples, but the implementation is not limited to this, and there may be more business operations. Likewise, the number of actions is not limited to two; there may be at least one target business operation (e.g., action-1 above) and at least one other business operation (e.g., action-2 above).
Environment: in this embodiment, the environment can be understood as performing, on the basis of the net lift response prediction values that the Agent predicts for action-1 and action-2 of each user, the following two kinds of processing: first, determining target users and non-target users; second, obtaining the cumulative gain difference. The following description takes the case where the net lift response prediction value is a probability value as an example.
Determining target users and non-target users: action-1 ("issue a prize") is determined to be the target business operation, and the probability of action-1 can be taken as the target operation reward value. Target users are then selected according to the action-1 probability of each user in the group of 300 users to be selected. For example, the users' action-1 probabilities may be sorted from high to low and some users selected as target users according to the sorting result, e.g., the users ranked within a preset proportion (say, the top 20%). As another example, a probability threshold may be set, and users whose action-1 probability exceeds the threshold become target users. Once the target users are determined, the remaining users in the group to be selected are called non-target users.
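A minimal sketch of this target / non-target split is given below; the function name and the NumPy-based implementation are illustrative assumptions, since the specification only requires ranking by the predicted probability or applying a threshold:

```python
import numpy as np

def select_target_users(user_ids, action1_probs, top_share=None, threshold=None):
    """Split a candidate group into target / non-target users by the predicted
    action-1 ("issue a prize") probability, using either a top share
    (e.g. the top 20%) or a probability threshold."""
    user_ids = np.asarray(user_ids)
    action1_probs = np.asarray(action1_probs)
    if top_share is not None:
        order = np.argsort(-action1_probs)          # sort users from high to low probability
        target_idx = order[:int(len(user_ids) * top_share)]
    else:
        target_idx = np.where(action1_probs > threshold)[0]
    target_mask = np.zeros(len(user_ids), dtype=bool)
    target_mask[target_idx] = True
    return user_ids[target_mask], user_ids[~target_mask]   # (target users, non-target users)
```

For the running example, `select_target_users(ids, probs, top_share=0.4)` would pick 120 of the 300 candidates as target users and leave the remaining 180 as non-target users.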
Obtaining the cumulative gain difference: after target users and non-target users have been distinguished within the group of users to be selected, the cumulative gain difference can be calculated. The process of obtaining the cumulative gain difference is described in detail as follows:
For example, still taking the group of 300 users to be selected as an example, 120 target users are selected to be issued prizes, and the remaining 180 are non-target users. In this embodiment, the data used to calculate the cumulative gain difference is the "response value" in the users' historical information. The cumulative gain difference is used in the training phase of the policy decision network, and each training sample in the training sample set used to train the network may include: the user characteristics of a sample user, and the response value after the target business operation is performed on that sample user. For example, one training sample may be {user characteristics of User1, response value of User1 after being issued a prize}, and another may be {user characteristics of User2, response value of User2 after being issued a prize}. Every training sample in the set includes such a response value, and these response values may be user history data collected during previous marketing campaigns.
The response value may be defined, for example, as "1 if the product is purchased, 0 if it is not", but this is merely an example and the implementation is not limited to it; other values, such as the user's spending amount (e.g., 1,000 dollars spent versus 200 dollars spent), may also be used as the response value.
Then, taking the target user group of 120 users as an example, the top 10% of users (10% may be called a user proportion) may be extracted according to the predicted action-1 probability: after the users are sorted from high to low by the probability output by the network, the users at the front of the ranking are taken; because their probabilities are high, their expected net lift response is considered high and they belong to the "marketing-sensitive" users. The "response values" of these users are accumulated, e.g., 25% + 12% + ..., to obtain an accumulated sum. Referring to FIG. 2, which is a schematic diagram of the cumulative gain difference: the abscissa of FIG. 2 is the user proportion and the ordinate is the accumulated sum corresponding to that proportion. For example, "the 10% user proportion and its corresponding accumulated sum" in the example above corresponds to curve sample point 21 in FIG. 2, whose abscissa is 10% and whose ordinate is the corresponding accumulated sum.
Similarly, a 20% user proportion of the 120-person target group may be extracted, and the "response values" of those users accumulated to obtain another accumulated sum, corresponding to curve sample point 22 in FIG. 2. Other user proportions can be extracted to obtain further curve sample points. Fitting the curve sample points corresponding to the multiple user proportions yields the first curve or the second curve shown in FIG. 2. The first curve, the upper one in FIG. 2, is obtained by extracting different proportions from the target user group and accumulating; it is fitted from the curve sample points corresponding to the target user set. The second curve is obtained by extracting different proportions from the non-target user group and accumulating, and is fitted from the curve sample points corresponding to the non-target user group. The second curve differs from the first in that its users are extracted at each proportion at random, whereas the first curve extracts users in order of their action-1 probability (for example, taking the top 20% of users in the probability ranking).
As shown in FIG. 2, the first curve and the second curve share a curve start point 23 and a curve end point 24, and the area of the region enclosed between the two curves is calculated; this area is the value of the cumulative gain difference, which may be referred to as the AUCC (Area Under the Uplift Curve).
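A minimal NumPy sketch of this cumulative gain difference is shown below. The particular grid of user proportions, the trapezoidal-rule area approximation, and the omission of any normalization that would force the two curves to share their end point exactly (as depicted in FIG. 2) are assumptions made for illustration:

```python
import numpy as np

def cumulative_gain_curve(responses, ordering, fractions):
    """One curve sample point per user proportion: the accumulated response
    value of the first `fraction` of users taken in the given ordering."""
    ordered = np.asarray(responses, dtype=float)[ordering]
    return np.array([ordered[:max(1, int(len(ordered) * f))].sum() for f in fractions])

def cumulative_gain_difference(target_responses, target_probs, nontarget_responses,
                               num_points=10, seed=0):
    """Area enclosed between the first curve (target users ordered by predicted
    action-1 probability) and the second curve (non-target users in random
    order), approximated with the trapezoidal rule; this is the AUCC reward."""
    fractions = np.linspace(0.1, 1.0, num_points)
    rng = np.random.default_rng(seed)

    order_by_prob = np.argsort(-np.asarray(target_probs))       # high to low probability
    random_order = rng.permutation(len(nontarget_responses))    # random extraction

    first_curve = cumulative_gain_curve(target_responses, order_by_prob, fractions)
    second_curve = cumulative_gain_curve(nontarget_responses, random_order, fractions)

    # Area of the region between the two curves over the user-proportion axis.
    return float(np.trapz(first_curve - second_curve, fractions))
```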
Reward: the environment returns the calculated cumulative gain difference to the Agent as the reward, so that the Agent can adjust its network parameters according to the reward and obtain updated operation reward values for the actions in the next round of training. For example, the action-1 probability of User1 may be 90% in one iteration and may be updated to 75% in the next iteration.
As described above, once the updated operation reward values of the actions are fed back to the environment, the environment's selection of target users and the resulting cumulative gain difference change accordingly, and the reward subsequently received by the Agent changes in turn.
In this embodiment, the Agent may use a deep neural network to predict and output the probability of each business operation for a user from the received user state. As illustrated in FIG. 3, the input of the deep neural network is a state (e.g., the user characteristics of a certain user), and the predicted output of the network is an operation reward value for each business operation.
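The specification does not fix the network architecture beyond "state in, one operation reward value per business operation out"; the following PyTorch sketch, with assumed layer sizes and a softmax output, is one possible instantiation:

```python
import torch
import torch.nn as nn

class PolicyDecisionNetwork(nn.Module):
    """Maps a state (user characteristics) to one operation reward value per
    business operation. Hidden sizes and depth are illustrative assumptions."""
    def __init__(self, num_features, num_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, user_features):
        logits = self.body(user_features)
        # Probabilities over the actions, e.g. action-1 ("issue a prize") and action-2.
        return torch.softmax(logits, dim=-1)
```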
Training the deep neural network by reinforcement learning is described as follows. During training, the cumulative gain difference AUCC produced by the environment is returned to the deep neural network as the reward to guide its training; the network goes through multiple iterations, and its network parameters are adjusted so that the AUCC gradually increases.
The training process of the network comprises the following steps:
referring to fig. 4, a training process of a deep neural network, which may also be referred to as a policy decision network, is illustrated in one example. As shown in fig. 4, the training process may include:
in step 400, the user characteristics of each sample user in the training sample set are input into the policy decision network to be trained, respectively.
In this step, a training sample set may be obtained, where each training sample in the training sample set includes: user characteristics of a sample user, and a response value after performing a target business operation for the sample user. For example, the training sample set may include User characteristics of User1, User characteristics of User2, and the like, each User may be a sample User, and the User characteristics may include characteristics of age, gender, and the like of the User.
For each sample user in the training sample set, the user characteristics of each sample user can be input into the strategy decision network, and the output corresponding to each sample user is predicted and output.
In step 402, an operation reward value corresponding to each service operation in the service operation set corresponding to the sample user and predicted and output by the policy decision network is obtained.
For example, taking the example of executing a marketing strategy to a user (e.g., distributing a prize to the user), the "distributing a prize" may be used as one business operation, and the "not distributing a prize" may be used as another business operation; and take the example that the operational reward value is a probability value.
For each sample User, the policy decision network may predict and output two probabilities corresponding to the sample User, where one is a probability of "distributing a prize" and the other is a probability of "not distributing a prize", for example, the probability of "distributing a prize" corresponding to the User1 is 90%, and a larger probability value indicates a higher expected net promotion response after distributing a prize to the User.
For each sample user in the training sample set, the above two probabilities can be obtained separately.
In step 404, according to the operation reward value of each sample user in the training sample set, sample users whose operation reward values meet the screening condition are selected as target sample users.
In this embodiment, for example, if "issue a prize" is the target business operation, the probability of "issue a prize" may be called the target operation reward value. The "issue a prize" probabilities of the sample users in the training sample set may be ranked from high to low, and the sample users within a preset number of top ranks may be selected as target sample users.
In step 406, sub-user groups at multiple user proportions are selected according to the operation reward values, and, for each user proportion, the response values of the sample users in the sub-user group after the target business operation is performed on them are accumulated to obtain the accumulated sum corresponding to that user proportion.
For example, assume that 120 target sample users were obtained in step 404; these 120 people may be called the target sample user set. In this step, sub-user groups at several user proportions are selected from the target sample user set according to the operation reward values, and the actual response values of the target sample users in each sub-user group are accumulated to obtain the corresponding accumulated sum.
For example, in the target sample user set, the top 10% of users ranked from high to low by operation reward value (e.g., the "issue a prize" probability) are selected as a sub-user group, and the response values in the training samples of these users are accumulated to obtain the accumulated sum corresponding to 10%.
As another example, the top 20% of the target sample user set, again ranked from high to low by operation reward value, are selected as a sub-user group, and the response values of these users are accumulated to obtain the accumulated sum corresponding to 20%.
In addition, subtracting the target sample user set from the training sample set leaves the non-target sample user set. For the non-target sample user set, sub-user groups at different user proportions such as 10% and 20% can also be extracted, and the response values of the non-target sample users in each sub-user group are accumulated to obtain the corresponding accumulated sums. The difference is that the sub-user groups of the non-target sample user set are extracted at random, whereas the sub-user groups of the target sample user set are extracted according to the probabilities.
In step 408, the user proportion is taken as an abscissa, and the accumulated sum corresponding to the user proportion is taken as an ordinate, so as to obtain a corresponding curve sample point; and fitting a plurality of curve sample points corresponding to a plurality of user proportions to obtain a first curve or a second curve.
This step may obtain a first curve and a second curve.
The abscissa of both the first curve and the second curve is the user proportion, e.g., 10% or 20%, and the ordinate is the accumulated sum corresponding to that user proportion, e.g., the accumulated sum for 10% or for 20%. A user proportion and its corresponding accumulated sum form one curve sample point. For example, the accumulated sum for 10% is obtained by accumulating the response values of the sample users in the extracted 10% sub-user group; the 10% proportion and its accumulated sum give one curve sample point whose abscissa is 10% and whose ordinate is the corresponding accumulated sum.
The first curve and the second curve share their start and end points, and each curve is obtained by fitting the curve sample points formed by the user proportions and their corresponding accumulated sums.
In step 410, the area between the first curve and the second curve is calculated, resulting in a cumulative gain difference.
The cumulative gain difference is the area of the region enclosed between the first curve and the second curve, which share their curve start and end points.
In step 412, the cumulative gain difference is returned as a reward to the policy decision network.
In this step, the AUCC may be returned to the policy decision network as a reward.
In step 414, the policy decision network adjusts its network parameters according to the reward, so as to obtain updated operation reward values for each business operation in the next round of training.
For example, the policy decision network may adjust its own parameters according to the AUCC reward; after the adjustment, inputting the user characteristics of a sample user into the network yields changed operation reward values for the business operations, so the target sample users selected from those values, and the AUCC calculated subsequently, change as well.
It should be noted that the network parameters of the policy decision network may be adjusted only after operation reward values have been predicted and output for all sample users in the training sample set, and the AUCC has been calculated and returned to the policy decision network. For example, with 300 training samples, the operation reward values of the business operations must be obtained for all 300 samples, target sample users and non-target sample users distinguished according to those values, and the AUCC calculated by the method above and returned to the policy decision network before the network adjusts its parameters; this corresponds to one network iteration.
The end-of-training condition of the policy decision network can be set freely. For example, training may end when the AUCC no longer rises significantly and stays at a substantially stable value; as another example, training may be terminated after a predetermined number of iterations.
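Putting the steps of FIG. 4 together, the sketch below reuses the `PolicyDecisionNetwork` and `cumulative_gain_difference` helpers outlined above. The specification only states that the AUCC is returned as the reward and that the network parameters are then adjusted; the REINFORCE-style update rule, the 40% target share, and the stopping thresholds used here are assumptions chosen to make the sketch concrete:

```python
import torch

def train_policy_decision_network(net, optimizer, features, responses,
                                  target_share=0.4, num_rounds=100, patience=5):
    """One possible training loop: each round scores every sample user, splits
    target / non-target sample users, computes the AUCC, returns it as the
    reward, and applies a REINFORCE-style parameter update (assumed here).
    `features` and `responses` are assumed to be CPU float tensors."""
    best_aucc, stalled = float("-inf"), 0
    for _ in range(num_rounds):
        probs = net(features)                       # (num_users, num_actions)
        action1_probs = probs[:, 0]                 # probability of the target operation

        # Environment side: split users and compute the cumulative gain difference.
        order = torch.argsort(action1_probs, descending=True)
        k = int(len(order) * target_share)
        target_idx, nontarget_idx = order[:k], order[k:]
        reward = cumulative_gain_difference(
            responses[target_idx].numpy(),
            action1_probs[target_idx].detach().numpy(),
            responses[nontarget_idx].numpy())

        # REINFORCE-style adjustment: raise the log-probability of the target
        # operation for the selected users in proportion to the AUCC reward.
        loss = -reward * torch.log(action1_probs[target_idx] + 1e-8).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Stop when the AUCC no longer rises noticeably, or after num_rounds.
        if reward > best_aucc + 1e-6:
            best_aucc, stalled = reward, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return net
```

A typical optimizer for this sketch would be `torch.optim.Adam(net.parameters(), lr=1e-3)`, though the learning rate is likewise an assumption.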
In this embodiment, by designing the policy decision network as a neural network, the strong generalization capability of neural networks can be exploited: the network can make decisions for states that do not occur in the training sample set. For example, for users outside the training sample set, simply inputting their user characteristics into the policy decision network predicts the net lift response after the corresponding business operation is performed on each user, and target users can then be selected according to those probabilities. In addition, because the policy decision network is trained by reinforcement learning, it is suitable for a variety of scenarios: the specific actions are not restricted, and the response values underlying the cumulative gain difference used as the reward can take various forms, such as an increase in spending amount.
The application process of the network, used to select target users from a group of users to be selected, is as follows.
FIG. 5 illustrates a process for selecting target users with a trained policy decision network; this application process may be performed at the Agent. Some users are selected from a group of users to be selected as target users so that the corresponding target business operation can be performed on them. The actions in the training phase and in the application phase of the policy decision network are the same, and the feature format of the user characteristics input into the network may also be the same; for example, if age and gender are input during training, age and gender are also input during application.
As shown in fig. 5, the method for selecting the target user may include:
In step 500, for each user in the group of users to be selected, the following processing is performed: inputting the user characteristics of the user into the pre-trained policy decision network to obtain the operation reward values, predicted and output by the policy decision network, for each business operation in the business operation set.
In this step, the group of users to be selected is, for example, a group of 800 people. The business operations may include, for example, "issue a coupon" and "do not issue a coupon". The policy decision network predicts and outputs operation reward values for these two actions, where the operation reward values are, for example, the probabilities of the respective actions.
For each of the 800 users in the group to be selected, the user characteristics are input into the policy decision network, and the probabilities of the two actions for that user are output. For example, the user characteristics of User3 are input into the policy decision network, and the probabilities of the two actions for User3 are output; a higher probability indicates a larger predicted net lift response for that user.
In step 502, according to the target operation reward value of each user in the group of users to be selected, users whose target operation reward values meet the screening condition are selected as the target users.
In this step, according to the "issue a coupon" probability of each user predicted in step 500, the users ranked in the top m% are selected as target users and issued coupons, where the specific value of m can be set according to actual business requirements.
The target operation reward value is, among the operation reward values of the respective business operations, the one corresponding to the target business operation. Illustratively, the target business operation may be executing a marketing strategy on the user, and not executing a marketing strategy on the user may be a non-target business operation.
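A short sketch of this application phase is given below, reusing the `PolicyDecisionNetwork` from above; the function name and the fixed m% cut-off are illustrative assumptions:

```python
import torch

def pick_marketing_targets(net, candidate_features, candidate_ids, m_percent=20):
    """Score every candidate user with the trained policy decision network and
    return the top m% by the predicted "issue a coupon" probability."""
    with torch.no_grad():
        probs = net(candidate_features)[:, 0]        # target-operation probability per user
    order = torch.argsort(probs, descending=True)
    k = max(1, int(len(candidate_ids) * m_percent / 100))
    return [candidate_ids[i] for i in order[:k].tolist()]
```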
In the above target user selection method, the policy decision network predicts the net lift response obtained when the business operation is performed on a user, so that the final target users can be selected from the user group according to the operation reward values output by the network, yielding a better response effect. In addition, the method has good generalization capability and extensibility, and the policy decision network is also applicable when selecting target users from user groups other than the training sample set.
FIG. 6 is a block diagram of a target user selection apparatus provided in at least one embodiment of this specification; the apparatus is used to select some users from a group of users to be selected as target users so as to perform a target business operation on them. As shown in FIG. 6, the apparatus may include: a prediction output module 61 and a user selection module 62.
The prediction output module 61 is configured to perform the following processing for each user in the user group to be selected: inputting the user characteristics of the user into a pre-trained policy decision network to obtain an operation reward value, predicted and output by the policy decision network, corresponding to the target business operation, where the operation reward value represents the predicted net lift response after the target business operation is performed on the user.
The user selection module 62 is configured to select, according to the operation reward values of the users in the user group to be selected, users whose operation reward values meet the screening condition as the target users.
In one example, when obtaining the operation reward value corresponding to the target business operation predicted and output by the policy decision network, the prediction output module 61 is configured to obtain the operation reward values predicted and output by the policy decision network for each business operation, where the business operations include the target business operation and at least one other business operation.
In one example, the target business operation is indicative of execution of a marketing strategy for the user and the other business operations are indicative of non-execution of a marketing strategy for the user.
In one example, the user selection module 62 is specifically configured to: sorting the operation reward values respectively corresponding to all the users in the user group to be selected; and selecting part of users in the user group to be selected as the target users according to the sorting result.
In one example, as shown in FIG. 7, the apparatus may further include a network training module 71, configured to train, by reinforcement learning, the deep neural network serving as the policy decision network.
The network training module 71 may include:
a sample obtaining sub-module 711, configured to obtain a training sample set, where each training sample in the training sample set includes: user characteristics of a sample user, and a response value after performing a target business operation for the sample user.
The prediction output sub-module 712 is configured to respectively input the user characteristics of each sample user into the policy decision network to be trained, so as to obtain the operation reward value of the target business operation for that sample user, as predicted and output by the policy decision network.
The screening processing sub-module 713 is configured to select, according to the operation reward value of each sample user, sample users whose operation reward values meet the screening condition from the training sample set as the target sample user set, with the remaining users in the training sample set forming the non-target sample user set.
The reward processing sub-module 714 is configured to determine the cumulative gain difference between the target sample user set and the non-target sample user set according to the response values, and to return the cumulative gain difference to the policy decision network as the reward value.
The parameter adjusting sub-module 715 is configured to adjust the network parameters of the policy decision network according to the reward value, so as to obtain updated operation reward values in the next round of training.
In one example, when determining the cumulative gain difference between the target sample user set and the non-target sample user set according to the response values, the reward processing sub-module 714 is configured to perform the following:
selecting sub-user groups with various user proportions from the set, wherein the sub-user groups are selected from a target sample user set according to an operation reward value or are randomly selected from a non-target sample user set;
for each user proportion, accumulating the response values of all sample users in the sub-user group to obtain an accumulated sum corresponding to the user proportion; taking the user proportion as an abscissa and the accumulated sum corresponding to the user proportion as an ordinate to obtain a corresponding curve sample point;
fitting a plurality of curve sample points corresponding to a plurality of user proportions to obtain a first curve and a second curve, wherein the first curve is obtained by fitting a plurality of curve sample points corresponding to a target sample user set, and the second curve is obtained by fitting a plurality of curve sample points corresponding to a non-target sample user set;
acquiring the area of the enclosed region between the first curve and the second curve as the cumulative gain difference between the target sample user set and the non-target sample user set.
Embodiments of this specification also provide an electronic device that includes a memory and a processor, the memory being configured to store computer instructions executable on the processor; the processor is configured to implement the target user selection method according to any embodiment of this specification when executing the computer instructions.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
One or more embodiments of the present description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (17)

1. A method for selecting target users, used for selecting some users from a group of users to be selected as target users so as to execute a target business operation on the target users, the method comprising:
for each user in the user group to be selected, respectively executing the following processing: inputting the user characteristics of the user into a pre-trained policy decision network to obtain an operation reward value corresponding to the target business operation predicted and output by the policy decision network, wherein the operation reward value is used for expressing a net lift response predicted value after the target business operation is executed on the user;
and selecting the user with the operation reward value meeting the screening condition as the target user according to the operation reward value of each user in the user group to be selected.
2. The method of claim 1, the inputting the user characteristics of the user into a pre-trained policy decision network, comprising:
inputting at least one of the following user characteristics of the user into a pre-trained policy decision network: user age, user gender, user geographic location information, or user's merchandise purchase history data.
3. The method of claim 1, wherein obtaining the operation reward value corresponding to the target business operation predicted to be output by the policy decision network comprises:
and obtaining operation reward values corresponding to each business operation predicted and output by the policy decision network, wherein the business operations comprise the target business operation and at least one other business operation.
4. The method of claim 3, the targeted business operation to represent execution of a marketing strategy for the user, the other business operation to represent non-execution of a marketing strategy for the user.
5. The method according to claim 1, wherein the selecting, according to the operation reward value of each user in the user group to be selected, a user whose operation reward value meets a screening condition as the target user comprises:
sorting the operation reward values respectively corresponding to all the users in the user group to be selected;
and selecting part of users in the user group to be selected as the target users according to the sorting result.
6. The method of claim 5, the operational reward value being a probability value; the sorting the operation reward values respectively corresponding to the users in the user group to be selected includes:
sorting probability values corresponding to all users in the user group to be selected in sequence from high probability values to low probability values; a higher probability value indicates a larger net lift response.
7. The method according to any one of claims 1 to 6, wherein the policy decision network is a deep neural network trained by reinforcement learning.
8. The method of claim 7, the training process of the policy decision network comprising:
obtaining a set of training samples, each training sample in the set of training samples comprising: the user characteristics of the sample user and the response value after the target business operation is executed on the sample user;
respectively inputting the user characteristics of each sample user into a policy decision network to be trained to obtain an operation reward value of the target business operation corresponding to the sample user predicted and output by the policy decision network;
selecting a plurality of sample users with operation reward values meeting screening conditions from a training sample set as a target sample user set according to the operation reward values of the sample users, and taking the rest users in the training sample set as a non-target sample user set;
determining a cumulative gain difference between the target sample user set and a non-target sample user set according to the response value;
and returning the cumulative gain difference to the policy decision network as a reward value, and adjusting the network parameters of the policy decision network according to the reward value.
9. The method of claim 8, the determining, from the response value, a cumulative gain difference between the target set of sample users and a non-target set of sample users, comprising:
selecting sub-user groups with various user proportions from the set, wherein the sub-user groups are selected from a target sample user set according to an operation reward value or are randomly selected from a non-target sample user set;
for each user proportion, accumulating the response values of all sample users in the sub-user group to obtain an accumulated sum corresponding to the user proportion; taking the user proportion as an abscissa and the accumulated sum corresponding to the user proportion as an ordinate to obtain a corresponding curve sample point;
fitting a plurality of curve sample points corresponding to a plurality of user proportions to obtain a first curve and a second curve, wherein the first curve is obtained by fitting a plurality of curve sample points corresponding to a target sample user set, and the second curve is obtained by fitting a plurality of curve sample points corresponding to a non-target sample user set;
acquiring the area of the enclosed region between the first curve and the second curve as the cumulative gain difference between the target sample user set and the non-target sample user set.
10. A selection apparatus of a target user, the apparatus being used for selecting a part of users from a group of users to be selected as target users to perform a target business operation on the target users, the apparatus comprising:
the prediction output module is used for respectively executing the following processing for each user in the user group to be selected: inputting the user characteristics of the user into a pre-trained policy decision network to obtain an operation reward value corresponding to the target business operation predicted and output by the policy decision network, wherein the operation reward value is used for expressing a net lift response predicted value after the target business operation is executed on the user;
and the user selection module is used for selecting the user with the operation reward value meeting the screening condition as the target user according to the operation reward value of each user in the user group to be selected.
11. The apparatus of claim 10, wherein
the prediction output module, when configured to obtain the operation reward value corresponding to the target business operation predicted and output by the policy decision network, is configured to: obtain operation reward values corresponding to each business operation predicted and output by the policy decision network, wherein the business operations comprise the target business operation and at least one other business operation.
12. The apparatus of claim 11, the target business operation to represent execution of a marketing strategy for the user, the other business operation to represent non-execution of a marketing strategy for the user.
13. The apparatus of claim 10, wherein
the user selection module is specifically configured to: sorting the operation reward values respectively corresponding to all the users in the user group to be selected; and selecting part of users in the user group to be selected as the target users according to the sorting result.
14. The apparatus of any of claims 10 to 13, further comprising: and the network training module is used for training the deep neural network as the strategy decision network in a reinforcement learning mode.
15. The apparatus of claim 14, the network training module, comprising:
a sample obtaining sub-module, configured to obtain a training sample set, where each training sample in the training sample set includes: the user characteristics of the sample user and the response value after the target business operation is executed on the sample user;
the prediction output sub-module is used for respectively inputting the user characteristics of each sample user into a policy decision network to be trained to obtain an operation reward value of the target business operation corresponding to the sample user, which is predicted and output by the policy decision network;
the screening processing sub-module is used for selecting a plurality of sample users with operation reward values meeting screening conditions from a training sample set as a target sample user set according to the operation reward values of all the sample users, and the rest users in the training sample set are used as a non-target sample user set;
a reward processing sub-module for determining a cumulative gain difference between the set of target sample users and a set of non-target sample users according to the response value; returning the cumulative gain difference to the policy decision network as a reward value;
and the parameter adjusting submodule is used for adjusting the network parameters of the strategy decision network according to the reward value so as to obtain the updated operation reward value in the next round of training.
16. The apparatus of claim 15, wherein the reward processing sub-module, when determining the cumulative gain difference between the target sample user set and the non-target sample user set according to the response values, is configured to:
select sub-user groups at a plurality of user proportions, wherein a sub-user group is either selected from the target sample user set according to the operation reward values or randomly selected from the non-target sample user set;
for each user proportion, accumulate the response values of all sample users in the corresponding sub-user group to obtain a cumulative sum for that user proportion, and take the user proportion as the abscissa and the cumulative sum as the ordinate to obtain a corresponding curve sample point;
fit the curve sample points corresponding to the plurality of user proportions to obtain a first curve and a second curve, wherein the first curve is fitted from the curve sample points corresponding to the target sample user set, and the second curve is fitted from the curve sample points corresponding to the non-target sample user set;
and acquire the area of the region enclosed between the first curve and the second curve as the cumulative gain difference between the target sample user set and the non-target sample user set.
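Claim 16 describes a cumulative-gain (uplift-curve style) comparison: build one curve from the reward-ranked target sample user set and one from randomly sampled non-target users, then take the enclosed area as the gain difference. A minimal numeric sketch is shown below, assuming a fixed grid of user proportions and a manual trapezoidal integration over the sample points in place of an explicit curve-fitting step.

```python
# Minimal numeric sketch of the curve-area computation in claim 16. The proportion
# grid, random seed and trapezoidal integration (instead of explicit curve fitting)
# are assumptions.
import numpy as np

def cumulative_gain_difference(target_rewards, target_responses, non_target_responses,
                               proportions=np.linspace(0.1, 1.0, 10), rng=None):
    rng = rng or np.random.default_rng(0)
    target_rewards = np.asarray(target_rewards, dtype=float)
    target_responses = np.asarray(target_responses, dtype=float)
    non_target_responses = np.asarray(non_target_responses, dtype=float)

    # First curve: for each user proportion, take the top-p fraction of target
    # sample users by operation reward value and accumulate their response values.
    order = np.argsort(-target_rewards)
    first_curve = np.array([target_responses[order[: max(1, int(p * len(order)))]].sum()
                            for p in proportions])

    # Second curve: randomly pick the same fraction of non-target sample users
    # and accumulate their response values.
    shuffled = rng.permutation(len(non_target_responses))
    second_curve = np.array([non_target_responses[shuffled[: max(1, int(p * len(shuffled)))]].sum()
                             for p in proportions])

    # Area of the region enclosed between the two curves over the proportion axis
    # (trapezoidal rule, written out to avoid version-specific numpy helpers).
    diff = first_curve - second_curve
    return float(np.sum((diff[1:] + diff[:-1]) / 2.0 * np.diff(proportions)))
```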
17. An electronic device, comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the target user selection method of any one of claims 1 to 9 when executing the computer instructions.
CN201911194019.3A 2019-11-28 2019-11-28 Target user selection method and device Active CN111027676B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911194019.3A CN111027676B (en) 2019-11-28 2019-11-28 Target user selection method and device

Publications (2)

Publication Number Publication Date
CN111027676A 2020-04-17
CN111027676B (en) 2022-03-18

Family

ID=70203216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911194019.3A Active CN111027676B (en) 2019-11-28 2019-11-28 Target user selection method and device

Country Status (1)

Country Link
CN (1) CN111027676B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147485A1 (en) * 2006-12-14 2008-06-19 International Business Machines Corporation Customer Segment Estimation Apparatus
CN106157083A (en) * 2015-04-14 2016-11-23 阿里巴巴集团控股有限公司 The method and apparatus excavating potential customers
CN108197993A (en) * 2017-12-29 2018-06-22 广州特逆特科技有限公司 A kind of method and server for being drained for businessman, excavating potential customers
CN109711871A (en) * 2018-12-13 2019-05-03 北京达佳互联信息技术有限公司 A kind of potential customers determine method, apparatus, server and readable storage medium storing program for executing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ADIL MAHMUD CHOUDHURY, ET AL.: "A Machine Learning Approach to Identify Potential Customer Based on Purchase Behavior", 2019 ICREST *
CHEN Xiaoqin: "Research on the Application of Potential Customer Data Mining Based on Probabilistic Neural Networks", China Excellent Master's and Doctoral Dissertations Full-text Database (Master's), Economics and Management Sciences *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348587A (en) * 2020-11-16 2021-02-09 脸萌有限公司 Information pushing method and device and electronic equipment
CN112348587B (en) * 2020-11-16 2024-04-23 脸萌有限公司 Information pushing method and device and electronic equipment
CN114022178A (en) * 2021-09-28 2022-02-08 上海画龙信息科技有限公司 Equity distribution method and device based on response promotion model and electronic equipment

Similar Documents

Publication Publication Date Title
CN110033314B (en) Advertisement data processing method and device
CN109389431B (en) Method and device for distributing coupons, electronic equipment and readable storage medium
CN111027676B (en) Target user selection method and device
CN109544197A (en) A kind of customer churn prediction technique and device
CN107977864A (en) A kind of customer insight method and system suitable for financial scenario
CN110543947B (en) Rewarding resource issuing method and device based on reinforcement learning model
US20200074498A1 (en) Systems and methods for improving social media advertising efficiency
Sarlija et al. Comparison procedure of predicting the time to default in behavioural scoring
CN111598632B (en) Method and device for determining equity shares and equity share sequence
Wagh et al. Customer churn prediction in telecom sector using machine learning techniques
CN111062774B (en) Activity delivery method and device, electronic equipment and computer readable medium
Erev et al. Generality, repetition, and the role of descriptive learning models
Ahn et al. A novel customer scoring model to encourage the use of mobile value added services
US20220414686A1 (en) Automated Testing of Forms
d’Hondt et al. Evaluation of computer-tailored motivational messaging in a health promotion context
CN116645134A (en) Method, device, equipment and medium for recommending credit card in stages
US20230351433A1 (en) Training an artificial intelligence engine for most appropriate products
CN115170195A (en) Information processing method and device, electronic equipment and computer readable medium
CN114418604A (en) Method and device for predicting survival promotion success probability of palm silver easy-to-lose customers and storage medium
KR102234068B1 (en) Lottery purchase supporting apparatus and method thereof
CN110413358A (en) The methods of exhibiting and device of the page
Hadden A customer profiling methodology for churn prediction
CN114581114A (en) Popularization resource allocation method, system, device and storage medium
CN113144605B (en) User behavior simulation method and device, electronic equipment and storage medium
Shono et al. Customer analysis of monthly-charged mobile content aiming at prolonging subscription period

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant