CN109711871B - Potential customer determination method, device, server and readable storage medium

Potential customer determination method, device, server and readable storage medium

Info

Publication number
CN109711871B
CN109711871B
Authority
CN
China
Prior art keywords
action
user
product
information
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811526942.8A
Other languages
Chinese (zh)
Other versions
CN109711871A (en)
Inventor
盛名扬
陆子龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201811526942.8A priority Critical patent/CN109711871B/en
Publication of CN109711871A publication Critical patent/CN109711871A/en
Application granted granted Critical
Publication of CN109711871B publication Critical patent/CN109711871B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a potential customer determination method, apparatus, server and readable storage medium. The method comprises the following steps: obtaining state information of a product in a platform, the state information including a first profit value brought to the platform by the product, a second profit value brought to platform users of the platform by the product, and a third profit value brought by the product to the product users who use it; inputting the state information and a plurality of pieces of action information of a user to be analyzed into a pre-trained deep reinforcement learning model to obtain an estimated long-term feedback value corresponding to each piece of action information; and determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimated value output by the deep reinforcement learning model. In this way, potential customers can be determined through the deep reinforcement learning model, which improves the efficiency of determining potential customers and reduces labor cost.

Description

Potential customer determination method, device, server and readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a server, and a readable storage medium for determining a potential customer.
Background
To better promote products (such as a fans-headline promotion product) in a live broadcast platform, the potential customers of a product often need to be determined, so that those users can be attracted, in a targeted manner, to use the product. A potential user generally refers to a customer who intends to purchase the product but has not yet become a user of the product.
Currently, potential customers of a product are determined manually: a staff member judges which users are potential customers based on past experience. However, this approach demands considerable experience from the staff, and the staff must analyze a large number of users to identify potential customers. In other words, this manner of identifying potential customers is inefficient and requires significant labor cost.
Disclosure of Invention
To overcome the problems in the related art, the present application provides a potential customer determination method, apparatus, server, and readable storage medium, so as to improve the efficiency of determining potential customers and reduce the labor cost.
According to a first aspect of embodiments of the present application, there is provided a potential user determination method, including:
obtaining status information of a product in the platform; the state information includes: a first profit value brought to the platform by the product, a second profit value brought to a platform user of the platform by the product and a third profit value brought to a product user using the product by the product;
inputting the state information and a plurality of pieces of action information of a user to be analyzed into a pre-trained deep reinforcement learning model to obtain an estimated long-term feedback value corresponding to each piece of action information; each piece of action information at least comprises: feature information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a forgoing-order action; and determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimated value output by the deep reinforcement learning model.
Optionally, in an embodiment of the present application, the deep reinforcement learning model includes a deep Q network model.
Optionally, in this embodiment of the application, before the step of inputting the state information and the plurality of pieces of action information of the user to be analyzed into the pre-trained deep reinforcement learning model, the method may further include:
constructing a Markov decision process model; wherein the Markov decision process model is: { S, A, R, T }; s represents the state information of the product, A represents the action information of the platform user for the action executed by the product, R represents the reward function, and T represents the state transition function;
obtaining a plurality of training samples based on the Markov decision process model; wherein each training sample comprises: historical state information of the product, action information of an action performed on the product by a target user among the platform users under that state information, an instant reward value obtained after the target user performs the target action in the action information, and next state information to which the state information transitions after the target action is performed; the target action is: an ordering action or a forgoing-order action;
optimizing parameters of the initial Q function by using the training sample to obtain a trained deep Q network model; the deep neural network corresponding to the initial Q function consists of two convolutional layers and two fully-connected layers; the parameters include: learning rate, discount factor, and Q value.
Optionally, in this embodiment of the present application, optimizing parameters of the initial Q function by using a training sample, to obtain a trained deep Q network model includes:
optimizing the parameters of the initial Q function by using the training samples and a greedy algorithm, namely the ε-greedy algorithm, to obtain the trained deep Q network model.
Optionally, in this embodiment of the application, the instant reward value output by the reward function is defined as: the value corresponding to the ordering action = first positive number × the profit value added to the platform + second positive number × the profit value added to the platform users + third positive number × the profit value added to the user to be analyzed; the value corresponding to the forgoing-order action = a first negative number; and the value corresponding to the ordering action is 1 minus the value corresponding to the forgoing-order action.
Optionally, in this embodiment of the present application, the identification of the ordering action for the product includes: one or more of a first identification to perform a subscription action based on the user recommendation, a second identification to perform the subscription action based on the privacy recommendation, a third identification to perform the subscription action based on the coupon activity, and a fourth identification to perform the subscription action through a subscription portal in the platform.
Optionally, in this embodiment of the present application, the feature information of the user to be analyzed includes:
one or more of the account information, the number of fans, the number of live broadcast works and the preferred work types of the user to be analyzed.
According to a second aspect of embodiments of the present application, there is provided a potential user determination apparatus, the apparatus comprising:
a first obtaining module configured to obtain status information of a product in a platform; the state information includes: a first profit value brought to the platform by the product, a second profit value brought to a platform user of the platform by the product and a third profit value brought to a product user using the product by the product;
an input module configured to input the state information and the plurality of pieces of action information of the user to be analyzed into the pre-trained deep reinforcement learning model to obtain an estimated long-term feedback value corresponding to each piece of action information; each piece of action information at least comprises: feature information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a forgoing-order action;
and the determining module is configured to determine whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
Optionally, in an embodiment of the present application, the deep reinforcement learning model includes a deep Q network model.
Optionally, in an embodiment of the present application, the apparatus further includes:
a building module configured to build a Markov decision process model before the state information and the plurality of pieces of action information of the user to be analyzed are input into the pre-trained deep reinforcement learning model; wherein the Markov decision process model is: {S, A, R, T}; S represents the state information of the product, A represents action information of actions performed on the product by the platform users, R represents the reward function, and T represents the state transition function;
a second obtaining module configured to obtain a plurality of training samples based on the Markov decision process model; wherein each training sample comprises: historical state information of the product, action information of an action performed on the product by a target user among the platform users under that state information, an instant reward value obtained after the target user performs the target action in the action information, and next state information to which the state information transitions after the target action is performed; the target action is: an ordering action or a forgoing-order action;
the optimization module is configured to optimize parameters of the initial Q function by using the training samples to obtain a trained deep Q network model; the deep neural network corresponding to the initial Q function consists of two convolutional layers and two fully-connected layers; the parameters include: learning rate, discount factor, and Q value.
Optionally, in this embodiment of the present application, the optimization module is specifically configured to:
optimizing the parameters of the initial Q function by using the training samples and a greedy algorithm, namely the ε-greedy algorithm, to obtain the trained deep Q network model.
Optionally, in this embodiment of the application, the instant reward value output by the reward function is defined as: the value corresponding to the ordering action = first positive number × the profit value added to the platform + second positive number × the profit value added to the platform users + third positive number × the profit value added to the user to be analyzed; the value corresponding to the forgoing-order action = a first negative number; and the value corresponding to the ordering action is 1 minus the value corresponding to the forgoing-order action.
Optionally, in this embodiment of the present application, the identification of the ordering action for the product includes: one or more of a first identification to perform a subscription action based on the user recommendation, a second identification to perform the subscription action based on the privacy recommendation, a third identification to perform the subscription action based on the coupon activity, and a fourth identification to perform the subscription action through a subscription portal in the platform.
Optionally, in this embodiment of the present application, the feature information of the user to be analyzed includes:
one or more of the account information, the number of fans, the number of live broadcast works and the preferred work types of the user to be analyzed.
According to a third aspect of embodiments of the present application, there is provided a server, including:
a processor, a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method steps of any one of the potential user determination methods of the first aspect described above.
According to a fourth aspect of embodiments herein, there is provided a readable storage medium, wherein instructions, when executed by a processor of a server, enable the server to perform the method steps of any one of the potential user determination methods of the first aspect.
According to a fifth aspect of embodiments herein, there is provided a computer program product which, when run on a server, causes the server to perform: method steps of any of the potential user determination methods of the first aspect described above.
In an embodiment of the present application, a server may obtain status information of products in a platform. The status information may include: the first profit value brought to the platform by the product, the second profit value brought to the platform user by the product and the third profit value brought to the product user using the product by the product. Then, the state information and the plurality of action information of the user to be analyzed are input into a deep reinforcement learning model obtained through pre-training, and an estimated value of long-term feedback corresponding to each action information is obtained. Wherein each action information at least comprises: the characteristic information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a quitting ordering action. And then, determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
Because the deep reinforcement learning model can establish the optimal mapping relation between the state information and the action information, the server can determine the optimal action corresponding to the current state information of the product through the deep reinforcement learning model, namely, the optimal action of the user to be analyzed on the product can be determined through the deep reinforcement learning model. Further, when the action is determined to be a subscription action, then the user to be analyzed may be determined to be a potential user. In this way, the potential customers can be determined through the deep reinforcement learning model, so that the efficiency of determining the potential customers is improved, and the labor cost can be reduced. Moreover, the potential user determining method can determine the potential user under the condition of ensuring the benefits of the platform, the platform user and the product user.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart illustrating a method of potential user determination according to an example embodiment.
FIG. 2 is a flow chart illustrating a calculation of Q according to an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating the structure of a deep neural network in accordance with an exemplary embodiment.
Fig. 4 is a block diagram illustrating a potential user determination device in accordance with an example embodiment.
FIG. 5 is a block diagram illustrating a server in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
In order to solve the problems that the method for determining the potential customer is inefficient and requires a large amount of labor cost in the prior art, embodiments of the present application provide a potential customer determination method, apparatus, server, and computer-readable storage medium.
The following first explains a potential customer determination method provided in an embodiment of the present application.
FIG. 1 is a flow diagram illustrating a method of potential customer determination in accordance with an exemplary embodiment. The potential customer determination method is applied to a server, and as shown in fig. 1, the method comprises the following steps:
s101: obtaining status information of a product in the platform; the state information includes: a first profit value brought to the platform by the product, a second profit value brought to a platform user of the platform by the product and a third profit value brought to a product user using the product by the product;
the platform in the embodiment of the application can be a live broadcast platform, and the product can be a vermicelli product in the live broadcast platform, but is not limited to the vermicelli product.
It will be appreciated that, in one implementation, the product status information obtained by the server may include: the first profit value brought to the platform by the product, the second profit value brought to the platform user by the product and the third profit value brought to the product user using the product by the product.
The first profit value brought to the platform by the product can be calculated from the number of product users and the product price. For example, the first profit value = the number of product users × the product price.
The second profit value brought to the platform users by the product can be obtained as follows: the server counts the increase in the duration for which the platform users use the platform, and then quantifies that increase as the second profit value that the product brings to the platform users. For example, the second profit value = the increase in platform-usage duration × a first profit coefficient, where the first profit coefficient may represent the potential profit value brought by each unit of increased duration.
The third profit value brought by the product to the product users who use it can be calculated as follows: the server counts the fan increase of the product user and the number of newly added works of the product user, and then quantifies the fan increase and the number of newly added works as the third profit value that the product brings to the product user. For example, the third profit value = the fan increase of the product user × a second profit coefficient + the number of newly added works of the product user × a third profit coefficient, where the second profit coefficient may represent the potential profit value brought by each added fan, and the third profit coefficient may represent the potential profit value brought by each added work.
In another implementation, in addition to the above, the state information may further include one or more of: the total number of platform users, the order distribution of the product (e.g., the number of users who have purchased 10 orders of the product and the number of users who have purchased 20 orders of the product), and the total delivered volume of the product (e.g., the total exposure of the product in live broadcast works).
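As a concrete illustration of how such state information might be assembled, the following is a minimal Python sketch; the field names, helper parameters and coefficients are assumptions made for illustration only and are not prescribed by this embodiment.

```python
from dataclasses import dataclass

@dataclass
class ProductState:
    """State information of the product, as described above."""
    platform_profit: float        # first profit value brought to the platform
    platform_user_profit: float   # second profit value brought to the platform users
    product_user_profit: float    # third profit value brought to the product users

def build_product_state(num_product_users: int, product_price: float,
                        usage_duration_increase: float, fan_increase: int,
                        new_works: int, c1: float, c2: float, c3: float) -> ProductState:
    # first profit value = number of product users x product price
    platform_profit = num_product_users * product_price
    # second profit value = increase in platform-usage duration x first profit coefficient
    platform_user_profit = usage_duration_increase * c1
    # third profit value = fan increase x second coefficient + new works x third coefficient
    product_user_profit = fan_increase * c2 + new_works * c3
    return ProductState(platform_profit, platform_user_profit, product_user_profit)
```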
S102: inputting the state information and a plurality of pieces of action information of a user to be analyzed into a pre-trained deep reinforcement learning model to obtain an estimated long-term feedback value corresponding to each piece of action information; each piece of action information at least comprises: feature information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a forgoing-order action;
the deep reinforcement learning model includes a deep Q network model (i.e., DQN model), but is not limited thereto.
The feature information of the user to be analyzed may include: one or more of the account information, the number of fans, the number of live broadcast works and the preferred work types of the user to be analyzed on the platform. Of course, it is also reasonable for the feature information to include embedding features obtained by pre-training.
Also, in one implementation, the action identification may be an identification of an order action for the product or an identification of a forgoing order action. In this implementation, the identification of the ordering action for the product corresponds to one action information, and the identification of the forgoing ordering action for the product corresponds to another action information. In this way, after the state information of the product and the two pieces of action information are input into the trained deep reinforcement learning model, the estimation value of the long-term feedback corresponding to each piece of action information can be output.
In another implementation, the identification of the ordering action for the product may further include one or more of: a first identification of performing the ordering action based on a user recommendation, a second identification of performing the ordering action based on a privacy recommendation, a third identification of performing the ordering action based on a coupon activity, and a fourth identification of performing the ordering action through an ordering portal in the platform. When the identification of the ordering action for the product includes the first, second, third and fourth identifications, each of these four identifications corresponds to one piece of action information, and the identification of the forgoing-order action for the product corresponds to one further piece of action information, giving five pieces of action information in total. Then, after the state information of the product and the five pieces of action information are input into the trained deep reinforcement learning model, the model may output an estimated long-term feedback value corresponding to each piece of action information.
When the deep reinforcement learning model is the DQN model, as shown in fig. 2, after the state information of the product and the five pieces of action information are input into the trained DQN model, the DQN model may output a Q value corresponding to each piece of action information, i.e., Q1, Q2, Q3, Q4 and Q5.
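For illustration, building the candidate pieces of action information and collecting a Q value for each can be sketched as follows. The action names, identifier values and the assumption that the trained model is callable as q_model(state, action_info) are illustrative conventions, not part of this embodiment.

```python
def build_action_infos(user_features: dict) -> list:
    """Build the candidate pieces of action information for one user to be analyzed.
    The identifiers 0-4 (four ordering channels plus the forgoing-order action) are
    assumed values for illustration."""
    action_ids = {
        "order_via_user_recommendation": 0,
        "order_via_privacy_recommendation": 1,
        "order_via_coupon_activity": 2,
        "order_via_platform_portal": 3,
        "forgo_order": 4,
    }
    return [{"user_features": user_features, "action_id": aid, "action_name": name}
            for name, aid in action_ids.items()]

def score_actions(q_model, state, action_infos) -> dict:
    """Feed the product state and each piece of action information to the trained model
    and collect the estimated long-term feedback (Q1..Q5) per action."""
    return {info["action_name"]: q_model(state, info) for info in action_infos}
```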
In addition, it can be understood that the server may build a Markov decision process model before performing step S102, and a plurality of training samples may then be obtained based on the Markov decision process model.
When the constructed Markov decision process model is {S, A, R, T}, each training sample includes: historical state information of the product, action information of an action performed on the product by a target user among the platform users under that state information, an instant reward value obtained after the target user performs the target action in the action information, and next state information to which the state information transitions after the target action is performed. The target action is: an ordering action or a forgoing-order action.
The historical status information may be set according to the setting mode of the status information of the product in step S101, which is not described herein again.
In addition, R = R(s, a, s′), where R denotes the instant reward value obtained when action a is performed in the state corresponding to state information s and the state transitions to the state corresponding to state information s′. T = T(s, a, s′), where T denotes the probability of transitioning to state s′ when action a is performed in state s. In addition, according to the deep reinforcement learning related art, the transition from the state corresponding to state information s is determined by the action taken under that state information.
In one example of the present application, the instant reward value output by the reward function may be defined as follows: the value corresponding to the ordering action = first positive number × the profit value added to the platform + second positive number × the profit value added to the platform users + third positive number × the profit value added to the user to be analyzed; the value corresponding to the forgoing-order action = a first negative number; and the value corresponding to the ordering action is 1 minus the value corresponding to the forgoing-order action. Of course, the design of the reward function is not limited thereto. The values of the first positive number, the second positive number, the third positive number and the first negative number may be set according to actual conditions, and are not specifically limited herein.
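A minimal sketch of such a reward function follows; the weight and penalty defaults, and the identifier of the forgoing-order action, are placeholders (assumptions) to be set according to actual conditions.

```python
def instant_reward(action_id: int, platform_gain: float, platform_user_gain: float,
                   analyzed_user_gain: float,
                   w1: float = 1.0, w2: float = 1.0, w3: float = 1.0,
                   penalty: float = -1.0) -> float:
    """Instant reward as described above: a weighted sum of the added profit values
    for an ordering action, and a first negative number for the forgoing-order action."""
    FORGO_ORDER = 4  # assumed identifier of the forgoing-order action
    if action_id == FORGO_ORDER:
        return penalty
    return w1 * platform_gain + w2 * platform_user_gain + w3 * analyzed_user_gain
```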
In addition, when the deep reinforcement learning model to be trained is the DQN model, after the training samples are obtained, the server can also optimize the parameters of the initial Q function by using the training samples to obtain the trained DQN model. The deep neural network corresponding to the initial Q function may be composed of two convolutional layers and two fully-connected layers shown in fig. 3. The parameters include: learning rate, discount factor, and Q value. The DQN model obtained by training stores the learned knowledge, and can be used as a mapping relation between state information and optimal actions.
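For illustration only, a deep neural network with two convolutional layers and two fully-connected layers that takes the concatenated state information and action information as input and outputs a scalar Q value could be sketched as below; the layer widths, kernel sizes and the use of PyTorch are assumptions, not part of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Q(state, action information) -> scalar, built from two conv layers and two FC layers."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(8, 16, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(16 * input_dim, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim); action: (batch, action_dim) = user features + action id
        x = torch.cat([state, action], dim=1).unsqueeze(1)  # (batch, 1, input_dim)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc1(x.flatten(1)))
        return self.fc2(x).squeeze(-1)                      # (batch,) Q values
```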
Specifically, Q(S, A) may be defined as the Q value in the original state, Q(S′, a) as the Q value after S is transformed into S′ by the effect of action a, and W as the forward propagation of the deep neural network; then:
Q(S′, a) = W(S, a, feature information of the user to be analyzed)
When the network W receives the original state S, the action a and the feature information of the user to be analyzed as input, the model optimization function is:
Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)];
S ← S′;
The above steps are iterated in a loop until convergence.
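A minimal training step mirroring the update rule above could look as follows. Here the learning rate α is handled by the optimizer, q_net is any module mapping (state, action information) to a scalar Q value (for instance the QNet sketched earlier), and the batch layout and the default discount factor are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma: float = 0.9) -> float:
    """One optimization step on a batch of training samples (S, A, R, S')
    drawn from the Markov decision process described above."""
    s, a, r, s_next = batch["state"], batch["action"], batch["reward"], batch["next_state"]
    q_sa = q_net(s, a)                                   # Q(S, A)
    with torch.no_grad():
        # max over the candidate actions a available in the next state
        q_next = torch.stack([q_net(s_next, cand) for cand in batch["candidate_actions"]])
        target = r + gamma * q_next.max(dim=0).values    # R + gamma * max_a Q(S', a)
    loss = F.smooth_l1_loss(q_sa, target)                # temporal-difference error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```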
In this strategy, it is required to solve for the action a that maximizes Q(S′, a); here a greedy algorithm, the ε-greedy algorithm, is used:
a = argmax_a Q(a), with probability 1 − ε;
a = a randomly selected action, with probability ε;
wherein the algorithm can achieve the explore-exploit balance by adjusting the probability threshold ε.
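The selection rule above can be sketched directly; this is a minimal illustration that assumes the Q values per candidate action have already been computed (e.g., by score_actions above).

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1) -> str:
    """Pick the action with the largest Q value with probability 1 - epsilon,
    otherwise pick a random action (exploration) with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)
```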
The initial Q function is a function in the DQN-related technique, and the learning rate, the discount factor and the Q value are parameters in the DQN-related technique, which are not described in detail herein.
In addition, after the trained DQN model is obtained, the new training sample can be used for carrying out parameter fine tuning on the DQN model, so that the DQN model is updated. It is reasonable that the update cycle (e.g. 1 week) of the DQN model can be adjusted according to specific requirements, so that the DQN model has better extensibility and robustness, and thus the DQN model can more accurately determine whether a user is a potential customer.
S103: and determining whether the user to be analyzed is a potential user of the product or not according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
When the action corresponding to the maximum estimated value output by the deep reinforcement learning model is an ordering action, the user to be analyzed can be determined to be a potential user, and product recommendation information (such as an advertisement) can be sent to the user to be analyzed, so that the potential user is converted into an actual user and the obtained estimated long-term feedback value is maximized. Furthermore, when the action corresponding to the maximum estimated value is the ordering action corresponding to the third identification, namely performing the ordering action based on a coupon activity, coupon-activity information can be sent to the user to be analyzed, so that the advertisement accurately reaches the user and the conversion rate from potential user to actual user is improved. In addition, when the action corresponding to the maximum estimated value output by the deep reinforcement learning model is the forgoing-order action, it can be determined that the user to be analyzed is not a potential user.
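Putting the inference step together, a minimal decision sketch could be written as below; the action names, the example Q values and the recommendation dispatch are illustrative assumptions, not prescribed values.

```python
def decide_potential_user(q_values: dict) -> tuple:
    """Return (is_potential_user, best_action): the user is treated as a potential user
    when the action with the maximum estimated long-term feedback is an ordering action
    rather than the forgoing-order action."""
    best_action = max(q_values, key=q_values.get)
    return best_action != "forgo_order", best_action

# Example usage with scores produced by score_actions(...) above (values are made up):
q_values = {"order_via_user_recommendation": 0.8, "order_via_privacy_recommendation": 0.5,
            "order_via_coupon_activity": 1.2, "order_via_platform_portal": 0.7,
            "forgo_order": -0.3}
is_potential, best_action = decide_potential_user(q_values)
if is_potential and best_action == "order_via_coupon_activity":
    pass  # e.g., send coupon-activity information to the user to be analyzed
```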
In addition, the estimated long-term feedback value corresponding to a piece of action information is the estimated long-term feedback obtained after the action corresponding to that action information is performed; therefore, the larger the estimated value, the better it matches the goal to be achieved: maximizing the total revenue of the platform, the platform users and the product users.
Moreover, the deep reinforcement learning model not only optimizes the short-term click benefit (i.e. the instant reward value), but also captures the long-term benefit index (i.e. the estimated value of the long-term feedback). Therefore, by applying the potential user determination method provided by the embodiment of the application, the ordering behavior can bring the improvement of the long-term income index.
In an embodiment of the present application, a server may obtain status information of products in a platform. The status information may include: the first profit value brought to the platform by the product, the second profit value brought to the platform user by the product and the third profit value brought to the product user using the product by the product. Then, the state information and the plurality of action information of the user to be analyzed are input into a deep reinforcement learning model obtained through pre-training, and an estimated value of long-term feedback corresponding to each action information is obtained. Wherein each action information at least comprises: the characteristic information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a quitting ordering action. And then, determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
Because the deep reinforcement learning model can establish the optimal mapping relation between the state information and the action information, the server can determine the optimal action corresponding to the current state information of the product through the deep reinforcement learning model, namely, the optimal action of the user to be analyzed on the product can be determined through the deep reinforcement learning model. Further, when the action is determined to be a subscription action, then the user to be analyzed may be determined to be a potential user. In this way, the potential customers can be determined through the deep reinforcement learning model, so that the efficiency of determining the potential customers is improved, and the labor cost can be reduced. Moreover, the potential user determining method can determine the potential user under the condition of ensuring the benefits of the platform, the platform user and the product user.
In conclusion, by applying the potential user determination method provided by the embodiment of the application, the potential customers can be determined through the deep reinforcement learning model, so that the efficiency of determining the potential customers is improved, and the labor cost can be reduced. Moreover, potential users can be determined under the condition that the benefits of the platform, platform users and product users are guaranteed.
Corresponding to the foregoing method embodiment, an embodiment of the present application further provides a potential user determining apparatus, and referring to fig. 4, the apparatus includes:
a first obtaining module 401 configured to obtain status information of a product in a platform; the state information includes: a first profit value brought to the platform by the product, a second profit value brought to a platform user of the platform by the product and a third profit value brought to a product user using the product by the product;
an input module 402 configured to input the state information and the plurality of pieces of action information of the user to be analyzed into the pre-trained deep reinforcement learning model to obtain an estimated long-term feedback value corresponding to each piece of action information; each piece of action information at least comprises: feature information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a forgoing-order action;
and the determining module 403 is configured to determine whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
By applying the device provided by the embodiment of the application, the server can obtain the state information of the product in the platform. The status information may include: the first profit value brought to the platform by the product, the second profit value brought to the platform user by the product and the third profit value brought to the product user using the product by the product. Then, the state information and the plurality of action information of the user to be analyzed are input into a deep reinforcement learning model obtained through pre-training, and an estimated value of long-term feedback corresponding to each action information is obtained. Wherein each action information at least comprises: the characteristic information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a quitting ordering action. And then, determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
Because the deep reinforcement learning model can establish the optimal mapping relation between the state information and the action information, the server can determine the optimal action corresponding to the current state information of the product through the deep reinforcement learning model, namely, the optimal action of the user to be analyzed on the product can be determined through the deep reinforcement learning model. Further, when the action is determined to be a subscription action, then the user to be analyzed may be determined to be a potential user. In this way, the potential customers can be determined through the deep reinforcement learning model, so that the efficiency of determining the potential customers is improved, and the labor cost can be reduced. Moreover, the potential user determining method can determine the potential user under the condition of ensuring the benefits of the platform, the platform user and the product user.
Optionally, in an embodiment of the present application, the deep reinforcement learning model includes a deep Q network model.
Optionally, in an embodiment of the present application, the apparatus further includes:
a building module configured to build a Markov decision process model before the state information and the plurality of pieces of action information of the user to be analyzed are input into the pre-trained deep reinforcement learning model; wherein the Markov decision process model is: {S, A, R, T}; S represents the state information of the product, A represents action information of actions performed on the product by the platform users, R represents the reward function, and T represents the state transition function;
a second obtaining module configured to obtain a plurality of training samples based on the Markov decision process model; wherein each training sample comprises: historical state information of the product, action information of an action performed on the product by a target user among the platform users under that state information, an instant reward value obtained after the target user performs the target action in the action information, and next state information to which the state information transitions after the target action is performed; the target action is: an ordering action or a forgoing-order action;
the optimization module is configured to optimize parameters of the initial Q function by using the training samples to obtain a trained deep Q network model; the deep neural network corresponding to the initial Q function consists of two convolutional layers and two fully-connected layers; the parameters include: learning rate, discount factor, and Q value.
Optionally, in this embodiment of the present application, the optimization module is specifically configured to:
optimizing the parameters of the initial Q function by using the training samples and a greedy algorithm, namely the ε-greedy algorithm, to obtain the trained deep Q network model.
Optionally, in this embodiment of the application, the instant reward value output by the reward function is defined as: the value corresponding to the ordering action = first positive number × the profit value added to the platform + second positive number × the profit value added to the platform users + third positive number × the profit value added to the user to be analyzed; the value corresponding to the forgoing-order action = a first negative number; and the value corresponding to the ordering action is 1 minus the value corresponding to the forgoing-order action.
Optionally, in this embodiment of the present application, the identification of the ordering action for the product includes: one or more of a first identification to perform a subscription action based on the user recommendation, a second identification to perform the subscription action based on the privacy recommendation, a third identification to perform the subscription action based on the coupon activity, and a fourth identification to perform the subscription action through a subscription portal in the platform.
Optionally, in this embodiment of the present application, the feature information of the user to be analyzed includes:
one or more of the account information, the number of fans, the number of live broadcast works and the preferred work types of the user to be analyzed.
Fig. 5 is a block diagram illustrating an apparatus 1900 for implementing the determination of potential users in accordance with an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to FIG. 5, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform method steps of any of the potential user determination methods described above.
The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an embodiment of the present application, a server may obtain status information of products in a platform. The status information may include: the first profit value brought to the platform by the product, the second profit value brought to the platform user by the product and the third profit value brought to the product user using the product by the product. Then, the state information and the plurality of action information of the user to be analyzed are input into a deep reinforcement learning model obtained through pre-training, and an estimated value of long-term feedback corresponding to each action information is obtained. Wherein each action information at least comprises: the characteristic information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a quitting ordering action. And then, determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
Because the deep reinforcement learning model can establish the optimal mapping relation between the state information and the action information, the server can determine the optimal action corresponding to the current state information of the product through the deep reinforcement learning model, namely, the optimal action of the user to be analyzed on the product can be determined through the deep reinforcement learning model. Further, when the action is determined to be a subscription action, then the user to be analyzed may be determined to be a potential user. In this way, the potential customers can be determined through the deep reinforcement learning model, so that the efficiency of determining the potential customers is improved, and the labor cost can be reduced. Moreover, the potential user determining method can determine the potential user under the condition of ensuring the benefits of the platform, the platform user and the product user.
Corresponding to the above method embodiment, the present application further provides a readable storage medium, and when executed by a processor of a server, the instructions in the storage medium enable the server to perform the method steps of any one of the above potential user determination methods. Wherein the readable storage medium is a computer readable storage medium.
After the computer program stored in the readable storage medium provided by the embodiment of the application is executed by the processor of the server, the server can obtain the state information of the product in the platform. The status information may include: the first profit value brought to the platform by the product, the second profit value brought to the platform user by the product and the third profit value brought to the product user using the product by the product. Then, the state information and the plurality of action information of the user to be analyzed are input into a deep reinforcement learning model obtained through pre-training, and an estimated value of long-term feedback corresponding to each action information is obtained. Wherein each action information at least comprises: the characteristic information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a quitting ordering action. And then, determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
Because the deep reinforcement learning model can establish the optimal mapping relation between the state information and the action information, the server can determine the optimal action corresponding to the current state information of the product through the deep reinforcement learning model, namely, the optimal action of the user to be analyzed on the product can be determined through the deep reinforcement learning model. Further, when the action is determined to be a subscription action, then the user to be analyzed may be determined to be a potential user. In this way, the potential customers can be determined through the deep reinforcement learning model, so that the efficiency of determining the potential customers is improved, and the labor cost can be reduced. Moreover, the potential user determining method can determine the potential user under the condition of ensuring the benefits of the platform, the platform user and the product user.
Corresponding to the above method embodiment, this application embodiment also provides a computer program product, which, when run on a server, causes the server to perform the method steps of any one of the above potential user determination methods.
After the computer program product provided by the embodiment of the application is executed by the processor of the server, the server can obtain the state information of the product in the platform. The status information may include: the first profit value brought to the platform by the product, the second profit value brought to the platform user by the product and the third profit value brought to the product user using the product by the product. Then, the state information and the plurality of action information of the user to be analyzed are input into a deep reinforcement learning model obtained through pre-training, and an estimated value of long-term feedback corresponding to each action information is obtained. Wherein each action information at least comprises: the characteristic information of the user to be analyzed and an action identifier, wherein the action identifier is an identifier of an ordering action for the product or an identifier of a quitting ordering action. And then, determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
Because the deep reinforcement learning model can establish the optimal mapping relation between the state information and the action information, the server can determine the optimal action corresponding to the current state information of the product through the deep reinforcement learning model, namely, the optimal action of the user to be analyzed on the product can be determined through the deep reinforcement learning model. Further, when the action is determined to be a subscription action, then the user to be analyzed may be determined to be a potential user. In this way, the potential customers can be determined through the deep reinforcement learning model, so that the efficiency of determining the potential customers is improved, and the labor cost can be reduced. Moreover, the potential user determining method can determine the potential user under the condition of ensuring the benefits of the platform, the platform user and the product user.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, server, computer-readable storage medium, and computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for related matters, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (16)

1. A method for potential user determination, the method comprising:
obtaining status information of a product in the platform; the state information includes: a first profit value brought to the platform by the product, a second profit value brought to a platform user of the platform by the product, and a third profit value brought to a product user using the product by the product;
inputting the state information and a plurality of pieces of action information of a user to be analyzed into a pre-trained deep reinforcement learning model to obtain an estimated long-term feedback value corresponding to each piece of action information; each piece of action information at least comprises: feature information of the user to be analyzed and an action identifier, the action identifier being the identifier of an ordering action for the product or the identifier of a forgoing-order action, wherein the pre-trained deep reinforcement learning model is: a trained deep Q network model obtained by optimizing parameters of an initial Q function using training samples, wherein each training sample comprises historical state information of the product, action information of an action performed on the product by a target user among the platform users under that state information, an instant reward value obtained after the target user performs the target action in the action information, and next state information to which the state information transitions after the target action is performed, and the parameters comprise: learning rate, discount factor, and Q value;
and determining whether the user to be analyzed is a potential user of the product according to the action corresponding to the maximum estimation value output by the deep reinforcement learning model.
2. The method of claim 1, wherein the deep reinforcement learning model comprises a deep Q network model.
3. The method of claim 2, wherein prior to the step of inputting the state information and the plurality of motion information of the user to be analyzed into a pre-trained deep reinforcement learning model, the method further comprises:
constructing a Markov decision process model, wherein the Markov decision process model is {S, A, R, T}, S representing state information of the product, A representing action information of actions performed by the platform users on the product, R representing a reward function, and T representing a state transition function;
obtaining a plurality of training samples based on the Markov decision process model; wherein each training sample comprises: historical state information of the product, action information of an action performed on the product, under that state information, by a target user among the platform users, an instant reward value obtained after the target user performs a target action in the action information, and next state information corresponding to that state information after the target action is performed; the target action being an ordering action or a forgoing-ordering action;
optimizing parameters of the initial Q function by using the training samples to obtain the trained deep Q network model; wherein the deep neural network corresponding to the initial Q function consists of two convolutional layers and two fully connected layers, and the parameters comprise: a learning rate, a discount factor, and a Q value.
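A rough sketch of the training step in claim 3, written in PyTorch as an assumption (the claims do not name a framework). For brevity, a small fully connected network stands in for the claimed two-convolutional-layer, two-fully-connected-layer architecture; the learning rate enters through the optimizer, the discount factor through `gamma`, and `candidate_actions` (a hypothetical name) is the set of possible action-information vectors.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Stand-in for the deep Q network: maps a flat (state, action information)
    vector to one estimated long-term feedback value."""
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def dqn_update(q_net, optimizer, batch, candidate_actions, gamma=0.9):
    """One optimization step over training samples
    (historical state, action information, instant reward, next state).

    candidate_actions: list of 1-D tensors, one per possible action information.
    """
    states, actions, rewards, next_states = batch            # batch-first tensors
    q_pred = q_net(torch.cat([states, actions], dim=1))      # Q(s, a)
    with torch.no_grad():
        # Bellman target: instant reward + discounted best value in the next state.
        next_q = torch.stack(
            [q_net(torch.cat([next_states, a.expand(len(next_states), -1)], dim=1))
             for a in candidate_actions], dim=1)
        target = rewards + gamma * next_q.max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The learning rate is a parameter of the optimizer, e.g.:
# q_net = QNet(input_dim=state_dim + action_dim)
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```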
4. The method of claim 3, wherein optimizing the parameters of the initial Q function by using the training samples to obtain the trained deep Q network model comprises:
optimizing the parameters of the initial Q function by using the training samples and an epsilon-greedy algorithm to obtain the trained deep Q network model.
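A minimal sketch of the epsilon-greedy selection referred to in claim 4; the value of `epsilon` here is an assumption, not specified by the claims.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit the
    action whose estimated Q value is largest.

    q_values: dict mapping action identifier -> estimated Q value.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)
```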
5. The method according to claim 3, wherein the instant reward value output by the reward function is: (the value corresponding to the ordering action) × (a first positive number × the benefit value added to the platform + a second positive number × the benefit value added to the platform user + a third positive number × the benefit value added to the user to be analyzed) + (the value corresponding to the forgoing-ordering action) × (a first negative number); and the value corresponding to the ordering action is 1 minus the value corresponding to the forgoing-ordering action.
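Reading claim 5 with the arithmetic made explicit, the reward could be sketched as below; the weights `c1`, `c2`, `c3` and `penalty` stand for the claimed first, second, and third positive numbers and the first negative number, and their default values here are placeholders.

```python
def instant_reward(ordered, platform_gain, platform_user_gain, user_gain,
                   c1=1.0, c2=1.0, c3=1.0, penalty=-1.0):
    """Instant reward in the form described by claim 5.

    ordered : 1 if the target action is the ordering action, 0 if it is forgone
    *_gain  : benefit value added to the platform, to the platform user, and to
              the user to be analyzed, respectively
    """
    forgone = 1 - ordered  # value corresponding to the forgoing-ordering action
    return ordered * (c1 * platform_gain + c2 * platform_user_gain + c3 * user_gain) \
        + forgone * penalty
```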
6. The method of claim 1, wherein the identifier of the ordering action for the product comprises one or more of: a first identifier of an ordering action performed based on a user recommendation, a second identifier of an ordering action performed based on a privacy recommendation, a third identifier of an ordering action performed based on a coupon activity, and a fourth identifier of an ordering action performed through an ordering entry in the platform.
7. The method according to any one of claims 1-6, wherein the characteristic information of the user to be analyzed comprises:
one or more of: account information of the user to be analyzed, a number of fans, a number of live-broadcast works, and a preferred work type.
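For concreteness only, one possible layout of the action information described in claims 6 and 7; all field names are illustrative assumptions, not taken from the claims.

```python
from dataclasses import dataclass

@dataclass
class ActionInfo:
    """Action information: characteristic information of the user to be analyzed
    plus an action identifier."""
    account_info: str          # account information
    fan_count: int             # number of fans
    live_work_count: int       # number of live-broadcast works
    preferred_work_type: str   # preferred type of works
    action_id: int             # e.g. one of the four claimed ordering-action
                               # identifiers, or an identifier for forgoing the order
```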
8. A potential user determination apparatus, the apparatus comprising:
a first obtaining module configured to obtain state information of a product in a platform; wherein the state information comprises: a first profit value brought to the platform by the product, a second profit value brought by the product to a platform user of the platform, and a third profit value brought by the product to a product user using the product;
an input module configured to input the state information and a plurality of pieces of action information of a user to be analyzed into a pre-trained deep reinforcement learning model to obtain an estimated long-term feedback value corresponding to each piece of action information; each piece of action information comprises at least characteristic information of the user to be analyzed and an action identifier, the action identifier being an identifier of an ordering action for the product or an identifier of a forgoing-ordering action; wherein the pre-trained deep reinforcement learning model is a deep Q network model obtained by optimizing parameters of an initial Q function with training samples, each training sample comprising historical state information of the product, action information of an action performed on the product, under that state information, by a target user among the platform users, an instant reward value obtained after the target user performs a target action in the action information, and next state information corresponding to that state information after the target action is performed, and the parameters comprising: a learning rate, a discount factor, and a Q value;
and a determining module configured to determine, according to the action corresponding to the maximum estimated value output by the deep reinforcement learning model, whether the user to be analyzed is a potential user of the product.
9. The apparatus of claim 8, wherein the deep reinforcement learning model comprises a deep Q network model.
10. The apparatus of claim 9, further comprising:
a building module configured to construct a Markov decision process model before the state information and the plurality of pieces of action information of the user to be analyzed are input into the pre-trained deep reinforcement learning model; wherein the Markov decision process model is {S, A, R, T}, S representing state information of the product, A representing action information of actions performed by the platform users on the product, R representing a reward function, and T representing a state transition function;
a second obtaining module configured to obtain a plurality of training samples based on the Markov decision process model; wherein each training sample comprises: historical state information of the product, action information of an action performed on the product, under that state information, by a target user among the platform users, an instant reward value obtained after the target user performs a target action in the action information, and next state information corresponding to that state information after the target action is performed; the target action being an ordering action or a forgoing-ordering action;
an optimization module configured to optimize parameters of the initial Q function by using the training samples to obtain the trained deep Q network model; wherein the deep neural network corresponding to the initial Q function consists of two convolutional layers and two fully connected layers, and the parameters comprise: a learning rate, a discount factor, and a Q value.
11. The apparatus of claim 10, wherein the optimization module is specifically configured to:
optimize the parameters of the initial Q function by using the training samples and an epsilon-greedy algorithm to obtain the trained deep Q network model.
12. The apparatus according to claim 10, wherein the instant reward value output by the reward function is: (the value corresponding to the ordering action) × (a first positive number × the benefit value added to the platform + a second positive number × the benefit value added to the platform user + a third positive number × the benefit value added to the user to be analyzed) + (the value corresponding to the forgoing-ordering action) × (a first negative number); and the value corresponding to the ordering action is 1 minus the value corresponding to the forgoing-ordering action.
13. The apparatus of claim 8, wherein the identifier of the ordering action for the product comprises one or more of: a first identifier of an ordering action performed based on a user recommendation, a second identifier of an ordering action performed based on a privacy recommendation, a third identifier of an ordering action performed based on a coupon activity, and a fourth identifier of an ordering action performed through an ordering entry in the platform.
14. The apparatus according to any one of claims 8-13, wherein the characteristic information of the user to be analyzed comprises:
one or more of: account information of the user to be analyzed, a number of fans, a number of live-broadcast works, and a preferred work type.
15. A server, comprising:
a processor; and a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 7.
16. A readable storage medium having instructions stored thereon which, when executed by a processor of a server, enable the server to perform the method of any one of claims 1 to 7.
CN201811526942.8A 2018-12-13 2018-12-13 Potential customer determination method, device, server and readable storage medium Active CN109711871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811526942.8A CN109711871B (en) 2018-12-13 2018-12-13 Potential customer determination method, device, server and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811526942.8A CN109711871B (en) 2018-12-13 2018-12-13 Potential customer determination method, device, server and readable storage medium

Publications (2)

Publication Number Publication Date
CN109711871A CN109711871A (en) 2019-05-03
CN109711871B (en) 2021-03-12

Family

ID=66255738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811526942.8A Active CN109711871B (en) 2018-12-13 2018-12-13 Potential customer determination method, device, server and readable storage medium

Country Status (1)

Country Link
CN (1) CN109711871B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027676B (en) * 2019-11-28 2022-03-18 支付宝(杭州)信息技术有限公司 Target user selection method and device
CN111382359B (en) * 2020-03-09 2024-01-12 北京京东振世信息技术有限公司 Service policy recommendation method and device based on reinforcement learning, and electronic equipment
CN112200610A (en) * 2020-10-10 2021-01-08 苏州创旅天下信息技术有限公司 Marketing information delivery method, system and storage medium
CN113129108B (en) * 2021-04-26 2023-05-30 山东大学 Product recommendation method and device based on Double DQN algorithm
CN113256390A (en) * 2021-06-16 2021-08-13 平安科技(深圳)有限公司 Product recommendation method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005918A (en) * 2015-07-24 2015-10-28 金鹃传媒科技股份有限公司 Online advertisement push method based on user behavior data and potential user influence analysis and push evaluation method thereof
CN108230058A (en) * 2016-12-09 2018-06-29 阿里巴巴集团控股有限公司 Products Show method and system
CN108305167A (en) * 2018-01-12 2018-07-20 华南理工大学 A kind of foreign currency trade method and system enhancing learning algorithm based on depth

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6429819B2 (en) * 2016-03-18 2018-11-28 ヤフー株式会社 Information providing apparatus and information providing method
CN107451832B (en) * 2016-05-30 2023-09-05 北京京东尚科信息技术有限公司 Method and device for pushing information
CN108230057A (en) * 2016-12-09 2018-06-29 阿里巴巴集团控股有限公司 A kind of intelligent recommendation method and system
CN108492146A (en) * 2018-03-30 2018-09-04 口口相传(北京)网络技术有限公司 Preferential value calculating method, server-side and client based on user-association behavior
CN108960929A (en) * 2018-07-16 2018-12-07 苏州大学 Consider the social networks marketing seed user choosing method that existing product influences

Also Published As

Publication number Publication date
CN109711871A (en) 2019-05-03

Similar Documents

Publication Publication Date Title
CN109711871B (en) Potential customer determination method, device, server and readable storage medium
US11200592B2 (en) Simulation-based evaluation of a marketing channel attribution model
CN107463701B (en) Method and device for pushing information stream based on artificial intelligence
US20110258045A1 (en) Inventory management
CN108777701B (en) Method and device for determining information audience
CN107463580B (en) Click rate estimation model training method and device and click rate estimation method and device
US11593860B2 (en) Method, medium, and system for utilizing item-level importance sampling models for digital content selection policies
CN112100489B (en) Object recommendation method, device and computer storage medium
US10657559B2 (en) Generating and utilizing a conversational index for marketing campaigns
CN110971659A (en) Recommendation message pushing method and device and storage medium
JP2011096255A (en) Ranking oriented cooperative filtering recommendation method and device
CN108734499B (en) Promotion information effect analysis method and device and computer readable medium
CN111798280B (en) Multimedia information recommendation method, device and equipment and storage medium
WO2021130771A1 (en) System and method of machine learning based deviation prediction and interconnected-metrics derivation for action recommendations
CN112785144A (en) Model construction method, device and storage medium based on federal learning
CN116739665A (en) Information delivery method and device, electronic equipment and storage medium
CN107527128B (en) Resource parameter determination method and equipment for advertisement platform
US20220108334A1 (en) Inferring unobserved event probabilities
CN107644042B (en) Software program click rate pre-estimation sorting method and server
CN110490682B (en) Method and device for analyzing commodity attributes
CN111260416A (en) Method and device for determining associated user of object
CN115345635A (en) Processing method and device for recommended content, computer equipment and storage medium
CN112015978B (en) Custom information sending method and device and electronic equipment
CN117114766A (en) Cost control factor determining method, device, equipment and storage medium
CN111506643B (en) Method, device and system for generating information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant