CN111046156A - Method and device for determining reward data and server

Info

Publication number
CN111046156A
Authority
CN
China
Prior art keywords: data, reward, model, preset, current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911199043.6A
Other languages
Chinese (zh)
Other versions
CN111046156B (en)
Inventor
张琳 (Zhang Lin)
梁忠平 (Liang Zhongping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911199043.6A
Publication of CN111046156A
Application granted
Publication of CN111046156B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3325 Reformulation based on results of preceding query
    • G06F 16/3326 Reformulation using relevance feedback from the user, e.g. relevance feedback on documents, document sets, document terms or passages
    • G06F 16/3328 Reformulation using relevance feedback with graphical result space presentation or visualisation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

This specification provides a method, an apparatus and a server for determining reward data. In one embodiment, the method first obtains click state data of a first sample user for the current tag, together with the current action policy data that a preset question model determines from that click state data; it then calls a pre-trained preset reward model to determine, from the click state data of the first sample user for the current tag and the current action policy data, the reward data for reinforcement learning that is fed back to the preset question model. In this way, reward data for reinforcement learning can be acquired quickly and accurately.

Description

Method and device for determining reward data and server
Technical Field
The specification belongs to the technical field of internet, and particularly relates to a method, a device and a server for determining reward data.
Background
In many scenarios (e.g., the customer-service response scenario of an APP), in order to improve the user experience, a pre-trained model is often used to automatically predict the specific question a user wants to ask based on the user's collected behavior data (e.g., the user's click operations on the displayed groups of tags). A corresponding answer to the predicted question is then retrieved and fed back to the user in time.
Such a model is usually obtained through reinforcement learning. In the process of training the model through reinforcement learning, suitable reward data must be fed back to the model, so that the reward data can continuously guide the model to find a better processing policy for predicting the target question the user wants to ask.
Therefore, a method of obtaining reward data for reinforcement learning is needed.
Disclosure of Invention
The specification provides a method, a device and a server for determining reward data so as to quickly and accurately acquire the reward data for reinforcement learning.
The method, the device and the server for determining the reward data are realized as follows:
A method of determining reward data, comprising: acquiring click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, wherein the current action policy data comprises: a click operation of the first sample user on a next group of tags, or a target question asked by the first sample user; and calling a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data fed back to the preset question model.
A method of determining reward data, comprising: acquiring current state data, and current action policy data determined by a preset processing model according to the current state data; and calling a preset reward model to determine, according to the current state data and the current action policy data, reward data fed back to the preset processing model.
An apparatus for determining reward data, comprising: an obtaining module configured to obtain click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on a next group of tags, or a target question asked by the first sample user; and a determining module configured to call a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data fed back to the preset question model.
A server comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, obtains click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on a next group of tags, or a target question asked by the first sample user; and calls a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data fed back to the preset question model.
A computer-readable storage medium having stored thereon computer instructions which, when executed, implement: obtaining click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on a next group of tags, or a target question asked by the first sample user; and calling a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data fed back to the preset question model.
According to the method, apparatus and server for determining reward data, the click state data of a first sample user for the current tag is obtained first, together with the current action policy data determined by a preset question model according to that click state data; a pre-trained preset reward model is then called to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model, which is used to perform reinforcement learning on the preset question model. In this way, reward data that gives a better training effect for reinforcement learning can be acquired quickly and accurately.
Drawings
In order to more clearly illustrate the embodiments of the present specification, the drawings needed to be used in the embodiments will be briefly described below, and the drawings in the following description are only some of the embodiments described in the present specification, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.
FIG. 1 is a diagram illustrating an embodiment of a method for determining reward data provided by an embodiment of the present disclosure, in an example scenario;
FIG. 2 is a diagram illustrating an embodiment of a method for determining reward data provided by an embodiment of the present disclosure, in an example scenario;
FIG. 3 is a diagram illustrating an embodiment of a method for determining reward data provided by an embodiment of the present disclosure, in an example scenario;
FIG. 4 is a flow diagram illustrating a method for determining reward data provided by one embodiment of the present description;
FIG. 5 is a schematic diagram of obtaining a preset reward model provided by one embodiment of the present description;
FIG. 6 is a schematic structural component diagram of a server provided in an embodiment of the present description;
FIG. 7 is a schematic structural diagram of an apparatus for determining reward data according to an embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
The embodiment of the specification provides a method for determining reward data, and the method for determining reward data can be particularly applied to a server of a data processing system.
In a specific implementation, the server may obtain and use click operation data of a first sample user for multiple groups of tags, together with the target questions of the first sample user, and train, through reinforcement learning, a preset question model that meets the requirements and can predict, from a user's click operations on the groups of tags, the target question the user wants to ask. In the reinforcement learning process, the server may first obtain the click state data of the first sample user for the current tag, and the current action policy data determined by the preset question model according to that click state data, where the current action policy data includes a click operation of the first sample user on the next group of tags, or a target question asked by the first sample user; it may then call a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model. The reward data obtained in this way can then continuously guide the training of the preset question model toward a better policy for accurately predicting the target question the user wants to ask, so that a preset question model meeting the requirements is obtained.
In this embodiment, the server may be a back-end service server deployed on the service-platform side and capable of functions such as data transmission and data processing. Specifically, the server may be an electronic device with data computation, storage and network interaction capabilities, or a software program running on such a device to support data processing, storage and network interaction. The number of servers is not limited in this embodiment: the server may be a single server, several servers, or a server cluster formed by several servers.
In one scenario example, as shown in FIG. 1, the method for determining reward data provided in the embodiments of the present disclosure may be applied to automatically determine suitable reward data, and a preset question model meeting the requirements may then be obtained by continuously performing reinforcement training with that reward data.
In this scenario example, network Company A plans to add an intelligent customer-service answering function to a certain mobile APP it has released, so as to answer problems that users run into when using the APP in a timely and rapid manner. To improve the user experience, Company A wants to train a preset question model for this customer-service response scenario. Through the model, the APP can guide the user, collect the user's behavior data while the question is being asked, and intelligently determine the question the user wants to ask without the user having to type it in directly, and then feed back the answer to the user according to the determined question.
Specifically, referring to FIG. 2, when the user taps the customer-service icon "my customer service" on the APP's home page on the mobile phone, a customer-service dialog interface is opened. The APP may then present the user with successive groups of tags according to the attribute information of the currently logged-in user (e.g., gender, age, transaction records, education background). According to the question the user wants to ask, the user can click one or more tags in each group displayed on the phone. The APP collects these click operations as the user's question-asking behavior data, and a preset question model determines, from this behavior data and a preset question bank, the standard question the user wants to ask, for example "how to query the transaction record". The APP may then search for an answer matching the standard question and feed it back to the user, as shown in FIG. 3.
A tag may be the name tag of a related service, such as "transaction service"; the name tag of an operation supported in the service, such as "query"; or the name tag of an object on which an operation in the service is performed, such as "trade order". Of course, the tags listed above are merely illustrative; in a specific implementation, tags with other content or in other forms may be used according to the specific service scenario and processing needs, and this specification is not limited in this respect.
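For concreteness, tag groups of the three kinds named above might be represented as simple lists, as in the following minimal sketch; the specific tag names and the grouping are illustrative assumptions rather than part of the described design.

    # Hypothetical example of the groups of tags presented to the user,
    # covering service-name, operation-name and operation-object tags.
    tag_groups = [
        ["transaction service", "account service", "payment service"],  # service name tags
        ["query", "modify", "cancel"],                                  # operation name tags
        ["trade order", "transaction record", "refund record"],         # operation-object tags
    ]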
To implement the above functions, a preset question model that can predict the user's next action and the question the user wants to ask must first be trained from the collected click operations of users on the displayed tags.
In this scenario example, the server may, in a reinforcement-learning manner, train and establish a preset question model that meets the requirements by using sample data acquired in advance, for example the click operation data of a first sample user participating in a test for a plurality of groups of tags and the target question provided by that first sample user.
Specifically, the server may first establish an initial policy model and use its processing policy to predict, from the first sample user's click operation data on the tags, the tag the first sample user will click next and, eventually, the target question the first sample user wants to ask. Reward data corresponding to each prediction result is then fed back to the model, so that after every prediction the initial policy model can keep optimizing the processing policy it uses according to the reward data it receives, gradually learning a more accurate processing policy with which the target question the user wants to ask can be predicted more accurately from the user's click operations on the tags.
In a specific implementation, the server may take the click state data of first sample user A for the currently presented first group of tags in the sample data (denoted S1, e.g., user A has clicked tag 1 and tag 2), together with user feature data of user A such as gender, age, occupation and monthly income, as the input of the initial policy model. Running the initial policy model, the model predicts, based on its current processing policy, user A's click operation on the next group of presented tags (for example, the second group): user A will click tag 4 and tag 5. This prediction is taken as the next-step action policy data (denoted a1) for user A's current tag click state.
Further, the server may determine corresponding reward data (denoted r1) for the action policy data predicted by the initial policy model under its original processing policy, so as to guide the initial policy model to keep improving and optimizing the processing policy it uses.
Specifically, the server may take the action policy data a1 predicted for user A by the initial policy model, together with the click state data S1 of user A's current tag, as the model input to a pre-trained preset reward model used for determining appropriate reward data, and run the preset reward model to obtain the model output corresponding to a1 and S1. This output, denoted r1, is the reward data fed back to the initial policy model for the state data S1 and the action policy data a1 adopted for S1. The initial policy model can subsequently learn from the reward data r1 and adjust the processing policy it used before, so as to adopt a better processing policy and predict the user's action policy data more accurately.
Further, the policy model may update the tag click state data of user A according to the predicted action policy data a1 to obtain user A's next tag click state data (denoted S2): user A has clicked tag 4 and tag 5. The server may then take the newly updated tag click state data S2 as the current tag click state and, together with user A's user features, input it to the initial policy model, and run the model to predict the next action policy data (denoted a2) for this state: user A asks how to query the transaction record. That is, the model predicts that the target question user A wants to ask is how to query the transaction record.
Similarly, the server takes the tag click state data S2 and the corresponding predicted action policy data a2 as model input to the preset reward model, and obtains the corresponding model output by running the preset reward model, as the reward data r2 fed back to the initial policy model for the state data S2 and the action policy data a2 adopted for S2.
The server can then adjust and optimize the processing policy used by the initial policy model according to the reward data r1 and r2, thereby completing reinforcement learning on the sample data of first sample user A.
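The two prediction rounds above can be summarised as a generic interaction loop between the policy (question) model and the reward model. The sketch below is a simplified illustration only: the predict()/score() interfaces and the dict-shaped state and action are assumptions made for readability, not the patent's concrete implementation.

    # Minimal sketch of one reinforcement-learning episode for sample user A.
    # policy_model and reward_model are assumed to expose predict()/score();
    # an action is assumed to be a dict whose "type" is "click_tags" or "ask_question".
    def run_episode(policy_model, reward_model, initial_state, user_features):
        state = dict(initial_state)      # e.g. S1: {"clicked": ["tag 1", "tag 2"]}
        trajectory = []                  # collected (state, action, reward) triples
        while True:
            # The question model predicts the next action policy data a_t
            # from the current tag click state and the user features.
            action = policy_model.predict(state, user_features)
            # The preset reward model maps (state, action) to reward data r_t.
            reward = reward_model.score(state, action)
            trajectory.append((state, action, reward))
            if action["type"] == "ask_question":   # e.g. a2: "how to query the transaction record"
                break
            # e.g. S1 -> S2 after the predicted clicks on tag 4 and tag 5
            state = {"clicked": state["clicked"] + action["tags"]}
        return trajectory   # r1, r2, ... are then used to adjust the processing policy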
In this way, the server can use the sample data to adjust and optimize the processing policy of the policy model repeatedly, until the error of the target question that the policy model finally determines for the first sample user is sufficiently small (for example, smaller than a preset error value). The reinforcement training is then complete, and a preset question model with high accuracy that meets the requirements is obtained.
Therefore, there is no need to rely on technicians to set the corresponding reward data manually according to their knowledge and experience, which avoids the situations in which manually set reward data is affected by the technicians' subjective factors (such as processing experience and knowledge background), takes discrete values, and is prone to error. The accuracy of the determined reward data is improved, and so is the efficiency of determining it.
In another example scenario, the server may first collect the tag click operation data of a second sample user participating in the test, together with the target question that the second sample user wants to ask, and train on these data to obtain a preset reward model that can determine appropriate reward data more accurately.
In this scenario, company a may organize a group of test users to click on multiple groups of tags displayed by APPs, respectively, to describe a target question that the test users want to ask, and finally explicitly input the target question that the test users want to ask. The server may collect the tag click operation of the user and the input target problem as click operation data of the second sample user for a plurality of groups of tags and a target problem of the second sample user. For example, it is collected that user B participating in the test clicked tab 2 and tab 3 in the first set of tabs presented, clicked tab 4 and tab 6 in the second set of tabs presented, and that user B entered the target question in the last dialog: how to query for sesame credits.
In a specific implementation, the server may first invoke an initial policy model to perform reinforcement learning on the tag click data and the target question of the second sample user, while an experienced technician manually sets, according to the corresponding preset reward rules, a reward parameter for each piece of action policy data that the initial question model determines from the state data of the second sample user. The server may then collect the technician's reward parameters for the multiple pieces of action policy data of the same second sample user and calculate a cumulative reward from them, construct from the cumulative reward a target loss function that contains the preset reward model, and finally determine the model parameters of the preset reward model by solving for the optimal value of the target loss function, thereby establishing the preset reward model.
In another scenario example, the server may further acquire a plurality of first reward parameters determined by the technician according to a preset reward rule and manually set for a plurality of action policy data determined based on the initial policy model. Meanwhile, the server can also establish an initial reward model, and determine a plurality of second reward parameters aiming at a plurality of action strategy data determined based on the initial strategy model by using the initial reward model. And the server can adjust the initial reward model for multiple times in a targeted manner according to the plurality of first reward parameters and the plurality of second reward parameters until the difference value between the first reward parameters and the second reward parameters is smaller than or equal to a preset difference threshold value, so that the preset reward model with high accuracy is obtained.
In another scenario example, the server may also gather historical training records when training the model historically through reinforcement learning. And extracting sample state data, sample action strategy data and sample reward data adopted corresponding to the sample state data and the sample action strategy data from the historical training records. Further, the sample state data, the sample action strategy data, and the sample reward data, which correspond to each other, may be used as a set of training data. In the above manner, sets of training data for training the reward model may be obtained from historical training records. And establishing a preset reward model which can determine proper reward data according to the state data and the action strategy data corresponding to the state data through model learning of the multiple groups of training data.
Of course, it should be noted that the ways of obtaining the preset reward model listed above are merely exemplary. In a specific implementation, one of them may be selected according to the specific situation and processing requirements, or another suitable way other than those listed above may be used to obtain the preset reward model. This is not described again here.
Referring to FIG. 4, the present specification provides a method for determining reward data that is applied on the server side. In a specific implementation, the method may include the following.
S401: acquiring click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on the next group of tags, or a target question asked by the first sample user.
In some embodiments, the click state data of the first sample user for the current tag may specifically be the click operation data, for the current one of the displayed groups of tags, of a user who participates in training the preset question model. For example, for the currently presented second group of tags, a first sample user clicked tag 4 and tag 5 but did not click the other tags in the group (e.g., tag 1, tag 2 and tag 3).
The tag may specifically include tag data that is associated with the question and can describe one or more related attribute features of the question. The preset question model may specifically include a model capable of predicting a next action of the user according to a click operation of the user on the displayed label, and a target question that the user wants to ask.
In some embodiments, the server may first present a plurality of different groups of tags to the first sample user one after another: for example, the first group of tags is displayed, then the second group, and so on. For the target question the first sample user wants to ask, the user selects and clicks, in each displayed group, the tags that are associated with the target question and can describe one or more of its related attribute features. The first sample user is also prompted to input the target question itself. In this way, the server can collect the first sample user's click operation data on the groups of tags and the corresponding target question of the first sample user, which serve as sample data for subsequently training and establishing the preset question model.
In some embodiments, the server may perform reinforcement learning by using the collected sample data (including the click operation data of the first sample user for the tag and the target question of the first sample user) to establish a preset questioning model with high accuracy and meeting requirements.
In a specific implementation, the server may first establish an initial policy model as the preset question model. The preset question model can randomly generate a corresponding processing policy and, based on that policy, predict the user's next action data (which may be recorded as action policy data) from the user's current tag click state data; the next action data may be the tags the user is likely to click next, the target question the user wants to ask, and so on.
Specifically, the server may take the obtained click state data of the first sample user for the current tag (for example, first sample user D selected and clicked tag 1 and tag 2 in the presented first group of tags and did not click the other tags in that group) as the model input to the preset question model. Running the preset question model produces, based on the processing policy the model currently holds, the model output corresponding to this click state data (for example, that first sample user D will select and click tag 6 and tag 7 in the second group of tags displayed next, and will not click the other tags in that group). This output serves as the corresponding current action policy data, i.e., a prediction of the action the sample user is most likely to take next.
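One plausible way of turning the tag click state and user features into a model input is a simple feature vector, as sketched below; the encoding (a multi-hot vector of clicked tags concatenated with numeric user attributes) and the example values are assumptions for illustration only.

    # Hypothetical encoding of the click state data plus user features
    # into a numeric input vector for the preset question model.
    def encode_state(clicked_tags, all_tags, user_features):
        tag_vector = [1.0 if tag in clicked_tags else 0.0 for tag in all_tags]
        return tag_vector + list(user_features)   # e.g. [age, gender_code, monthly_income]

    # Example: first sample user D clicked tag 1 and tag 2 in the first group.
    x = encode_state({"tag 1", "tag 2"},
                     ["tag 1", "tag 2", "tag 3", "tag 4", "tag 5"],
                     [28, 1, 12000.0])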
However, because the preset question model makes its predictions based on a randomly generated processing policy that has not yet learned from the sample data, the accuracy of the policy itself is low and the error in the action policy data predicted with it is often relatively large, so the preset question model does not yet meet the requirements.
In this embodiment, based on reinforcement learning, the preset question model may be used to predict the corresponding action policy data from the first sample user's tag click state data according to the processing policy it holds; reward data is then determined from the predicted action policy data combined with the corresponding tag click state data of the first sample user. That reward data can then guide the preset question model to keep optimizing and improving the processing policy it uses, so that the user's action policy data can be predicted more accurately from the improved processing policy and the user's tag click state data, and the target question the user wants to ask can be determined.
The reward data may specifically be parameter data that guides the learning direction of reinforcement learning. With the reward data, during reinforcement learning the model can, with relatively high probability, learn and improve automatically in the direction of a better processing policy, continuously adjusting and optimizing the processing policy it uses.
Usually, when determining reward data, a technician relies on his or her knowledge background and processing experience to manually set an appropriate reward value, as the reward data fed back to the preset question model, for the action policy data the model predicts, under its current processing policy, from each user's tag click state data.
Reward data determined in this way is easily affected by the technicians' subjective factors (including their knowledge background, processing experience and the like), so it is often not accurate or stable enough and is prone to error. Moreover, because each specific reward value is set manually, the resulting reward data is discrete and discontinuous in value, and its effect in guiding the preset question model to optimize and improve the processing policy it uses is not ideal.
In this embodiment, in a specific implementation, the server may use a pre-trained preset reward model to automatically and accurately determine appropriate reward data according to the click state data of the first sample user for the current tag, which is input to the preset questioning model each time, and the current action policy data determined by the preset questioning model based on the click state data for the current tag, without depending on the manual setting of reward data by a technician.
S403: and calling a preset reward model to determine reward data fed back to the preset question model according to the click state data of the first sample user aiming at the current label and the current action strategy data.
The preset reward model may specifically include a model which is trained in advance and can determine reward data fed back to the model for reinforcement learning training according to click state data of a user for a tag and action strategy data predicted according to the click state data.
In some embodiments, the server may take the click state data of the first sample user for the currently displayed tag, together with the corresponding current action policy data that the preset question model determined from that click state data, as a group of model inputs to the preset reward model. Running the preset reward model yields the model output corresponding to this group of inputs, which serves as the reward data for the processing policy the preset question model adopted when it determined the current action policy data from the first sample user's click state data for the currently displayed tag.
In some embodiments, after the reward data fed back to the preset question model is obtained in the above manner, the server may further use the reward data to perform reinforcement learning on the preset question model, so that the model continuously revises and optimizes the processing policy it uses according to the reward data. When the preset question model subsequently predicts the action policy corresponding to the same or similar tag click state data, it can determine the corresponding action policy data more accurately with the revised and optimized processing policy, which improves the precision of the preset question model.
As can be seen from the above, in the method for determining reward data provided in the embodiments of the present specification, click state data of a first sample user for a current tag is obtained first, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag is obtained; and determining reward data fed back to the preset questioning model according to the click state data of the first sample user aiming at the current label and the current action strategy data by calling the preset reward model trained in advance, so as to perform reinforcement learning on the preset questioning model. Therefore, the reward data for reinforcement learning can be rapidly and accurately acquired.
In some embodiments, the preset reward model may be obtained as follows: acquiring click operation data of a second sample user for multiple groups of tags and the target question of the second sample user as sample data; and learning from the sample data to obtain the preset reward model.
In some embodiments, in implementation, the server may display a plurality of groups of tags to the second sample user participating in the test in advance, and guide the second sample user to describe a target question that the second sample user wants to ask by clicking on the related tags in each group, so that click operation data of the second sample user for the plurality of groups of tags may be collected and obtained. Meanwhile, the server can guide the user to input the target question to be asked at last or at the beginning, so that the server can simultaneously acquire the target question corresponding to the click operation data of the second sample user for the plurality of groups of labels. And combining the click operation data of the second sample user for the plurality of groups of labels and the target problem corresponding to the click operation data of the second sample user for the plurality of groups of labels to form a group of data. According to the mode, a plurality of groups of data can be obtained and used as sample data for training the preset reward model.
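A sample record for training the reward model might simply pair the second sample user's click operations per tag group with the explicitly entered target question, as in the following sketch; the field names and structure are illustrative assumptions only.

    # Hypothetical structure of one group of sample data for the reward model.
    sample_record = {
        "user_id": "B",
        "clicks_per_group": [["tab 2", "tab 3"],    # first group of tags shown
                             ["tab 4", "tab 6"]],   # second group of tags shown
        "target_question": "how to query for sesame credits",
    }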
Furthermore, a preset reward model can be established and obtained by learning the sample data.
In a specific implementation, learning from the sample data to obtain the preset reward model may include the following steps: determining, according to the sample data, a plurality of reward parameters for a plurality of pieces of action policy data determined based on the preset question model; determining a cumulative reward based on the plurality of reward parameters; constructing a target loss function according to the cumulative reward; and establishing the preset reward model according to the target loss function.
In a specific implementation, the server may first establish an initial policy model to serve as the preset question model and invoke it to perform reinforcement learning on the sample data, while a technician manually sets, in each round and according to the corresponding preset reward rule, a reward value as the reward parameter for the action policy data predicted by the initial policy model from the click operation data of the second sample user in the sample data. The server may then collect the reward parameters the technician gives for the multiple rounds of predictions of the initial policy model and determine the corresponding cumulative reward, establish a target loss function containing the preset reward model according to the cumulative reward, and determine and establish the corresponding preset reward model by solving for the optimal value (e.g., the minimum) of the target loss function.
In particular, the cumulative reward may be determined as follows:

G_t = Σ_{k=0}^{T−t} γ^k · r_{t+k}

where G_t denotes the cumulative reward; r_t denotes the reward parameter set for the action policy data predicted in the t-th round for the click operation data of the second sample user in the sample data; γ denotes the discount factor (γ^k being its k-th power); and T denotes the total number of rounds of action policy prediction performed on the click operation data of the second sample user in the sample data.
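A short sketch of computing the cumulative reward directly from this formula is given below; the function name and the default discount factor of 0.9 are assumptions used only for illustration.

    # Discounted cumulative reward G_t = sum_{k=0}^{T-t} gamma^k * r_{t+k},
    # where rewards[t] is the reward parameter of the t-th prediction round.
    def cumulative_reward(rewards, t, gamma=0.9):
        return sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))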
In this embodiment, the preset reward model may first be denoted as R. A target loss function containing the preset reward model R to be determined may then be established according to the obtained cumulative reward.
Specifically, a corresponding target loss function containing the preset reward model may be constructed from the cumulative reward in the following manner:

L1(σ) = (R(s_t, a_t; σ) − sigmoid(G_t))^2

where σ denotes the model parameters of the preset reward model; R denotes the preset reward model; s_t denotes the click operation data (or click state data) of the second sample user that is input in the t-th prediction round of the preset question model; a_t denotes the action policy data predicted for s_t in that round; sigmoid() denotes the sigmoid activation function; and L1 denotes the target loss function.
The sigmoid activation function may be expressed in the following form:

sigmoid(x) = 1 / (1 + e^(−x))
after the target loss function is obtained, a preset reward obtaining model can be established according to the target loss function. Specifically, the optimal value (e.g., the minimum value) of the objective loss function may be continuously searched and solved to gradually calculate and determine each model parameter σ in the preset reward model R, so as to determine that the preset reward model is obtained.
Of course, it should be noted that the way of obtaining the preset reward model listed above is merely an exemplary illustration. In a specific implementation, other suitable ways of establishing and acquiring the preset reward model may also be adopted according to the specific application scenario and processing requirements.
In some embodiments, in implementation, the sample data may be learned in the following manner to establish a preset reward model.
Establishing an initial reward model; according to the sample data and a preset reward rule, determining a plurality of first reward parameters aiming at a plurality of action strategy data determined based on a preset question model; determining a plurality of second reward parameters aiming at a plurality of action strategy data determined based on a preset question model according to the initial reward model; and adjusting the initial reward model according to the first reward parameters and the second reward parameters to obtain the preset reward model.
In this embodiment, a first reward parameter may be a reward value manually set by a technician, according to a preset reward rule, for the action policy data that the preset question model (for example, the initial policy model) predicts from the tag click operation data of the second sample user.
In this embodiment, a second reward parameter may be a reward value determined by the initial reward model for the action policy data that the preset question model predicts from the tag click operation data of the same second sample user.
Further, the server may compare the first reward parameter and the second reward parameter corresponding to the same predicted action policy data, and continuously adjust the model parameters in the initial reward model according to the comparison result until the difference value between the second reward parameter and the first reward parameter obtained based on the adjusted reward model is smaller than a preset difference threshold value, and then determine that the current reward model meets the precision requirement, and determine that the current reward model is the preset reward model.
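This adjustment procedure can be sketched as an iterative calibration loop: compare the second reward parameters produced by the model with the manually set first reward parameters and keep adjusting until the gap falls below the preset difference threshold. The score() and adjust() interfaces, the threshold value and the iteration cap are hypothetical placeholders, not part of the described design.

    # Hypothetical calibration of the initial reward model against the
    # technician-set first reward parameters.
    def calibrate(reward_model, samples, first_rewards, threshold=0.05, max_iters=1000):
        for _ in range(max_iters):
            second_rewards = [reward_model.score(s, a) for s, a in samples]
            diffs = [abs(r2 - r1) for r1, r2 in zip(first_rewards, second_rewards)]
            if max(diffs) <= threshold:       # difference within the preset threshold
                break                         # current model becomes the preset reward model
            reward_model.adjust(samples, first_rewards)   # move parameters toward the targets
        return reward_model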
In some embodiments, referring to FIG. 5, the server may also obtain the preset reward model in another way. Specifically, the server may collect the historical training records produced when models were previously trained through reinforcement learning, and extract from them the sample state data, the sample action policy data, and the sample reward data that was adopted for that state and action. Each triple of mutually corresponding sample state data, sample action policy data and sample reward data is used as one group of training data, so that multiple groups of training data for training the reward model are obtained from the historical training records. Through model learning on these groups of training data, a preset reward model that can determine appropriate reward data from state data and the corresponding action policy data is established.
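With historical records available, building the preset reward model reduces to ordinary supervised learning on (sample state, sample action, sample reward) triples. The record field names below are illustrative assumptions; the fitting step itself can reuse the same kind of procedure sketched earlier, with the recorded sample reward as the regression target.

    # Hypothetical extraction of training data from historical
    # reinforcement-learning records; each record is assumed to expose
    # the state, the action, and the reward that was actually adopted.
    def build_training_set(history_records):
        return [(rec["sample_state"], rec["sample_action"], rec["sample_reward"])
                for rec in history_records]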
In some embodiments, the tags may specifically include: name tags of services, name tags of operations in a service, name tags of the objects on which operations in a service are performed, and the like. It should be understood that the tags listed above are only illustrative; in a specific implementation, tags of other types and contents can be introduced according to the specific application scenario, and this specification is not limited in this respect.
Embodiments of the present specification further provide a server comprising a processor and a memory for storing processor-executable instructions, where the processor, when executing the instructions, may perform the following steps: acquiring click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on the next group of tags, or a target question asked by the first sample user; and calling a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data fed back to the preset question model.
In order to more accurately complete the above instructions, referring to fig. 6, another specific server is provided in the embodiments of the present specification, where the server includes a network communication port 601, a processor 602, and a memory 603, and the above structures are connected by an internal cable, so that the structures may perform specific data interaction.
The network communication port 601 may specifically be configured to obtain click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on the next group of tags, or a target question asked by the first sample user.
The processor 602 may specifically be configured to call a preset reward model and determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data fed back to the preset question model.
The memory 603 may be specifically configured to store a corresponding instruction program.
In this embodiment, the network communication port 601 may be a virtual port bound with different communication protocols, so that different data can be sent or received. For example, the network communication port may be port No. 80 responsible for web data communication, port No. 21 responsible for FTP data communication, or port No. 25 responsible for mail data communication. In addition, the network communication port can also be a communication interface or a communication chip of an entity. For example, it may be a wireless mobile network communication chip, such as GSM, CDMA, etc.; it can also be a Wifi chip; it may also be a bluetooth chip.
In this embodiment, the processor 602 may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The description is not intended to be limiting.
In this embodiment, the memory 603 may include multiple layers, and in a digital system, the memory may be any memory as long as binary data can be stored; in an integrated circuit, a circuit without a physical form and with a storage function is also called a memory, such as a RAM, a FIFO and the like; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card and the like.
Based on the above method for determining reward data, the present specification further provides a computer storage medium storing computer program instructions which, when executed, implement: acquiring click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on the next group of tags, or a target question asked by the first sample user; and calling a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data fed back to the preset question model.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard disk (Hard disk drive, HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.
The embodiment of the present specification further provides a method for determining reward data, which is specifically applied to a server side, and when the method is specifically applied, the following contents may be included.
S1: acquiring current state data and current action strategy data determined by a preset processing model according to the current state data;
S2: calling a preset reward model to determine, according to the current state data and the current action policy data, reward data fed back to the preset processing model.
In this embodiment, the current state data may be data characterizing the current state of a target object. For example, it may be the user's current click operation data for the displayed tags; current environmental data such as the city a traveler is currently in, that city's weather and its traffic; or a patient's current blood glucose and blood pressure values. Of course, the current state data listed above is only illustrative; in a specific implementation, data of other contents and types may also be introduced as current state data according to the specific application scenario and processing needs, and this specification is not limited in this respect.
In some embodiments, the current action policy data may be specifically understood as action data that is predicted based on the processing policy and is likely to be taken by the target object based on the current state data. And the current action strategy data corresponds to the current state data. For example, if the current state data is the click operation data of the user currently for the displayed label. Accordingly, the current action policy data may be click operation data of the user for the next set of presented tags, or a target question that the user wants to ask. And if the current state data is environment data such as the current urban position of the passenger, the weather of the city, the traffic of the city and the like. Accordingly, the current action policy data may be the mode of transportation or the like that the passenger will select next.
In some embodiments, the current state data may be used as the model input to the preset processing model, and the preset processing model is run to obtain the corresponding model output as the current action policy data corresponding to the current state data.
In some embodiments, the preset processing model may specifically include a model capable of predicting corresponding current action policy data according to the current state data based on the grasped processing policy.
In some embodiments, after a preset reward model is called and reward data fed back to a preset processing model is determined according to the current state data and the current action policy data, when the method is implemented, the following may be further included: and performing reinforcement learning on the preset processing model according to the reward data to obtain a preset processing model meeting the requirement, wherein the preset processing model meeting the requirement is used for determining an action strategy corresponding to the state data according to the state data.
Through this embodiment, a trained preset reward model can take the place of technicians: for the current action policy data that the preset processing model determines from the current state data, it produces the corresponding reward data and feeds it back to the preset processing model, so that reinforcement learning can be performed on the preset processing model according to the reward data, and a processing model with high accuracy and good effect is obtained.
Referring to fig. 7, on a software level, the present specification further provides a device for determining reward data, which may specifically include the following structural modules.
The obtaining module 701 may specifically be configured to obtain click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on the next group of tags, or a target question asked by the first sample user;
The determining module 703 may specifically be configured to call a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data fed back to the preset question model.
In some embodiments, the apparatus may further include a reinforcement learning module, which is specifically configured to perform reinforcement learning on the preset question model according to the reward data to obtain a preset question model meeting requirements, where the preset question model meeting requirements is used to predict the target question of the user according to click operation data of the user for multiple sets of tags.
In some embodiments, the apparatus may further include an establishing module, which may be specifically configured to establish a preset reward model, where the establishing module may specifically include the following structural units:
the acquisition unit is specifically configured to acquire, as sample data, click operation data of a second sample user for multiple groups of tags and the target questions raised by the second sample user;
the learning unit may be specifically configured to learn the sample data to obtain a preset reward model.
In some embodiments, the learning unit may be specifically configured to: determine, according to the sample data, a plurality of reward parameters for a plurality of action policy data determined based on the preset question model; determine a cumulative reward based on the plurality of reward parameters; construct a target loss function according to the cumulative reward; and establish the preset reward model according to the target loss function.
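A hedged sketch of these steps, under common reinforcement-learning assumptions (the discount factor and the squared-error form of the loss below are not taken from this specification):

```python
def cumulative_reward(reward_parameters, gamma=0.99):
    """Fold per-step reward parameters into one cumulative (discounted) reward."""
    total, discount = 0.0, 1.0
    for r in reward_parameters:
        total += discount * r
        discount *= gamma
    return total

def target_loss(predicted_return, observed_return):
    """One possible target loss built from the cumulative reward."""
    diff = predicted_return - observed_return
    return diff * diff

# Example: rewards assigned to a sequence of action policy data.
print(cumulative_reward([0.0, 0.0, 1.0]))  # ~0.9801
```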
In some embodiments, the tag may specifically include: a name tag of a service, a name tag of an operation in the service, a name tag of an object on which an operation in the service is executed, and the like.
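Purely as an illustration of the three kinds of tags named above (every field name and value here is invented for this sketch, not taken from the specification):

```python
# Hypothetical tag group mixing the three tag kinds described above.
tag_group = [
    {"kind": "service_name",     "text": "account transfer"},
    {"kind": "operation_name",   "text": "cancel transfer"},
    {"kind": "operation_object", "text": "pending order"},
]
```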
In some embodiments, the learning unit may be further configured to: establish an initial reward model; determine, according to the sample data and a preset reward rule, a plurality of first reward parameters for a plurality of action policy data determined based on the preset question model; determine, according to the initial reward model, a plurality of second reward parameters for the same plurality of action policy data; and adjust the initial reward model according to the first reward parameters and the second reward parameters to obtain the preset reward model.
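One way such an adjustment could be sketched (the regression-style fitting below, the mean-squared-error criterion, and all names are assumptions rather than the claimed procedure) is to fit the initial reward model so that the second reward parameters it predicts approach the first reward parameters derived from the preset reward rule:

```python
import torch
import torch.nn as nn

def adjust_reward_model(initial_reward_model: nn.Module,
                        features: torch.Tensor,              # (state, action) features from sample data
                        first_reward_params: torch.Tensor,   # rule-derived reward parameters
                        lr: float = 1e-3,
                        epochs: int = 100) -> nn.Module:
    """Adjust the initial reward model toward the rule-derived reward parameters."""
    optimizer = torch.optim.Adam(initial_reward_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # the squared-error criterion is an assumption
    for _ in range(epochs):
        second_reward_params = initial_reward_model(features).squeeze(-1)
        loss = loss_fn(second_reward_params, first_reward_params)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return initial_reward_model  # the adjusted model serves as the preset reward model
```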
It should be noted that, the units, devices, modules, etc. illustrated in the above embodiments may be implemented by a computer chip or an entity, or implemented by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. It is to be understood that, in implementing the present specification, functions of each module may be implemented in one or more pieces of software and/or hardware, or a module that implements the same function may be implemented by a combination of a plurality of sub-modules or sub-units, or the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
As can be seen from the above, in the device for determining reward data provided in the embodiments of the present specification, the obtaining module obtains the click state data of the first sample user for the current tag and the current action policy data determined by the preset question model according to that click state data; the determining module then calls a pre-trained preset reward model and determines, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data to be fed back to the preset question model, so that reinforcement learning can be performed on the preset question model. In this way, the reward data used for reinforcement learning can be obtained quickly and accurately.
Although the present specification provides method steps as described in the embodiments or flowcharts, more or fewer steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution. When an apparatus or client product is executed in practice, the steps may be executed sequentially or in parallel according to the methods shown in the embodiments or figures (for example, in a parallel-processor or multithreaded environment, or even in a distributed data processing environment). The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded. The terms first, second, and the like are used to denote names and do not denote any particular order.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present specification can be implemented by software plus necessary general hardware platform. With this understanding, the technical solutions in the present specification may be essentially embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments in the present specification.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The description is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
While the specification has been described with examples, those skilled in the art will appreciate that there are numerous variations and permutations of the specification that do not depart from the spirit of the specification, and it is intended that the appended claims include such variations and modifications that do not depart from the spirit of the specification.

Claims (16)

1. A method of determining reward data, comprising:
acquiring click state data of a first sample user for a current tag, and current action strategy data determined by a preset question model according to the click state data of the first sample user for the current tag, wherein the current action strategy data comprises: a click operation of the first sample user on a next group of tags, or a target question raised by the first sample user;
and calling a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action strategy data, reward data to be fed back to the preset question model.
2. The method according to claim 1, wherein after calling the preset reward model to determine the reward data to be fed back to the preset question model according to the click state data of the first sample user for the current tag and the current action strategy data, the method further comprises:
and performing reinforcement learning on the preset question model according to the reward data to obtain a preset question model meeting requirements, wherein the preset question model meeting the requirements is used for predicting the target question of the user according to the click operation data of the user for multiple groups of tags.
3. The method of claim 1, wherein the preset reward model is obtained by:
acquiring, as sample data, click operation data of a second sample user for multiple groups of tags and target questions raised by the second sample user;
and learning the sample data to obtain a preset reward model.
4. The method of claim 3, learning the sample data to obtain a preset reward model, comprising:
determining, according to the sample data, a plurality of reward parameters for a plurality of action strategy data determined based on the preset question model;
determining a cumulative reward based on the plurality of reward parameters;
constructing a target loss function according to the cumulative reward;
and establishing the preset reward model according to the target loss function.
5. The method of claim 1, the tag comprising: a name tag of a service, a name tag of an operation in the service, and a name tag of an object on which an operation in the service is executed.
6. The method of claim 3, obtaining a preset reward model by learning the sample data, further comprising:
establishing an initial reward model;
determining, according to the sample data and a preset reward rule, a plurality of first reward parameters for a plurality of action strategy data determined based on the preset question model;
determining, according to the initial reward model, a plurality of second reward parameters for the plurality of action strategy data determined based on the preset question model;
and adjusting the initial reward model according to the first reward parameters and the second reward parameters to obtain the preset reward model.
7. A method of determining reward data, comprising:
acquiring current state data and current action strategy data determined by a preset processing model according to the current state data;
and calling a preset reward model to determine reward data fed back to the preset processing model according to the current state data and the current action strategy data.
8. The method of claim 7, after invoking a preset reward model to determine reward data to be fed back to a preset processing model according to the current state data and the current action policy data, the method further comprising:
and performing reinforcement learning on the preset processing model according to the reward data to obtain a preset processing model meeting the requirement, wherein the preset processing model meeting the requirement is used for determining an action strategy corresponding to the state data according to the state data.
9. A reward data determination apparatus comprising:
the obtaining module is configured to obtain click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data includes: a click operation of the first sample user on a next group of tags, or a target question raised by the first sample user;
and the determining module is configured to call a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, reward data to be fed back to the preset question model.
10. The device according to claim 9, further comprising a reinforcement learning module, configured to perform reinforcement learning on the preset question model according to the reward data to obtain a preset question model meeting requirements, where the preset question model meeting requirements is used to predict a target question of a user according to click operation data of the user for multiple sets of tags.
11. The apparatus of claim 10, further comprising a building module for building a predetermined reward model, the building module comprising:
the acquisition unit is configured to acquire, as sample data, click operation data of a second sample user for multiple groups of tags and target questions raised by the second sample user;
and the learning unit is used for learning the sample data to acquire a preset reward model.
12. The device according to claim 11, wherein the learning unit is specifically configured to: determine, according to the sample data, a plurality of reward parameters for a plurality of action policy data determined based on the preset question model; determine a cumulative reward based on the plurality of reward parameters; construct a target loss function according to the cumulative reward; and establish the preset reward model according to the target loss function.
13. The apparatus of claim 9, the tag comprising: a name tag of a service, a name tag of an operation in the service, and a name tag of an object on which an operation in the service is executed.
14. The apparatus according to claim 11, wherein the learning unit is further configured to: establish an initial reward model; determine, according to the sample data and a preset reward rule, a plurality of first reward parameters for a plurality of action strategy data determined based on the preset question model; determine, according to the initial reward model, a plurality of second reward parameters for the plurality of action strategy data; and adjust the initial reward model according to the first reward parameters and the second reward parameters to obtain the preset reward model.
15. A server comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 6.
16. A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 6.
CN201911199043.6A 2019-11-29 2019-11-29 Method, device and server for determining rewarding data Active CN111046156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911199043.6A CN111046156B (en) 2019-11-29 2019-11-29 Method, device and server for determining rewarding data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911199043.6A CN111046156B (en) 2019-11-29 2019-11-29 Method, device and server for determining rewarding data

Publications (2)

Publication Number Publication Date
CN111046156A true CN111046156A (en) 2020-04-21
CN111046156B CN111046156B (en) 2023-10-13

Family

ID=70233563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911199043.6A Active CN111046156B (en) 2019-11-29 2019-11-29 Method, device and server for determining rewarding data

Country Status (1)

Country Link
CN (1) CN111046156B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170076201A1 (en) * 2015-09-11 2017-03-16 Google Inc. Training reinforcement learning neural networks
US20170364831A1 (en) * 2016-06-21 2017-12-21 Sri International Systems and methods for machine learning using a trusted model
US20180032841A1 (en) * 2016-08-01 2018-02-01 Siemens Healthcare Gmbh Medical Scanner Teaches Itself To Optimize Clinical Protocols And Image Acquisition
DE102017011544A1 (en) * 2016-12-14 2018-06-14 Fanuc Corporation Control and machine learning device
JP2018097680A (en) * 2016-12-14 2018-06-21 ファナック株式会社 Control system and machine learning device
US20200175364A1 (en) * 2017-05-19 2020-06-04 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
CN110263979A (en) * 2019-05-29 2019-09-20 阿里巴巴集团控股有限公司 Method and device based on intensified learning model prediction sample label
CN110263136A (en) * 2019-05-30 2019-09-20 阿里巴巴集团控股有限公司 The method and apparatus for pushing object to user based on intensified learning model
CN110390108A (en) * 2019-07-29 2019-10-29 中国工商银行股份有限公司 Task exchange method and system based on deeply study

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507721A (en) * 2020-11-27 2021-03-16 北京百度网讯科技有限公司 Method, device and equipment for generating text theme and computer readable storage medium
CN112507721B (en) * 2020-11-27 2023-08-11 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for generating text theme
CN116303949A (en) * 2023-02-24 2023-06-23 科讯嘉联信息技术有限公司 Dialogue processing method, dialogue processing system, storage medium and terminal
CN116303949B (en) * 2023-02-24 2024-03-19 科讯嘉联信息技术有限公司 Dialogue processing method, dialogue processing system, storage medium and terminal

Also Published As

Publication number Publication date
CN111046156B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110070391B (en) Data processing method and device, computer readable medium and electronic equipment
US8775332B1 (en) Adaptive user interfaces
CN107463701B (en) Method and device for pushing information stream based on artificial intelligence
CN109816483B (en) Information recommendation method and device and readable storage medium
US11094008B2 (en) Debt resolution planning platform for accelerating charge off
CN110322169B (en) Task issuing method and device
CN117203612A (en) Intelligent generation and management of computing device application updated estimates
CN114355793B (en) Training method and device for automatic driving planning model for vehicle simulation evaluation
CN112669084B (en) Policy determination method, device and computer readable storage medium
CN111460384A (en) Policy evaluation method, device and equipment
CN111046156B (en) Method, device and server for determining rewarding data
CN110263136B (en) Method and device for pushing object to user based on reinforcement learning model
CN112069294A (en) Mathematical problem processing method, device, equipment and storage medium
JP6890746B2 (en) Information processing device, information processing method
CN113836388A (en) Information recommendation method and device, server and storage medium
CN113240323B (en) Level evaluation method and device based on machine learning and related equipment
US20210374619A1 (en) Sequential machine learning for data modification
CN114298870A (en) Path planning method and device, electronic equipment and computer readable medium
CN114282940A (en) Method and apparatus for intention recognition, storage medium, and electronic device
CN112749214A (en) Updating method, device and medium of interactive content display mode and electronic equipment
CN117435516B (en) Test case priority ordering method and system
CN115640896B (en) Household user power load prediction method under multi-user scene and related equipment
US20230196477A1 (en) Solver-based media assignment for content moderation
CN117745419A (en) Automatic operation method, device, equipment and computer readable storage medium
CN113032034A (en) Method, device, server and storage medium for controlling application program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant