CN111046156B - Method, device and server for determining reward data - Google Patents

Method, device and server for determining reward data

Info

Publication number
CN111046156B
Authority
CN
China
Prior art keywords
data
model
preset
rewarding
current
Prior art date
Legal status
Active
Application number
CN201911199043.6A
Other languages
Chinese (zh)
Other versions
CN111046156A (en)
Inventor
张琳
梁忠平
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911199043.6A
Publication of CN111046156A
Application granted
Publication of CN111046156B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3325 Reformulation based on results of preceding query
    • G06F 16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
    • G06F 16/3328 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages using graphical result space presentation or visualisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

This specification provides a method, a device, and a server for determining reward data. In one embodiment, the method first obtains click state data of a first sample user for the current tag, together with the current action policy data that a preset question model determines from that click state data; it then calls a pre-trained preset reward model to determine, from the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model for reinforcement learning. In this way, reward data for reinforcement learning can be obtained quickly and accurately.

Description

Method, device and server for determining reward data
Technical Field
This specification belongs to the technical field of the Internet, and in particular relates to a method, a device, and a server for determining reward data.
Background
In many scenarios (for example, the customer service reply scenario of an app), a pre-trained model is often used to automatically predict the specific question a user wants to ask based on collected user behavior data (for example, the user's click operations on the displayed groups of tags), so that a matching answer can be found and fed back to the user in time, improving the user experience.
Such a model is usually obtained through reinforcement learning. When training the model through reinforcement learning, suitable reward data needs to be fed back to the model, so that the reward data can continuously guide the model toward a better processing policy for predicting the target question the user wants to ask.
Accordingly, there is a need for a method of obtaining reward data for reinforcement learning.
Disclosure of Invention
This specification provides a method, a device, and a server for determining reward data, so that reward data for reinforcement learning can be obtained quickly and accurately.
The method, device, and server for determining reward data provided in this specification are implemented as follows.
A method of determining reward data, comprising: obtaining click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, wherein the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user; and calling a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model.
A method of determining reward data, comprising: obtaining current state data and current action policy data determined by a preset processing model according to the current state data; and calling a preset reward model to determine, according to the current state data and the current action policy data, the reward data fed back to the preset processing model.
A reward data determining apparatus, comprising: an obtaining module, configured to obtain click state data of a first sample user for a current tag and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, wherein the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user; and a determining module, configured to call a preset reward model and determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model.
A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, obtains click state data of a first sample user for a current tag and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, wherein the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user; and calls a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model.
A computer-readable storage medium having stored thereon computer instructions that, when executed, obtain click state data of a first sample user for a current tag and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, wherein the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user; and call a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model.
In the method, device, and server for determining reward data provided in this specification, click state data of a first sample user for the current tag is first obtained, together with the current action policy data that a preset question model determines according to that click state data; a pre-trained preset reward model is then called to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model, and the preset question model is trained with this reward data through reinforcement learning. In this way, reward data that works well for reinforcement learning can be obtained quickly and accurately.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure, the drawings that are required for the embodiments will be briefly described below, in which the drawings are only some of the embodiments described in the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an example scenario applying the method for determining reward data provided by an embodiment of this specification;
FIG. 2 is a schematic diagram of an example scenario applying the method for determining reward data provided by an embodiment of this specification;
FIG. 3 is a schematic diagram of an example scenario applying the method for determining reward data provided by an embodiment of this specification;
FIG. 4 is a flow chart of the method for determining reward data provided by an embodiment of this specification;
FIG. 5 is a schematic diagram of obtaining the preset reward model provided by an embodiment of this specification;
FIG. 6 is a schematic diagram of the structural composition of a server provided by an embodiment of this specification;
FIG. 7 is a schematic structural diagram of the device for determining reward data provided by an embodiment of this specification.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
An embodiment of this specification provides a method for determining reward data, which can be applied in particular to a server of a data processing system.
In specific implementation, the server may be configured to acquire click operation data of the first sample user for multiple groups of tags, together with the target questions of the first sample user, and use these, through reinforcement learning, to train a preset question model that meets the requirements and can predict the target question a user wants to ask according to the user's click operations on multiple groups of tags. In the reinforcement learning process, the server may first obtain the click state data of the first sample user for the current tag, and the current action policy data determined by the preset question model according to that click state data, where the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user. The server may then call a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model. The reward data obtained in this way is used continuously to guide the training of the preset question model toward a better policy for accurately predicting the target question the user wants to ask, so as to obtain a preset question model that meets the requirements.
In this embodiment, the server may specifically include a background service server applied to a service platform side and capable of implementing functions such as data transmission and data processing. Specifically, the server may be an electronic device having data operation, storage function and network interaction function; software programs that support data processing, storage, and network interactions may also be provided for running in the electronic device. In the present embodiment, the number of servers is not particularly limited. The server may be one server, several servers, or a server cluster formed by several servers.
In a scenario example, referring to fig. 1, the method for determining reward data provided in this embodiment may be used to automatically determine suitable reward data, so that a preset question model meeting the requirements can be obtained through reinforcement training using that reward data.
In this scenario example, network company A plans to add an intelligent customer-service answering function to a mobile phone app it has released, so that problems users encounter when using the app can be resolved promptly. To improve the user experience, company A wants to train a preset question model for this customer-service answering scenario. With this model, when a user asks a question, the app can guide the user, collect behavior data, and intelligently determine the question the user wants to ask without requiring the user to type the question directly, and then feed back the answer according to the determined question.
Specifically, referring to fig. 2, when the user clicks the "my customer service" icon on the home page of the app installed on the mobile phone, the user enters the customer service dialogue interface of "my customer service". The app may then sequentially present multiple groups of tags to the user according to the attribute information of the currently logged-in user (for example, the user's gender, age, transaction records, and education). The user can click one or more tags in the displayed groups of tags according to the question to be asked. The app collects the user's click operations on the tags as the user's questioning behavior data and, through the preset question model and a preset question library, determines the standard question the user wants to ask, for example, "how to query the transaction record". Further, the app may search for an answer matching the standard question and feed it back to the user, as shown in fig. 3.
A tag may specifically be the name of a related service, for example, "transaction service"; the name of an operation supported in the service, for example, "query"; or the name of the object the operation acts on, for example, "trade order". Of course, the tags listed above are only illustrative. In specific implementation, tags of other contents or forms may also be used according to the specific business scenario and processing requirements, and this specification is not limited in this respect.
In order to achieve the above-mentioned functions, it is first necessary to train a preset question model capable of predicting the next action of the user and the question the user wants to ask according to the collected click operation of the user on the displayed label.
In this scenario example, the server may train and build a preset question model meeting the requirements through reinforcement learning, using sample data collected in advance, for example, the click operation data of first sample users participating in the test for multiple groups of tags and the target questions those users posed.
Specifically, the server may first establish an initial policy model, and then use it to predict, based on its current processing policy and the first sample user's click operation data on the tags, the tag the first sample user will click next and, finally, the target question the first sample user wants to ask. The reward data corresponding to each prediction result is then fed back to the model, so that after each prediction the initial policy model can continuously optimize its processing policy according to the obtained reward data. In this way, the model gradually learns a relatively accurate processing policy and, based on it, can more accurately predict the target question a user wants to ask from the user's click operations on the tags.
In particular, the server may take the click state data (denoted S1) of first sample user A for the first group of displayed tags in the sample data, namely that user A clicked tag 1 and tag 2, together with user A's user characteristic data such as gender, age, occupation, and monthly income, and input them into the initial policy model. Running the initial policy model, the model predicts, based on user A's current tag click state data and user characteristics, that user A's click operation on the next group of presented tags (for example, the second group) will be: clicking tag 4 and tag 5. This is taken as the next action policy data (denoted a1) predicted for the click state data of user A's current tag.
Further, the server may determine corresponding reward data (denoted r1) for the action policy data predicted by the initial policy model under its initial processing policy, so as to guide the initial policy model to continuously improve and optimize the processing policy it uses.
Specifically, the server may take the predicted action policy data a1 of user A and the click state data S1 of user A's current tag as model input to a pre-trained preset reward model for determining suitable reward data, run the preset reward model to obtain the model output corresponding to a1 and S1, and record it as the reward data r1 fed back to the initial policy model for the state data S1 and the action policy data a1 adopted for S1. The initial policy model can then learn from the reward data r1, adjust the processing policy it used before, and adopt a better policy to predict the user's action policy data more accurately.
Further, the policy model may update user A's tag click state data according to the predicted action policy data a1 to obtain user A's next tag click state data (denoted S2): user A clicks tag 4 and tag 5. The server may then input the newly updated tag click state data S2 as the current tag click state, together with user A's user characteristics, into the initial policy model, and run the model to predict the next action policy data (denoted a2) for that state: user A asks how to query the transaction record. That is, the predicted target question user A wants to ask is how to query the transaction record.
Similarly, the server also takes the tag click state data S2 and the corresponding predicted action policy data a2 as model input to the preset reward model, runs the preset reward model to obtain the corresponding model output, and uses it as the reward data r2 fed back to the initial policy model for the state data S2 and the action policy data a2 adopted for S2.
Furthermore, the server can adjust and optimize the processing policy used by the initial policy model according to the reward data r1 and r2, thereby completing the reinforcement learning on first sample user A's portion of the sample data.
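The training interaction described above can be viewed as a simple loop: the policy model predicts an action for the current click state, the pre-trained reward model scores the (state, action) pair, and the collected rewards are used to update the processing policy. The sketch below illustrates this loop under assumed placeholder interfaces; the predict/score methods and state helpers are illustrative assumptions, not interfaces defined by this specification.

```python
# Illustrative sketch only: the reinforcement-learning loop for one sample user.
# policy_model / reward_model and their methods are assumed placeholder objects.

def run_episode(policy_model, reward_model, user_features, initial_state):
    """Roll out one episode and collect (state, action, reward) triples."""
    state = initial_state            # e.g. S1: user A clicked tag 1 and tag 2
    trajectory = []
    while True:
        # Predict next action policy data: next clicks, or the target question.
        action = policy_model.predict(state, user_features)   # e.g. a1, a2, ...
        # Score the (state, action) pair with the pre-trained reward model.
        reward = reward_model.score(state, action)            # e.g. r1, r2, ...
        trajectory.append((state, action, reward))
        if action.is_target_question:        # a predicted question ends the episode
            break
        state = state.apply_clicks(action)   # e.g. S2: user A clicks tag 4 and 5
    return trajectory

# The rewards in the trajectory are then used to adjust and optimize the
# policy model's processing policy, e.g. policy_model.update(trajectory).
```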
In this way, the server can repeatedly use the sample data to adjust and optimize the processing policy used by the policy model until the error of the target question finally determined for the first sample user is relatively small, for example smaller than a preset error value. At that point the reinforcement training is complete and a preset question model with high accuracy that meets the requirements is obtained.
Thus, the corresponding reward data no longer has to be set manually based on the knowledge and experience of technicians. This avoids the problems of manually set reward data being easily affected by the technicians' subjective factors (such as processing experience and knowledge background), taking discrete values, and being prone to error, and it improves both the accuracy of the determined reward data and the efficiency of determining it.
In another scenario example, the server may collect the tag click operation data of second sample users participating in the test, together with the target questions those users want to ask, and use this data to train a preset reward model that can determine suitable reward data more accurately.
In this scenario example, company A may organize a batch of test users, each of whom clicks on the multiple groups of tags displayed by the app to describe the target question he or she wants to ask and finally explicitly inputs that target question. The server collects each user's tag click operations and the input target question as the second sample user's click operation data for multiple groups of tags and the corresponding target question. For example, user B, who participates in the test, clicks tag 2 and tag 3 in the first group of displayed tags, clicks tag 4 and tag 6 in the second group, and enters the target question in the last dialog: how to query sesame credits.
In specific implementation, the server may call the initial policy model to perform reinforcement learning on the tag click data and target questions of the second sample users, while technicians manually set, according to a corresponding preset reward rule, the reward parameters for the action policy data that the initial preset question model determines for the second sample user's state data. Further, the server may gather the multiple reward parameters given by the technicians for multiple pieces of action policy data of the same second sample user and calculate the cumulative reward from them. A target loss function is then constructed from the cumulative reward, where the target loss function contains the preset reward model. By solving for the optimal value of the target loss function, the model parameters of the preset reward model can be determined, thereby building the preset reward model.
In another scenario example, the server may also first obtain multiple first reward parameters that technicians manually set, according to a preset reward rule, for multiple pieces of action policy data determined by the initial policy model. Meanwhile, the server may establish an initial reward model and use it to determine multiple second reward parameters for the same pieces of action policy data. The server can then adjust the initial reward model repeatedly and purposefully according to the first and second reward parameters until the difference between them is less than or equal to a preset difference threshold, thereby obtaining a preset reward model with higher accuracy.
In another scenario example, the server may also gather historical training records from models previously trained through reinforcement learning, and extract from these records the sample state data, the sample action policy data, and the sample reward data adopted for them. Each mutually corresponding set of sample state data, sample action policy data, and sample reward data can then be used as one group of training data, so that multiple groups of training data for training the reward model are derived from the historical training records. By learning these groups of training data, a preset reward model can be established that determines suitable reward data from state data and the corresponding action policy data.
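As a rough illustration of this record-mining approach, the sketch below assembles (sample state, sample action policy, sample reward) groups from historical training records; the record layout and field names are assumptions made for illustration only.

```python
# Hypothetical sketch: deriving reward-model training data from historical
# reinforcement-learning records. The record fields are assumed, not specified.

def build_training_data(historical_records):
    training_data = []
    for record in historical_records:
        training_data.append((
            record["state"],    # sample state data
            record["action"],   # sample action policy data taken for that state
            record["reward"],   # sample reward data fed back at the time
        ))
    return training_data        # multiple groups of (s, a, r) training data

# A supervised model can then be fit on these groups so that it predicts the
# reward from a (state, action policy) pair.
```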
Of course, it should be noted that the ways of obtaining the preset reward model listed above are only illustrative. In specific implementation, one of them may be selected, or another suitable way may be used, according to the specific situation and processing requirements. This specification is not limited in this respect.
Referring to fig. 4, an embodiment of this specification provides a method for determining reward data, applied to the server side. In particular implementations, the method may include the following.
S401: acquiring click state data of a first sample user aiming at a current tag, and determining current action strategy data of a preset questioning model according to the click state data of the first sample user aiming at the current tag, wherein the current action strategy data comprises: the first sample user may click on the next set of labels or the first sample user may ask a target question.
In some embodiments, the click state data of the first sample user for the current tag may specifically include: and the users participating in training the preset questioning model aim at clicking operation data of the current label in the displayed multiple groups of labels. For example, the first sample user clicks on tag 4 and tag 5 in the set of tags for the second set of tags currently being displayed, and does not click on the other tags in the set (e.g., tag 1, tag 2, and tag 3).
The tag may specifically include tag data associated with the problem and capable of describing one or more relevant attribute features of the problem. The preset question model may specifically include a model capable of predicting a next action of the user according to the user's click operation on the displayed tag, and a target question the user wants to question.
In some embodiments, in implementation, the server may first sequentially present a plurality of different sets of labels to the first sample user. For example, a first set of labels is displayed first, then a second set of labels is displayed, and so on. The first sample user can respectively select and click the labels which are associated with the target problem and can describe one or more relevant attribute characteristics of the target problem in the displayed groups of labels according to the target problem which the user wants to ask. Meanwhile, the first sample user can input the target questions which the user wants to ask according to the instruction. Thus, the server can acquire click operation data of the first sample user on a plurality of groups of labels and target problems of the first sample user corresponding to the click operation data, and the target problems are used as sample data for training and establishing a preset questioning model subsequently.
In some embodiments, the server may perform reinforcement learning by using the collected sample data (including the click operation data of the first sample user for the tag and the target problem of the first sample user), so as to establish a preset question model with high accuracy and meeting the requirements.
In implementation, the server may first establish an initial policy model as a preset questioning model. The corresponding processing strategy can be randomly generated through the preset questioning model, and action data (which can be recorded as action strategy data) of the next step of the user can be predicted according to the current label clicking state data of the user based on the processing strategy, wherein the action data comprises labels which the user is likely to click on next step or target questions which the user wants to ask questions and the like.
Specifically, the server may input the obtained click state data of the first sample user for the current tag (for example, the first sample user D selects to click on the tag 1 and the tag 2 in the displayed first group of tags, and does not click on other tags in the group of tags) as a model input, and input the model input to the preset question model. The preset questioning model is operated, so that the preset questioning model can obtain model output corresponding to the click state data of the first sample user for the current label according to the click state data of the first sample user for the current label (for example, the first sample user D can select and click the label 6 and the label 7 in the second set of labels displayed next and can not click other labels in the set of labels) as corresponding current action strategy data, thereby predicting the action possibly taken by the sample user in the next higher probability.
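A minimal sketch of this prediction step is given below: the click state of the current tag group and the user characteristics are combined into one model input, and the model output is read back as the current action policy data. The one-hot tag encoding and the predict interface are assumptions for illustration, not an encoding prescribed by this specification.

```python
# Illustrative only: turning the current click state into a model input and
# mapping it to current action policy data. Encodings are assumptions.
import numpy as np

ALL_TAGS = ["tag_1", "tag_2", "tag_3", "tag_4", "tag_5", "tag_6", "tag_7"]

def encode_click_state(clicked_tags):
    """One-hot style encoding of which tags in the current group were clicked."""
    return np.array([1.0 if t in clicked_tags else 0.0 for t in ALL_TAGS])

def predict_current_action(question_model, clicked_tags, user_features):
    state_vec = np.concatenate([encode_click_state(clicked_tags),
                                np.asarray(user_features, dtype=float)])
    # The output is interpreted either as the next group of tags the user is
    # likely to click, or as a target question drawn from the question library.
    return question_model.predict(state_vec)

# e.g. predict_current_action(model, ["tag_1", "tag_2"], [age, gender_code])
```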
However, since the preset question model initially makes its predictions based on a randomly generated processing policy that has not yet learned from the sample data, the accuracy of that policy is low, and the error in the action policy data predicted from it tends to be relatively large, so the preset question model does not yet meet the requirements.
In this embodiment, reinforcement learning is used in implementation: the preset question model predicts the corresponding action policy data from the first sample user's tag click state data under its current processing policy, reward data is determined from the predicted action policy data and the corresponding tag click state data, and that reward data is used to continuously optimize and improve the processing policy used by the preset question model. Based on the optimized and improved processing policy, the model can then predict the user's action policy data more accurately from the user's click state data for the tags, and so determine the question the user wants to ask.
The reward data may specifically include parameter data used to guide the learning direction of the reinforcement learning. According to the reward data, during reinforcement learning the model can, with relatively high probability, automatically learn and improve toward a better processing policy, continuously adjusting and optimizing the policy it uses.
In a conventional process of determining the reward data, a technician, relying on his or her own knowledge background and processing experience, manually sets a suitable reward value as the reward data fed back to the preset question model for each piece of action policy data that the model predicts, under its current processing policy, from the user's tag click state data.
Reward data determined in that way is often affected by the technicians' subjective factors (including their knowledge background and processing experience), so the determined reward data tends to be inaccurate, unstable, and prone to error. In addition, because the specific reward data is set manually, the resulting values are often discrete and discontinuous, which makes them less effective at guiding the preset question model to optimize and improve its processing policy.
In this embodiment, the server can instead use a pre-trained preset reward model to automatically and accurately determine suitable reward data according to the click state data of the first sample user's current tag that is input to the preset question model each time, together with the current action policy data the preset question model determines from that click state data, without relying on technicians to set the reward data manually.
S403: and calling a preset rewarding model, and determining rewarding data fed back to the preset questioning model according to the clicking state data of the first sample user aiming at the current label and the current action strategy data.
The preset reward model may specifically include a model that is trained in advance and that can determine, according to click state data of a user on a tag and action policy data predicted according to the click state data, reward data that is fed back to the model for reinforcement learning training.
In some embodiments, in implementation, the server may input, as a set of model inputs, the click state data of the first sample user for the currently displayed tag, and current action policy data corresponding to the click state data of the tag determined by the preset questioning model according to the click state data of the first sample user for the currently displayed tag, into the preset rewarding model. And running the preset rewarding model to obtain model output corresponding to the model input of the group, wherein the model output is used as rewarding data of a processing strategy adopted when the preset questioning model is used for determining the current action strategy data according to the click state data of the first sample user for the currently displayed label.
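A compact sketch of this call is shown below, assuming both pieces of data have already been encoded as numeric vectors; the concatenated-input layout and the predict interface are illustrative assumptions.

```python
# Hypothetical illustration of step S403: calling the preset reward model with
# (click state data, current action policy data) and reading back the reward.
import numpy as np

def determine_reward(reward_model, click_state_vec, action_vec):
    """Return the reward data fed back to the preset question model."""
    model_input = np.concatenate([click_state_vec, action_vec])  # one group of input
    return float(reward_model.predict(model_input))              # model output = reward
```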
In some embodiments, after obtaining the reward data fed back to the preset question model in the above way, the server may further use the reward data to perform reinforcement learning on the preset question model, so that the model can continuously revise and optimize the processing policy it uses according to the reward data. When the preset question model subsequently predicts the action policy corresponding to the same or similar tag click state data, it can use the revised and optimized processing policy to determine the corresponding action policy data more accurately, improving the precision of the preset question model.
As can be seen from the above, in the method for determining reward data provided in this embodiment, the click state data of the first sample user for the current tag is first obtained, together with the current action policy data that the preset question model determines according to that click state data; a pre-trained preset reward model is then called to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model, so that the preset question model can undergo reinforcement learning. In this way, reward data for reinforcement learning can be obtained quickly and accurately.
In some embodiments, the preset reward model may be obtained in the following way: obtaining the click operation data of second sample users for multiple groups of tags and the target questions of the second sample users as sample data; and obtaining the preset reward model by learning from the sample data.
In some embodiments, in implementation, the server may display multiple groups of tags to the second sample users participating in the test in advance, and guide each second sample user to describe the target question to be asked by clicking the relevant tags in each group, so that the second sample user's click operation data for the multiple groups of tags can be collected. Meanwhile, the server can guide the user to input the target question at the end (or right at the beginning), so that the target question corresponding to the second sample user's click operation data for the multiple groups of tags is collected at the same time. The click operation data of a second sample user for multiple groups of tags and the corresponding target question are combined into one group of data; in this way, multiple groups of such data can be obtained as sample data for training the preset reward model.
Further, the preset reward model can be established by learning from this sample data.
In a specific implementation, learning from the sample data to obtain the preset reward model may include the following: determining, according to the sample data, multiple reward parameters for multiple pieces of action policy data determined by the preset question model; determining a cumulative reward according to the multiple reward parameters; constructing a target loss function according to the cumulative reward; and establishing the preset reward model according to the target loss function.
In this embodiment, during implementation, the server may first establish an initial policy model as the preset question model and call it to perform reinforcement learning on the sample data, while technicians manually set, according to the corresponding preset reward rule, a reward value as the reward parameter for the action policy data predicted in each round from the second sample user's click operation data. Further, the server may gather the reward parameters given by the technicians for multiple rounds of predictions of the initial policy model to determine the corresponding cumulative reward, establish from the cumulative reward a target loss function containing the preset reward model, and then determine and build the corresponding preset reward model by solving for the optimal value of the target loss function, for example its minimum.
Specifically, the cumulative reward may be determined as follows:
G_t = Σ_{k=0}^{T-t} γ^k · r_{t+k}
where G_t denotes the cumulative reward, r_t denotes the reward parameter set for the action policy data predicted in round t from the second sample user's click operation data in the sample data, γ^k denotes the discount applied to the reward obtained k rounds after round t, γ denotes the discount factor, and T denotes the total number of rounds of action policy data predicted for the second sample user's click operation data in the sample data.
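In code form, the cumulative reward defined above discounts each round's reward parameter by γ raised to the number of rounds elapsed since round t, as in the following short sketch.

```python
def cumulative_reward(rewards, gamma, t=0):
    """G_t = sum_{k=0}^{T-t} gamma**k * r_{t+k}, with rewards[i] holding r_i."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))

# Example: reward parameters set by technicians over three prediction rounds.
# cumulative_reward([0.2, 0.5, 1.0], gamma=0.9) = 0.2 + 0.9*0.5 + 0.81*1.0
```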
In this embodiment, the preset reward model may be denoted as R. Further, a target loss function containing the preset reward model R to be determined may be established according to the obtained cumulative reward.
Specifically, the target loss function containing the preset reward model may be established from the above cumulative reward as follows:
L_1(σ) = (R(s_t, a_t; σ) - sigmoid(G_t))^2
where σ denotes the model parameters of the preset reward model, R denotes the preset reward model, s_t denotes the second sample user's click operation data (click state data) input in the t-th prediction round of the preset question model, a_t denotes the corresponding action policy data predicted in the t-th round based on s_t, sigmoid() denotes the sigmoid activation function, and L_1 denotes the target loss function.
The sigmoid activation function may be expressed in the following form: sigmoid(x) = 1 / (1 + e^(-x)).
After the target loss function is obtained, the preset reward model can be built according to it. Specifically, the preset reward model R may be determined by searching for and solving the optimal value (e.g., the minimum) of the target loss function, thereby gradually calculating and determining each model parameter σ of the preset reward model R.
Of course, it should be noted that the way of obtaining the preset reward model listed above is only illustrative. In specific implementation, the preset reward model may also be established in other suitable ways according to the specific application scenario and processing requirements.
In some embodiments, the sample data may also be learned and the preset reward model established in the following way: establishing an initial reward model; determining, according to the sample data and a preset reward rule, multiple first reward parameters for multiple pieces of action policy data determined by the preset question model; determining, according to the initial reward model, multiple second reward parameters for the same pieces of action policy data; and adjusting the initial reward model according to the multiple first reward parameters and the multiple second reward parameters to obtain the preset reward model.
In this embodiment, a first reward parameter may specifically be a reward value that a technician manually sets, according to the preset reward rule, for the action policy data that the preset question model (for example, the initial policy model) predicts from the second sample user's tag click operation data.
In this embodiment, a second reward parameter may specifically be a reward value that the initial reward model determines for the action policy data that the preset question model predicts from the same second sample user's tag click operation data.
Further, the server may compare the first reward parameter and the second reward parameter corresponding to the same predicted action policy data, and continuously adjust the model parameters of the initial reward model according to the comparison result, until the difference between the second reward parameter obtained from the adjusted reward model and the first reward parameter is smaller than a preset difference threshold. At that point the current reward model is determined to meet the accuracy requirement and is taken as the preset reward model.
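The comparison-and-adjustment procedure just described can be sketched as a small fitting loop: the initial reward model's outputs (second reward parameters) are nudged toward the manually set first reward parameters until the gap falls below the preset difference threshold. The gradient-style update and the linear parameter layout are assumptions for illustration.

```python
import numpy as np

def fit_reward_model(samples, first_rewards, sigma, lr=0.01, threshold=0.05,
                     max_iters=10_000):
    """samples[i] = (s_t, a_t); first_rewards[i] = manually set reward for it."""
    for _ in range(max_iters):
        max_gap = 0.0
        for (s_t, a_t), r_manual in zip(samples, first_rewards):
            x = np.concatenate([s_t, a_t])
            r_model = float(np.dot(sigma, x))    # second reward parameter
            gap = r_model - r_manual
            sigma -= lr * 2.0 * gap * x          # squared-error gradient step
            max_gap = max(max_gap, abs(gap))
        if max_gap <= threshold:                 # difference within the threshold
            break
    return sigma                                 # parameters of the preset reward model
```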
In some embodiments, referring to fig. 5, the server may also build the preset reward model in another way. Specifically, this may include: obtaining the historical training records from models previously trained through reinforcement learning; extracting from the historical training records the sample state data, the sample action policy data, and the sample reward data adopted for them; and using each mutually corresponding set of sample state data, sample action policy data, and sample reward data as one group of training data, so that multiple groups of training data for training the reward model are derived from the historical training records. Then, by model learning on these groups of training data, a preset reward model is established that can determine suitable reward data from state data and the corresponding action policy data.
In some embodiments, the tag may specifically include: a name tag of a service, a name tag of an operation in a service, a name tag of the object an operation in a service acts on, and so on. Of course, it should be noted that the tags listed above are only illustrative. In specific implementation, tags of other types and contents may be introduced according to the specific application scenario. This specification is not limited in this respect.
An embodiment of this specification also provides a server, comprising a processor and a memory for storing processor-executable instructions, where the processor, when executing the instructions, may perform the following steps: obtaining click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user; and calling a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model.
In order to more accurately complete the above instructions, referring to fig. 6, another specific server is provided in this embodiment of the present disclosure, where the server includes a network communication port 601, a processor 602, and a memory 603, and the above structures are connected by an internal cable, so that each structure may perform specific data interaction.
The network communication port 601 may specifically be configured to obtain click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user.
The processor 602 may specifically be configured to call a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model.
The memory 603 may be used for storing a corresponding program of instructions.
In this embodiment, the network communication port 601 may be a virtual port bound to different communication protocols, so that different data can be sent or received through it. For example, the network communication port may be port 80, responsible for web data communication, port 21, responsible for FTP data communication, or port 25, responsible for mail data communication. The network communication port may also be a physical communication interface or a communication chip, for example a wireless mobile network communication chip such as GSM or CDMA, a Wi-Fi chip, or a Bluetooth chip.
In this embodiment, the processor 602 may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor, and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, and an embedded microcontroller, among others. The description is not intended to be limiting.
In this embodiment, the memory 603 may include multiple levels, and in a digital system, the memory may be any memory as long as it can hold binary data; in an integrated circuit, a circuit with a memory function without a physical form is also called a memory, such as a RAM, a FIFO, etc.; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card, and the like.
An embodiment of this specification also provides a computer storage medium storing computer program instructions that, when executed, implement the method for determining reward data described above: obtaining click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user; and calling a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model.
In the present embodiment, the storage medium includes, but is not limited to, a random access Memory (Random Access Memory, RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects of the program instructions stored in the computer storage medium may be explained in comparison with other embodiments, and are not described herein.
An embodiment of this specification also provides a method for determining reward data, applied to the server side, which in specific implementation may include the following.
S1: obtain current state data, and current action policy data determined by a preset processing model according to the current state data.
S2: call a preset reward model to determine, according to the current state data and the current action policy data, the reward data fed back to the preset processing model.
In this embodiment, the current state data may specifically be data characterizing the current state of a target object, for example, a user's current click operation data on displayed tags, environmental data such as a passenger's current city, the weather of that city, and its traffic conditions, or health data such as a patient's current blood glucose and blood pressure values. Of course, the current state data listed above are only illustrative; in specific implementation, data of other contents and types may be introduced as current state data according to the specific application scenario and processing requirements. This specification is not limited in this respect.
In some embodiments, the current action policy data can be understood as the action the target object is predicted, under the processing policy, to take next based on the current state data; the current action policy data corresponds to the current state data. For example, if the current state data is the user's current click operation data on the displayed tags, then the current action policy data may be the user's click operation on the next group of displayed tags, or the target question the user wants to ask. If the current state data is environmental data such as the passenger's current city, weather, and traffic, then the current action policy data may be the means of transport the passenger will choose next, and so on.
In some embodiments, in implementation, the current state data may be input as model input into the preset processing model, and the preset processing model is run to obtain the corresponding model output as the current action policy data corresponding to the current state data.
In some embodiments, the preset processing model may specifically include a model capable of predicting corresponding current action policy data according to current state data based on the learned processing policy.
In some embodiments, after calling the preset reward model to determine, according to the current state data and the current action policy data, the reward data fed back to the preset processing model, the method may further include: performing reinforcement learning on the preset processing model according to the reward data to obtain a preset processing model that meets the requirements, where that model is used to determine, from state data, the action policy corresponding to the state data.
Through this embodiment, a trained preset reward model can replace a technician in giving the corresponding reward data for the current action policy data that the preset processing model determines from the current state data and in feeding that reward data back to the preset processing model, so that the preset processing model can undergo the corresponding reinforcement learning according to the reward data and a processing model with high accuracy and good effect is obtained.
Referring to fig. 7, at the software level, an embodiment of this specification further provides a device for determining reward data, which may specifically include the following structural modules.
The obtaining module 701 may specifically be configured to obtain click state data of a first sample user for a current tag, and current action policy data determined by a preset question model according to the click state data of the first sample user for the current tag, where the current action policy data comprises: a click operation of the first sample user on the next group of tags, or a target question posed by the first sample user.
The determining module 703 may specifically be configured to call a preset reward model to determine, according to the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset question model.
In some embodiments, the device may further include a reinforcement learning module, configured to perform reinforcement learning on the preset question model according to the reward data, so as to obtain a preset question model that meets the requirements, where that model is used to predict a user's target question according to the user's click operation data for multiple groups of tags.
In some embodiments, the apparatus may specifically further include a building module, specifically configured to build the preset reward model, where the building module may specifically include the following structural units:
The acquiring unit may be specifically configured to acquire, as sample data, click operation data of the second sample user for a plurality of groups of tags and a target question of the second sample user;
the learning unit may be specifically configured to learn the sample data to obtain the preset reward model.
In some embodiments, the learning unit may be specifically configured to determine, according to the sample data, a plurality of reward parameters for a plurality of action policy data determined based on the preset questioning model; determine a cumulative reward according to the plurality of reward parameters; construct a target loss function according to the cumulative reward; and establish the preset reward model according to the target loss function.
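As a rough illustration of this step, the sketch below discounts and sums per-step reward parameters into a cumulative reward and builds a simple target loss from it. The discount factor and the negative-return form of the loss are assumptions, since the disclosure does not fix a particular formula.

```python
# Assumed formulation: discounted cumulative reward and a target loss built from it.
import torch

def cumulative_reward(reward_params: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Discounted sum of the per-step reward parameters of one interaction sequence."""
    steps = torch.arange(reward_params.shape[0], dtype=reward_params.dtype)
    return (gamma ** steps * reward_params).sum()

def target_loss(reward_params: torch.Tensor) -> torch.Tensor:
    """Target loss from the cumulative reward: maximizing the cumulative reward
    is written as minimizing its negative."""
    return -cumulative_reward(reward_params)
```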
In some embodiments, the tag may specifically include: a name tag of a service, a name tag of an operation in the service, a name tag of an operation execution object in the service, and so forth.
In some embodiments, the learning unit may be further configured to establish an initial reward model; determine, according to the sample data and a preset reward rule, a plurality of first reward parameters for a plurality of action policy data determined based on the preset questioning model; determine, according to the initial reward model, a plurality of second reward parameters for the plurality of action policy data determined based on the preset questioning model; and adjust the initial reward model according to the plurality of first reward parameters and the plurality of second reward parameters, so as to obtain the preset reward model.
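A minimal sketch of this adjustment follows, assuming the first (rule-based) reward parameters are available as regression targets and the initial reward model is a differentiable scorer over state-action features; the MSE objective, the optimizer, and the names are illustrative assumptions rather than the disclosure's required method.

```python
# Sketch only: aligning the model's (second) reward parameters with rule-based
# (first) reward parameters; the names and objective are assumptions.
import torch
import torch.nn as nn

def adjust_reward_model(reward_model: nn.Module,
                        state_action_features: torch.Tensor,
                        first_reward_params: torch.Tensor,
                        epochs: int = 100,
                        lr: float = 1e-3) -> nn.Module:
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        second_reward_params = reward_model(state_action_features).squeeze(-1)  # model-predicted rewards
        loss = mse(second_reward_params, first_reward_params)  # gap between rule-based and predicted rewards
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return reward_model
```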
It should be noted that, the units, devices, or modules described in the above embodiments may be implemented by a computer chip or entity, or may be implemented by a product having a certain function. For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, when the present description is implemented, the functions of each module may be implemented in the same piece or pieces of software and/or hardware, or a module that implements the same function may be implemented by a plurality of sub-modules or a combination of sub-units, or the like. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
From the above, in the device for determining reward data provided in the embodiments of the present disclosure, the obtaining module obtains the click state data of the first sample user for the current tag, together with the current action policy data determined by the preset questioning model according to that click state data; the determining module then invokes a pre-trained preset reward model to determine, from the click state data of the first sample user for the current tag and the current action policy data, the reward data fed back to the preset questioning model for its reinforcement learning. In this way, reward data for reinforcement learning can be acquired quickly and accurately.
Although the present description provides method operational steps as described in the examples or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented by an apparatus or client product in practice, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment, or even in a distributed data processing environment). The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, it is not excluded that additional identical or equivalent elements may be present in a process, method, article, or apparatus that comprises a described element. The terms first, second, etc. are used to denote a name, but not any particular order.
Those skilled in the art will also appreciate that, in addition to implementing the controller in purely computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included therein for implementing various functions can also be regarded as structures within the hardware component. Such means for implementing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of embodiments, it will be apparent to those skilled in the art that the present description may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present specification may be embodied essentially in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and include several instructions to cause a computer device (which may be a personal computer, a mobile terminal, a server, or a network device, etc.) to perform the methods described in the various embodiments or portions of the embodiments of the present specification.
Various embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. The specification is operational with numerous general purpose or special purpose computer system environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Although the present specification has been described by way of example, it will be appreciated by those skilled in the art that there are many variations and modifications to the specification without departing from the spirit of the specification, and it is intended that the appended claims encompass such variations and modifications as do not depart from the spirit of the specification.

Claims (14)

1. A method of determining bonus data, comprising:
acquiring click state data of a first sample user for a current tag, and current action strategy data determined by a preset questioning model according to the click state data of the first sample user for the current tag, wherein the current action strategy data comprises: a click operation of the first sample user on a next group of tags, or a target question raised by the first sample user;
invoking a preset reward model, and determining reward data fed back to the preset questioning model according to the click state data of the first sample user for the current tag and the current action strategy data; wherein the preset reward model is obtained in the following manner: determining, according to sample data, a plurality of reward parameters for a plurality of action strategy data determined based on the preset questioning model; determining a cumulative reward according to the plurality of reward parameters; constructing a target loss function according to the cumulative reward; and establishing the preset reward model according to the target loss function.
2. The method of claim 1, after invoking a preset reward model to determine reward data fed back to a preset questioning model according to click state data of the first sample user for a current tag and the current action policy data, the method further comprises:
and performing reinforcement learning on the preset questioning model according to the reward data to obtain a preset questioning model meeting the requirements, wherein the preset questioning model meeting the requirements is used for predicting target questions of the user according to click operation data of the user aiming at a plurality of groups of labels.
3. The method of claim 1, wherein the preset reward model is obtained as follows:
acquiring, as sample data, click operation data of a second sample user for a plurality of groups of tags and a target question of the second sample user;
and learning the sample data to obtain the preset reward model.
4. The method of claim 1, wherein the tag comprises: a name tag of a service, a name tag of an operation in the service, and a name tag of an operation execution object in the service.
5. The method of claim 3, wherein learning the sample data to obtain the preset reward model further comprises:
establishing an initial reward model;
determining, according to the sample data and a preset reward rule, a plurality of first reward parameters for a plurality of action strategy data determined based on the preset questioning model;
determining, according to the initial reward model, a plurality of second reward parameters for the plurality of action strategy data determined based on the preset questioning model;
and adjusting the initial reward model according to the plurality of first reward parameters and the plurality of second reward parameters, so as to obtain the preset reward model.
6. A method of determining bonus data, comprising:
acquiring current state data and current action strategy data determined by a preset processing model according to the current state data;
invoking a preset reward model to determine reward data fed back to the preset processing model according to the current state data and the current action strategy data; wherein the preset reward model is obtained in the following manner: determining, according to sample data, a plurality of reward parameters for a plurality of action strategy data determined based on a preset questioning model; determining a cumulative reward according to the plurality of reward parameters; constructing a target loss function according to the cumulative reward; and establishing the preset reward model according to the target loss function.
7. The method of claim 6, after invoking a preset reward model to determine reward data to be fed back to a preset process model according to the current state data and the current action policy data, the method further comprising:
and performing reinforcement learning on the preset processing model according to the reward data to obtain a preset processing model meeting the requirements, wherein the preset processing model meeting the requirements is used for determining an action strategy corresponding to the state data according to the state data.
8. A bonus data determining apparatus, comprising:
an acquisition module, configured to acquire click state data of a first sample user for a current tag and current action strategy data determined by a preset questioning model according to the click state data of the first sample user for the current tag, wherein the current action strategy data comprises: a click operation of the first sample user on a next group of tags, or a target question raised by the first sample user;
a determining module, configured to invoke a preset reward model and determine reward data fed back to the preset questioning model according to the click state data of the first sample user for the current tag and the current action strategy data; wherein the preset reward model is obtained in the following manner: determining, according to sample data, a plurality of reward parameters for a plurality of action strategy data determined based on the preset questioning model; determining a cumulative reward according to the plurality of reward parameters; constructing a target loss function according to the cumulative reward; and establishing the preset reward model according to the target loss function.
9. The apparatus of claim 8, further comprising a reinforcement learning module configured to perform reinforcement learning on the preset questioning model according to the reward data to obtain a preset questioning model that meets requirements, wherein the preset questioning model that meets requirements is configured to predict a target question of a user according to click operation data of the user for a plurality of groups of tags.
10. The apparatus of claim 9, further comprising a building module for building the preset reward model, the building module comprising:
an acquisition unit, configured to acquire, as sample data, click operation data of the second sample user for a plurality of groups of tags and a target question of the second sample user;
and a learning unit, configured to learn the sample data to obtain the preset reward model.
11. The apparatus of claim 8, wherein the tag comprises: a name tag of a service, a name tag of an operation in the service, and a name tag of an operation execution object in the service.
12. The apparatus of claim 10, wherein the learning unit is further configured to establish an initial reward model; determine, according to the sample data and a preset reward rule, a plurality of first reward parameters for a plurality of action strategy data determined based on the preset questioning model; determine, according to the initial reward model, a plurality of second reward parameters for the plurality of action strategy data determined based on the preset questioning model; and adjust the initial reward model according to the plurality of first reward parameters and the plurality of second reward parameters, so as to obtain the preset reward model.
13. A server comprising a processor and a memory for storing processor-executable instructions, which when executed by the processor implement the steps of the method of any one of claims 1 to 5.
14. A computer readable storage medium having stored thereon computer instructions which when executed implement the steps of the method of any of claims 1 to 5.
CN201911199043.6A 2019-11-29 2019-11-29 Method, device and server for determining rewarding data Active CN111046156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911199043.6A CN111046156B (en) 2019-11-29 2019-11-29 Method, device and server for determining rewarding data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911199043.6A CN111046156B (en) 2019-11-29 2019-11-29 Method, device and server for determining rewarding data

Publications (2)

Publication Number Publication Date
CN111046156A CN111046156A (en) 2020-04-21
CN111046156B true CN111046156B (en) 2023-10-13

Family

ID=70233563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911199043.6A Active CN111046156B (en) 2019-11-29 2019-11-29 Method, device and server for determining rewarding data

Country Status (1)

Country Link
CN (1) CN111046156B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507721B (en) * 2020-11-27 2023-08-11 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for generating text theme
CN116303949B (en) * 2023-02-24 2024-03-19 科讯嘉联信息技术有限公司 Dialogue processing method, dialogue processing system, storage medium and terminal

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102017011544A1 (en) * 2016-12-14 2018-06-14 Fanuc Corporation Control and machine learning device
CN110263979A (en) * 2019-05-29 2019-09-20 阿里巴巴集团控股有限公司 Method and device based on intensified learning model prediction sample label
CN110263136A (en) * 2019-05-30 2019-09-20 阿里巴巴集团控股有限公司 The method and apparatus for pushing object to user based on intensified learning model
CN110390108A (en) * 2019-07-29 2019-10-29 中国工商银行股份有限公司 Task exchange method and system based on deeply study

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107851216B (en) * 2015-09-11 2022-03-08 谷歌有限责任公司 Method for selecting actions to be performed by reinforcement learning agents interacting with an environment
WO2017223192A1 (en) * 2016-06-21 2017-12-28 Sri International Systems and methods for machine learning using a trusted model
US10049301B2 (en) * 2016-08-01 2018-08-14 Siemens Healthcare Gmbh Medical scanner teaches itself to optimize clinical protocols and image acquisition
WO2018211139A1 (en) * 2017-05-19 2018-11-22 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102017011544A1 (en) * 2016-12-14 2018-06-14 Fanuc Corporation Control and machine learning device
JP2018097680A (en) * 2016-12-14 2018-06-21 ファナック株式会社 Control system and machine learning device
CN110263979A (en) * 2019-05-29 2019-09-20 阿里巴巴集团控股有限公司 Method and device based on intensified learning model prediction sample label
CN110263136A (en) * 2019-05-30 2019-09-20 阿里巴巴集团控股有限公司 The method and apparatus for pushing object to user based on intensified learning model
CN110390108A (en) * 2019-07-29 2019-10-29 中国工商银行股份有限公司 Task exchange method and system based on deeply study

Also Published As

Publication number Publication date
CN111046156A (en) 2020-04-21

Similar Documents

Publication Publication Date Title
CN107463701B (en) Method and device for pushing information stream based on artificial intelligence
CN107040397B (en) Service parameter acquisition method and device
US20130030907A1 (en) Clustering offers for click-rate optimization
CN109816483B (en) Information recommendation method and device and readable storage medium
CN110322169B (en) Task issuing method and device
CN110321273A (en) A kind of business statistical method and device
US11372805B2 (en) Method and device for information processing
CN109445884B (en) Function label display method and terminal equipment
CN110413867B (en) Method and system for content recommendation
CN111046156B (en) Method, device and server for determining rewarding data
CN117203612A (en) Intelligent generation and management of computing device application updated estimates
US20190347621A1 (en) Predicting task durations
CN111079006A (en) Message pushing method and device, electronic equipment and medium
CN110162359A (en) Method for pushing, the apparatus and system of new hand's guidance information
US20240061688A1 (en) Automated generation of early warning predictive insights about users
CN111797320A (en) Data processing method, device, equipment and storage medium
CN107943571B (en) Background application control method and device, storage medium and electronic equipment
CN110263136B (en) Method and device for pushing object to user based on reinforcement learning model
CN108667877B (en) Method and device for determining recommendation information, computer equipment and storage medium
JP6890746B2 (en) Information processing device, information processing method
CN114385942A (en) Page element parameter optimization method, device and equipment based on user behavior
CN112115372B (en) Parking lot recommendation method and device
EP3764310A1 (en) Prediction task assistance device and prediction task assistance method
CN113094284A (en) Application fault detection method and device
CN113836388A (en) Information recommendation method and device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant