CN116361537A - Recommendation method, recommendation device and computer readable storage medium - Google Patents

Recommendation method, recommendation device and computer readable storage medium

Info

Publication number
CN116361537A
Authority
CN
China
Prior art keywords
sample
recommendation
result
state information
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111630471.7A
Other languages
Chinese (zh)
Inventor
闫岩
陆海俊
高恩伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111630471.7A priority Critical patent/CN116361537A/en
Publication of CN116361537A publication Critical patent/CN116361537A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26Visual data mining; Browsing structured data
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The embodiment of the invention discloses a recommendation method, a recommendation device and a computer readable storage medium, wherein the recommendation method comprises the following steps: receiving target user state information, the target user state information comprising at least one of viewing state information, purchasing state information, browsing state information and collection state information; and determining a real recommendation result by using a preset actor-critic recommendation model and the target user state information. The preset actor-critic recommendation model is obtained by constructing an actor-critic model with a shared network model, improving the policy made by the actor by using information entropy, adjusting the confidence parameter of the eligibility trace to obtain the output value of the critic recommendation model, and performing optimization training. With this scheme, the historical recommendation situation can be grasped as a whole during recommendation, exploiting the user's current interests is balanced against exploring the user's potential preferences, and the accuracy of the final recommendation result is improved.

Description

Recommendation method, recommendation device and computer readable storage medium
Technical Field
The present invention relates to the field of recommendation application systems and algorithms, and in particular, to a recommendation method, apparatus, and computer readable storage medium.
Background
In the field of data mining, user behavior data is a very important kind of data, and effectively extracting valuable information from such behavior data is an important prerequisite for successful product recommendation. Most existing recommendation systems are based on machine learning methods, and deep reinforcement learning is a popular approach among current recommendation algorithms. In the context of the big data era, how to abstract the commonalities among data to ensure accurate feature extraction, and how to ensure that the user's potential preferences are mined while recommending the products the user currently prefers, are major challenges for such algorithms.
However, existing methods have certain defects when recommending for the user's actual needs: on the one hand, they cannot balance between focusing on the current benefit and attending to the user's potential preferences; on the other hand, they cannot grasp the historical recommendation situation as a whole and therefore cannot make a proper ranking recommendation.
Disclosure of Invention
The embodiment of the invention provides a recommendation method, a recommendation device and a computer readable storage medium, which can grasp the historical recommendation situation as a whole during recommendation, balance exploiting the user's current interests against exploring the user's potential preferences, and improve the accuracy of the final recommendation result.
The technical scheme of the invention is realized as follows:
the embodiment of the invention provides a recommendation method, which comprises the following steps:
receiving target user state information; the target user state information comprises at least one of viewing state information, purchasing state information, browsing state information and collecting state information;
determining a real recommendation result by using a preset actor-critic recommendation model and the target user state information; the preset actor-critic recommendation model is obtained by constructing an actor-critic model with a shared network model, improving the policy made by the actor by using information entropy, adjusting the confidence parameter of the eligibility trace to obtain the output value of the critic recommendation model, and performing optimization training.
In the above solution, before determining the actual recommendation, the method further includes:
determining a training sample, wherein the training sample is randomly collected user state information;
processing the user state information by using an initial actor-critic recommendation model to obtain a sample recommendation result, a sample evaluation result and a sample actual reward;
Processing the sample recommendation result by using a sample information entropy to determine a sample real evaluation result;
adjusting parameters of a critic recommendation model and an actor recommendation model based on a sample confidence parameter, the sample real evaluation result, the sample information entropy and the sample actual reward;
and determining the preset actor-critic recommendation model by respectively adjusting the critic model parameters and the actor model parameters.
In the above scheme, the processing the sample recommendation result by using the sample information entropy to determine the sample real evaluation result includes:
processing the sample recommendation result by using the sample information entropy to obtain a sample real recommendation result based on the information entropy;
and determining the sample real evaluation result by using the sample real recommendation result based on the information entropy.
In the above solution, the adjusting parameters of the critic recommendation model and the actor recommendation model based on the sample confidence parameter, the sample real evaluation result, the sample information entropy and the sample actual reward includes:
determining a sample error according to the sample confidence parameter, the sample real evaluation result and the sample actual reward;
adjusting critic recommendation model parameters by using the sample error;
and adjusting actor recommendation model parameters by using the sample error and the sample information entropy.
In the above scheme, one round of recommendation comprises a plurality of time instants; time t and time t+1 are any of the plurality of time instants;
the processing the user state information by using an initial actor-critic recommendation model to obtain a recommendation result, an evaluation result and an actual reward comprises the following steps:
performing feature extraction and analysis processing on sample user state information at the moment t by using the initial actor-critic recommendation model, and determining a sample recommendation result at the moment t, a sample actual reward at the moment t and sample user state information at the moment t+1; the sample actual rewards comprise at least one of click quantity information, purchase amount information and browsing duration information for recommended content;
determining a sample evaluation result at the time t by using the sample recommendation result at the time t and the sample user state information at the time t;
and carrying out feature extraction and analysis processing on the sample user state information at the time t+1 by using the initial actor-critic recommendation model, and determining a sample recommendation result at the time t+1 and a sample evaluation result at the time t+1.
In the above scheme, the processing the sample recommendation result by using the sample information entropy to obtain a sample real recommendation result based on the information entropy includes:
determining the sample information entropy at the time t according to the sample user state information at the time t and the sample recommendation result at the time t;
determining a sample intermediate recommendation result at the time t according to the sample recommendation result at the time t, the sample information entropy at the time t and a sample preset factor;
and obtaining a real sample recommendation result based on the information entropy at the moment t according to the sample recommendation result at the moment t and the sample intermediate recommendation result at the moment t.
In the above scheme, the determining the sample real evaluation result by using the sample real recommendation result based on the information entropy includes:
and determining a sample real evaluation result at the time t according to the sample real recommendation result based on the information entropy at the time t and the sample user state information at the time t.
In the above scheme, the obtaining the real sample recommendation result based on the information entropy at the time t according to the sample recommendation result at the time t and the sample intermediate recommendation result at the time t includes:
determining a sample measurement distance according to the sample recommendation result at the time t and the sample intermediate recommendation result at the time t;
If the sample measurement distance is greater than or equal to a sample preset threshold, the sample intermediate recommendation result at the time t is a sample real recommendation result based on information entropy at the time t;
and if the sample measurement distance is smaller than the sample preset threshold, the sample recommendation result at the time t is the real sample recommendation result based on the information entropy at the time t.
In the above solution, the determining a sample error according to the sample confidence parameter, the sample real evaluation result and the sample actual reward includes:
obtaining a sample real evaluation result based on the eligibility trace at time t through the sample preset round number and the sample confidence parameter;
determining a sample target value based on the sample real evaluation result based on the eligibility trace at time t, the sample real evaluation result at time t+1 and the sample actual reward;
and determining the sample error according to the sample real evaluation result based on the eligibility trace at time t and the sample target value.
In the above solution, the determining the sample target value based on the sample real evaluation result based on the eligibility trace at time t, the sample real evaluation result at time t+1, and the sample actual reward includes:
determining a sample real evaluation result at time t+1 according to the sample recommendation result at time t+1 and the sample user state information at time t+1;
multiplying the sample real evaluation result at time t+1 by a preset discount parameter to determine a calculation result;
and adding the calculation result and the sample actual reward to obtain the sample target value.
In the above solution, the adjusting the actor recommendation model parameters by using the sample error and the sample information entropy includes:
determining an update gradient of the actor recommendation model parameters by using the sample error and the sample information entropy;
and updating the actor recommendation model parameters by using the gradient.
In the above solution, the determining the actual recommendation result by using the preset actor-critic recommendation model and the target user state information includes:
performing feature extraction and analysis processing on the target user state information by using the preset actor-critics recommendation model, and determining an initial recommendation result;
determining an intermediate recommendation result corresponding to the target user state information according to the initial recommendation result, the information entropy and the preset factor;
And determining a real recommendation result corresponding to the target user state information according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information.
In the above solution, the determining, according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information, the actual recommendation result corresponding to the target user state information includes:
determining a measurement distance according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information;
if the measurement distance is greater than or equal to a preset threshold value, the intermediate recommendation result corresponding to the target user state information is a real recommendation result corresponding to the target user state information;
and if the measurement distance is smaller than the preset threshold value, the initial recommendation result is a real recommendation result corresponding to the target user state information.
The embodiment of the invention provides a recommendation device, which comprises a receiving unit and a determining unit; wherein:
the receiving unit is used for receiving the state information of the target user; the target user state information comprises at least one of viewing state information, purchasing state information, browsing state information and collecting state information;
The determining unit is used for determining a real recommendation result by using a preset actor-critic recommendation model and the target user state information; the preset actor-critic recommendation model is obtained by constructing an actor-critic model with a shared network model, improving the policy made by the actor by using information entropy, adjusting the confidence parameter of the eligibility trace to obtain the output value of the critic recommendation model, and performing optimization training.
The embodiment of the invention provides a recommending device, which comprises:
a memory for storing executable instructions;
the processor is configured to execute the executable instructions stored in the memory, and when the executable instructions are executed, the processor performs the above recommendation method.
Embodiments of the present invention provide a storage medium storing executable instructions that, when executed by one or more processors, perform the recommendation method.
The embodiment of the invention provides a recommendation method, a recommendation device and a computer readable storage medium, wherein the recommendation method comprises: receiving target user state information, the target user state information comprising at least one of viewing state information, purchasing state information, browsing state information and collecting state information; and determining a real recommendation result by using a preset actor-critic recommendation model and the target user state information; the preset actor-critic recommendation model is obtained by constructing an actor-critic model with a shared network model, improving the policy made by the actor by using information entropy, adjusting the confidence parameter of the eligibility trace to obtain the output value of the critic recommendation model, and performing optimization training. With this scheme, the historical recommendation situation can be grasped as a whole during recommendation, exploiting the user's current interests is balanced against exploring the user's potential preferences, and the accuracy of the final recommendation result is improved.
Drawings
FIG. 1 is a first schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a general recommendation algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the effect of an actor-critic based shared-parameter model according to an embodiment of the present invention;
FIG. 4 is a second schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a recommendation ranking algorithm using eligibility traces according to an embodiment of the present invention;
FIG. 6 is a third schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 7 is a fourth schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 8 is a fifth schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 9 is a sixth schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 10 is a seventh schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 11 is an eighth schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of intra-episode agent time-step feedback according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of an agent history recommendation example according to an embodiment of the present invention;
FIG. 14 is a first schematic diagram of an inter-episode decision process according to an embodiment of the present invention;
FIG. 15 is a second schematic diagram of an inter-episode decision process according to an embodiment of the present invention;
FIG. 16 is a third schematic diagram of an inter-episode decision process according to an embodiment of the present invention;
FIG. 17 is a ninth schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 18 is a tenth schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 19 is an eleventh schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 20 is a twelfth schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 21 is a thirteenth schematic flowchart of a recommendation method according to an embodiment of the present invention;
FIG. 22 is a schematic structural diagram of a recommendation device according to an embodiment of the present invention;
FIG. 23 is a schematic structural diagram of another recommendation device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without making any inventive effort fall within the scope of the present invention.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. Fig. 1 is a schematic flow chart of a recommendation method according to an embodiment of the present invention, and will be described with reference to the steps shown in fig. 1.
S101, receiving target user state information; the target user status information includes at least one of viewing status information, purchasing status information, browsing status information, and collection status information.
In the embodiment of the invention, the state information of the target user is acquired; the state information of the target user may include at least one of which products the user has recently viewed, which products the user has purchased, and which products the user has browsed or added to favorites.
It can be understood that in the embodiment of the invention, after the state information of the target user is obtained, the accuracy of the final recommendation result is improved by analyzing the state information of the user.
S102, determining a real recommendation result by using a preset actor-critic recommendation model and the target user state information; the preset actor-critic recommendation model is obtained by constructing an actor-critic model with a shared network model, improving the policy made by the actor by using information entropy, adjusting the confidence parameter of the eligibility trace to obtain the output value of the critic recommendation model, and performing optimization training.
In the embodiment of the invention, the user state information is taken as input to the trained actor-critic recommendation model, the features of the input data are extracted and analyzed by the model, and finally the corresponding recommendation result is output.
In the embodiment of the invention, the recommendation result can be daily necessities recommendation, movie recommendation, scoring recommendation and the like, and the invention is not limited.
In the embodiment of the invention, the general recommendation algorithm flow is shown in fig. 2. First, the data is extracted, cleaned and converted into regular and effective data by the big data end. Then, in the recall stage, coarse ranking is performed according to the recall algorithm designed for the business to form a recall candidate set, and refined ranking is performed within the candidate set to obtain the final recommendation result. Data warehouse technology (ETL, Extract-Transform-Load) is the process of loading data from business systems into a data warehouse after extraction, cleaning and transformation; its purpose is to integrate the scattered, fragmented and non-uniform data within an enterprise so as to provide an analysis basis for the enterprise's decisions.
In the embodiment of the invention, the actor-critic recommendation model is improved at the training stage. Originally, after the critic network and the actor network receive input data, each extracts and analyzes features through its own convolution layers and outputs its result; in the present invention the two networks are improved so that they share the convolution layers, as shown in fig. 3. The user state information is input and convolved to obtain a 32-channel feature map with an image size of 20×20 pixels per channel; this 32-channel feature map is convolved again to obtain a 64-channel feature map with an image size of 9×9 pixels per channel; the feature map of each channel is convolved once more to obtain 7×7 feature maps, and applying the same processing to the 64 channels in turn yields a new 64-channel feature map. The new 64-channel feature map is convolved into 512 neural units, the 512 neural units are fully connected, and the fully connected network splits into the actor branch and the critic branch. After model training, a recommendation model meeting the requirements is obtained; the input data is processed with this model and the corresponding recommendation is output.
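As a concrete illustration of the shared structure just described, the following is a minimal PyTorch-style sketch of such a shared actor-critic network. It is only an illustrative sketch, not the implementation of the patent: it assumes an 84×84 single-channel state image and 8×8/4, 4×4/2 and 3×3/1 convolution kernels (which reproduce the 20×20, 9×9 and 7×7 feature-map sizes mentioned above), and n_items is a hypothetical number of candidate items.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    def __init__(self, n_items: int):
        super().__init__()
        # Shared convolution trunk: 1 -> 32 (20x20) -> 64 (9x9) -> 64 (7x7) -> 512 units
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        # Split into the actor branch (recommendation policy) and critic branch (value)
        self.actor = nn.Linear(512, n_items)
        self.critic = nn.Linear(512, 1)

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)                          # shared features
        policy = torch.softmax(self.actor(h), dim=-1)  # pi(s): recommendation distribution
        value = self.critic(h)                         # critic evaluation of the state
        return policy, value

# Example: one user-state image produces a recommendation distribution and a value.
model = SharedActorCritic(n_items=100)
pi, v = model(torch.randn(1, 1, 84, 84))
```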
In the embodiment of the invention, after the model is built, the network weights need to be trained by back propagation. The traditional strategy is to update the network weights with an approximate policy optimization algorithm; in contrast, the present application adopts the idea of eligibility traces to solve the optimization problem. A traditional eligibility-trace algorithm can only capture the relationship within one interaction; if the internal links span multiple episodes, the association cannot be mined. Unlike the traditional eligibility trace, the invention uses eligibility traces between episodes, so that specific recommendations are made while taking the agent's historical recommendations into account.
In the embodiment of the invention, in a recommendation algorithm, if only the best item of the current ranking result is kept and the user's potential preferences are ignored, then against the background of user interest drift the user's other potential likes are sacrificed. Potential items may bring large benefits in the future: focusing only on the currently optimal recommendation makes the agent "short-sighted", while focusing too much on long-term rewards loses the currently optimal reward, and the ratio of potential reward to current choice is not necessarily optimal, so a balance between "exploration" and "exploitation" is necessary. Traditional reinforcement learning methods are based on multi-armed bandit algorithms to ensure exploration and exploitation overall. However, these algorithms only make small oscillations in the action space, and the actions taken cannot permanently influence the agent over the long term; in other words, the exploration ability needs to be further improved. The present invention maintains the balance between exploration and exploitation through information entropy, automatically deciding whether to explore or exploit in the current state according to the magnitude of the information entropy.
It can be understood that in the embodiment of the invention, the initial actor-critic model is improved by using the information entropy, the eligibility trace and the network structure, and the trained model is used to process the input user state information to obtain the corresponding recommendation result, thereby improving the success rate of recommendations to the user.
In some embodiments of the present invention, referring to fig. 4, fig. 4 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S201 to S205 shown in fig. 4 constitute the training process of the actor-critic recommendation model, which will be described with reference to the steps.
S201, determining a training sample, wherein the training sample is randomly collected user state information.
In some embodiments of the present invention, user state information is randomly gathered as training samples prior to training the model.
S202, processing the user state information by using an initial actor-critic recommendation model to obtain a sample recommendation result, a sample evaluation result and a sample actual reward.
In some embodiments of the present invention, the initial actor-critic model is used to process the input user state information: the actor network outputs the recommendation result, and the critic network outputs the evaluation result. The actual reward is the reward brought by the operations the user performs on the recommended result after the result is presented to the user.
S203, processing the sample recommendation result by using the sample information entropy to determine the sample real evaluation result.
In some embodiments of the invention, it is applicable to determining the scenario of the real evaluation result.
In some embodiments of the present invention, after the initial actor-critics network outputs the recommended result and the corresponding evaluation, the information entropy is added to judge the recommended result, so as to determine whether to maintain the current recommended result or continue to search for a better recommended result. And after determining the final real recommended result, determining a real evaluation result according to the final recommended result.
It can be appreciated that in some embodiments of the present invention, the obtained recommendation result is processed by using the information entropy to obtain a final real recommendation result, and a real evaluation result is obtained according to the real recommendation result, so that the evaluation of the recommendation result is more accurate, and the accuracy of the recommendation result is improved.
S204, adjusting parameters of the critic recommendation model and the actor recommendation model based on the sample confidence parameter, the sample real evaluation result, the sample information entropy and the sample actual reward.
In some embodiments of the invention, it is applicable to scenes in which model parameters are adjusted.
In some embodiments of the invention, parameters of the critic network and the actor network are adjusted respectively based on the obtained real evaluation result, the information entropy, the actual reward and the confidence parameter of the newly introduced eligibility trace.
It will be appreciated that in some embodiments of the present invention, adjusting the parameters of the critic network and the actor network respectively based on the obtained real evaluation result, the information entropy, the actual reward and the confidence parameter of the newly introduced eligibility trace makes the training result of the model more accurate.
S205, determining a preset actor-critic recommendation model by respectively adjusting the critic model parameters and the actor model parameters.
In some embodiments of the invention, it is applicable to the scenario of acquiring a trained model.
In some embodiments of the present invention, the initial critic model parameters and the actor model parameters are continuously adjusted until an actor-critic recommendation model whose accuracy meets the standard is obtained; the overall model training process is illustrated in fig. 5.
In some embodiments of the present invention, referring to fig. 6, fig. 6 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S203 shown in fig. 6 may be implemented by S2031 to S2032, which will be described with reference to the steps.
S2031, processing the sample recommendation result by using the sample information entropy to obtain a sample real recommendation result based on the information entropy.
In some embodiments of the present invention, it is applicable to determining scenes of real recommended results.
In some embodiments of the present invention, after obtaining the recommendation result by using the actor network, the obtained recommendation result is processed according to the information entropy, and the real recommendation result based on the information entropy is determined.
It can be appreciated that in some embodiments of the present invention, after the recommended result is obtained, the information entropy processing is used to obtain the final real recommended result, so that the accuracy of the final recommended result is improved.
S2032, determining a sample real evaluation result by using the sample real recommendation result based on the information entropy.
In some embodiments of the invention, it is applicable to determining the scenario of the real evaluation result.
In some embodiments of the present invention, after obtaining a real recommendation result based on the information entropy, a real evaluation result is obtained from the real recommendation result.
In some embodiments of the invention, the evaluation result is obtained based on the recommendation result.
In some embodiments of the present invention, referring to fig. 7, fig. 7 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S204 shown in fig. 7 may be implemented by S2041 to S2043, which will be described with reference to the steps.
S2041, determining a sample error according to the sample confidence parameter, the sample real evaluation result and the sample actual reward.
In some embodiments of the present invention, after the actual evaluation and actual rewards are obtained, model predicted values and target values are obtained in combination with confidence parameters, and errors are determined based on the model predicted values and target values.
In some embodiments of the present invention, the confidence parameter λ represents the confidence weight for recent recommendations, and its value range is [0,1]. When λ is 0, only the ranking result of the current round is recommended and no evaluation is made of the historical recommendation trajectory; when λ is 1, all historical recommendation episodes are comprehensively considered.
S2042, adjusting critic recommendation model parameters by using the sample error.
In some embodiments of the invention, after the error is obtained, the parameters of the critic model are adjusted according to the obtained error.
It will be appreciated that in some embodiments of the present invention, the error is determined according to the predicted value and the target value of the critic model, and the parameters of the model are adjusted by using the error, so that the prediction accuracy of the model is higher.
S2043, adjusting actor recommendation model parameters by using the sample error and the sample information entropy.
In some embodiments of the present invention, the error between the predicted value and the target value is determined after the predicted value and the target value are obtained according to the critic model, and the parameters of the actor model are updated according to the obtained error and the information entropy.
In some embodiments of the present invention, referring to fig. 8, fig. 8 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S202 shown in fig. 8 may be implemented by S2021 to S2023, which will be described with reference to the steps.
S2021, performing feature extraction and analysis processing on the sample user state information at time t by using the initial actor-critic recommendation model, and determining the sample recommendation result at time t, the sample actual reward at time t and the sample user state information at time t+1; the sample actual reward includes at least one of click amount information, purchase amount information, and browsing duration information for the recommended content.
In some embodiments of the present invention, a round of interaction includes a plurality of time instants, and the parameter update of the critic network is performed from any time instant to the next time instant until one round is finished; training then continues with the next round until the expected number of rounds is reached or training converges.
In some embodiments of the present invention, the state information of the user at time t is obtained, the initial actor-critic model is used to perform feature extraction and analysis on the input user state information at time t, and finally the recommendation result at time t, the actual reward at time t and the user state information at time t+1 are output. The actual reward is the reward brought by the operations the user performs on the recommended result after the result is presented to the user.
S2022, determining a sample evaluation result at the time t by using the sample recommendation result at the time t and the sample user state information at the time t.
In some embodiments of the present invention, a recommendation result at time t and user state information at time t are obtained, and an evaluation result of time t for the recommendation result is obtained by using the recommendation result and the user state information.
S2023, performing feature extraction and analysis processing on the sample user state information at time t+1 by using the initial actor-critic recommendation model, and determining the sample recommendation result at time t+1 and the sample evaluation result at time t+1.
In some embodiments of the present invention, the method is suitable for determining scenes of recommended results and evaluation results at different moments.
In some embodiments of the present invention, the user state information at time t+1 is used as the input of the actor-critic recommendation model, and the model processes the input to determine the recommendation result at time t+1 and the evaluation result at time t+1.
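As a concrete illustration of steps S2021 to S2023, the sketch below walks through one interaction step from time t to time t+1. The functions model() and env_feedback() are stand-ins assumed for illustration only: the former for the shared actor-critic network and the latter for the user's feedback (clicks, purchase amount, browsing duration); neither is defined in the patent.

```python
import random

def model(state):
    # placeholder shared network: returns (recommendation, evaluation) for a state
    return random.choice(["item_a", "item_b"]), random.random()

def env_feedback(state, recommendation):
    # placeholder user feedback: (actual reward, next user state)
    return random.random(), state + 1

state_t = 0
rec_t, value_t = model(state_t)                       # recommendation + evaluation at t
reward_t, state_t1 = env_feedback(state_t, rec_t)     # actual reward and state at t+1
rec_t1, value_t1 = model(state_t1)                    # recommendation + evaluation at t+1
print(rec_t, value_t, reward_t, rec_t1, value_t1)
```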
In some embodiments of the present invention, referring to fig. 9, fig. 9 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S2031 shown in fig. 9 may be implemented by S301 to S303, which will be described in connection with each step.
S301, determining the entropy of sample information at the time t according to the sample user state information at the time t and the sample recommendation result at the time t.
In some embodiments of the present invention, after the state information and the recommendation result at the time t are obtained, the information entropy of the state at the time t is determined according to the state information and the recommendation result.
In some embodiments of the present invention, equation (1) for calculating the entropy of information is as follows:
H_π(s_t) = −∑ π(s_t, a_t) log π(s_t, a_t)    (1)
wherein H_π(s_t) is the information entropy obtained at time t; s_t is the state of the user at time t; and a_t is the recommendation result at time t.
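The following small sketch illustrates formula (1): computing the policy entropy of a recommendation distribution for one user state. The example distributions are illustrative only.

```python
import numpy as np

def policy_entropy(pi_st: np.ndarray) -> float:
    """H_pi(s_t) = -sum_a pi(s_t, a) * log pi(s_t, a)."""
    pi_st = np.clip(pi_st, 1e-12, 1.0)   # avoid log(0)
    return float(-np.sum(pi_st * np.log(pi_st)))

print(policy_entropy(np.array([0.7, 0.2, 0.1])))     # peaked policy -> low entropy
print(policy_entropy(np.array([1/3, 1/3, 1/3])))     # uniform policy -> high entropy
```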
S302, determining a sample intermediate recommendation result at the time t according to the sample recommendation result at the time t, the sample information entropy at the time t and a sample preset factor.
In some embodiments of the invention, it is applicable to determining the scenes of the intermediate recommendations at different times.
In some embodiments of the present invention, after obtaining the recommended result, the information entropy and the preset factor at the time t, the intermediate recommended result of the state at the time t is determined according to the obtained value.
In some embodiments of the present invention, equation (2) for calculating the intermediate recommendation for the state at time t is as follows:
π*(s_t) = π(s_t) + β·H_π(s_t)    (2)
wherein π*(s_t) is the intermediate recommendation result for the state at time t; β is an adjustment factor; and π(s_t) is the output of the actor network in the shared network at time t.
And S303, obtaining a real sample recommendation result based on the information entropy at the moment t according to the sample recommendation result at the moment t and the sample intermediate recommendation result at the moment t.
In some embodiments of the invention, it is useful to determine real recommendations based on information entropy at different times.
In some embodiments of the present invention, a recommendation result at time t and an intermediate recommendation result of the state at time t are obtained, and a real recommendation result based on information entropy at time t is determined.
It can be appreciated that in some embodiments of the present invention, after the recommended result is obtained, the information entropy processing is used to obtain the final real recommended result, so that the accuracy of the final recommended result is improved.
In some embodiments of the present invention, S2032 may be implemented by S401, and will be described in connection with the steps.
S401, determining a sample real evaluation result at the moment t according to a sample real recommendation result based on information entropy at the moment t and sample user state information at the moment t.
In some embodiments of the invention, it is applicable to determining the scenes of the real evaluation results at different moments.
In some embodiments of the present invention, after obtaining a real recommendation result based on information entropy at time t, a real evaluation result at time t is obtained in combination with state information of a user at time t.
It can be appreciated that in some embodiments of the present invention, the obtained recommendation result is processed by using the information entropy to obtain a final real recommendation result, and a real evaluation result is obtained according to the real recommendation result, so that the evaluation of the recommendation result is more accurate, and the accuracy of the recommendation result is improved.
In some embodiments of the present invention, referring to fig. 10, fig. 10 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S303 shown in fig. 10 may be implemented by S3031 to S3033, which will be described in connection with each step.
S3031, determining a sample measurement distance according to the sample recommendation result at the time t and the sample intermediate recommendation result at the time t.
In some embodiments of the invention, it is applicable to determining a scene of measuring distance.
In some embodiments of the present invention, after the recommended result at time t and the intermediate recommended result in the state at time t are obtained, the metric distance is determined according to the obtained two values.
In some embodiments of the present invention, equation (3) for determining the metric distance is as follows:
D = KL(π* ‖ π)    (3)
wherein D is the metric distance.
S3032, if the sample measurement distance is greater than or equal to a sample preset threshold, the sample intermediate recommendation result at the time t is a sample real recommendation result based on the information entropy at the time t.
In some embodiments of the present invention, it is applicable to determining scenes of real recommended results.
In some embodiments of the present invention, the measured distance is compared with a preset threshold value, and if the measured distance is greater than or equal to the threshold value, the intermediate recommendation result of the state at the time t is a real recommendation result based on the information entropy at the time t.
In some embodiments of the present invention, equation (4) for determining the actual recommendation is as follows:
π′ = π*(s_t) if D ≥ δ, and π′ = π(s_t) if D < δ    (4)
wherein π′ is the real recommendation result, and δ is a hyper-parameter that can be adjusted according to actual needs.
S3033, if the sample measurement distance is smaller than the sample preset threshold, the sample recommendation result at the time t is a real sample recommendation result based on the information entropy at the time t.
In some embodiments of the present invention, the measured distance is compared with a preset threshold value, and if the measured distance is smaller than the preset threshold value, the recommended result at the time t is a real recommended result based on information entropy at the time t.
It can be understood that, in some embodiments of the present invention, when the metric distance D is greater than or equal to the threshold, the recommendation result given by the policy function differs greatly from the exploration result given by the entropy, so exploration should be supported and the action policy is π*; conversely, if the metric distance D is smaller than the threshold, the explored result does not differ much from the result given by the policy function, meaning that exploration is not deep enough, so the original recommendation result in this state is kept to maintain exploitation of the current optimal policy. Judging the final recommendation result according to the preset threshold and the metric distance improves the accuracy of the recommendation result.
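The following sketch illustrates how formulas (2) to (4) can be combined into a single exploration/exploitation decision for one state: the entropy-augmented intermediate distribution π* is built, its KL distance to π is measured, and π* is kept only when the distance reaches the threshold. The values of β and δ, and the re-normalisation of π*, are assumptions made for illustration.

```python
import numpy as np

def select_recommendation(pi: np.ndarray, beta: float = 0.1, delta: float = 0.05):
    pi = np.clip(pi, 1e-12, 1.0)
    entropy = -np.sum(pi * np.log(pi))                   # formula (1)
    pi_star = pi + beta * entropy                        # formula (2)
    pi_star = pi_star / pi_star.sum()                    # keep it a distribution (assumption)
    kl = float(np.sum(pi_star * np.log(pi_star / pi)))   # formula (3): D = KL(pi* || pi)
    # Formula (4): explore with pi* when the distance is large, otherwise exploit pi.
    return (pi_star, "explore") if kl >= delta else (pi, "exploit")

probs, mode = select_recommendation(np.array([0.6, 0.3, 0.1]))
print(mode, probs)
```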
In some embodiments of the present invention, referring to fig. 11, fig. 11 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S2041 shown in fig. 11 may be implemented by S401 to S403, which will be described in connection with the steps.
S401, obtaining a sample real evaluation result based on the eligibility trace at time t through the sample preset round number and the sample confidence parameter.
In some embodiments of the invention, it is applicable to the scenario of determining the real evaluation result based on the eligibility trace.
In some embodiments of the present invention, the real evaluation result based on the eligibility trace at time t is obtained through the preset number of rounds and by adjusting the confidence parameter of the eligibility trace.
In some embodiments of the present invention, the conventional eligibility trace calculates the weight of each time step within an episode and updates with multi-step reward feedback within the episode. FIG. 12 is a diagram of intra-episode (one round of interaction) agent time-step feedback. t and T are the starting and ending times respectively, λ is the weight coefficient, 1−λ is the weight coefficient of the 1-step return, (1−λ)λ is the weight coefficient of the 2-step return, and so on; the sum of the weights of all fused terms is 1. The open circles represent states in an episode, the filled dots represent actions, and the squares represent termination states. Each pipeline represents the time steps taken in an episode; that is, within an episode, the proportion of each time step is calculated, propagated forward using that proportion, and updated using the replacing trace.
In some embodiments of the present invention, the idea of using eligibility traces between episodes allows specific recommendations to be made in consideration of the agent's recommendation history. In addition, the present invention uses a gradient descent method to feed the results of the eligibility-trace calculation back to the shared-weight network instead of using the replacing trace, so that the agent learns more easily. FIG. 13 is one possible case of an agent history recommendation example, where r is the number of rounds. Fig. 13 shows one case of fusing the recommendation values of the nearest 2 rounds and the nearest 3 rounds, each of which accounts for 1/2 of the current recommendation weight. One horizontal pipeline in the figure represents one round of recommendation (one episode). Open circles represent states, filled black circles represent recommendation results, a pipeline represents the complete recommendation flow of an episode, and actions are obtained by the shared network (the actor-critic recommendation model). Fig. 13 is an example of the agent's history recommendation, and figs. 14, 15 and 16 show the decision process between different episodes. FIG. 14 shows that the coefficient of the 1-step feedback value is 1−λ, FIG. 15 shows that the coefficient of the 2-step feedback value is (1−λ)λ, and FIG. 16 shows that the coefficient of the n-step feedback value is λ^(T−t−1). Combining figs. 14, 15 and 16: since an infinite number of n-step returns are fused in one round of interaction, and since the sum of the weights of all fused terms must be 1, the weight of each term is proportional to λ^(n−1) and is finally multiplied by 1−λ. Formula (5) for the real evaluation result based on the eligibility trace is therefore calculated as follows:
R_e^λ = (1 − λ) ∑_{n≥1} λ^(n−1) · R_(e+n)    (5)
wherein λ is the confidence parameter with value range [0,1]; e and n are round numbers.
In some embodiments of the invention, n may be set manually. When e is equal to 2, the rounds from round 2 to round 2+n are included in the formula, and finally an R value is obtained as the value of round 2, so that the value of round 2 takes the results up to round 2+n into consideration. After the number of rounds is specified, the value calculated by the formula is the real evaluation result based on the eligibility trace at time t for the current round.
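The sketch below illustrates one reading of formula (5): the value credited to round e fuses the evaluations of the following rounds with weights (1−λ)·λ^(n−1). Truncating the sum at a manually chosen n, as the preceding paragraph allows, and the example round values are assumptions for illustration.

```python
def eligibility_trace_return(round_values, e: int, lam: float, n: int) -> float:
    """R_e^lambda = (1 - lam) * sum_{k=1..n} lam**(k-1) * R_{e+k}."""
    total = 0.0
    for k in range(1, n + 1):
        if e + k >= len(round_values):
            break
        total += (1 - lam) * (lam ** (k - 1)) * round_values[e + k]
    return total

# Example: the value of round 2 takes rounds 3..2+n into account.
print(eligibility_trace_return([1.0, 0.8, 0.9, 1.2, 0.7, 1.1], e=2, lam=0.9, n=3))
```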
S402, determining a sample target value based on the sample real evaluation result based on the eligibility trace at time t, the sample real evaluation result at time t+1 and the sample actual reward.
In some embodiments of the present invention, after the real evaluation result based on the eligibility trace at time t is obtained through S401, the target value is determined in combination with the real evaluation result at time t+1 and the actual reward.
In some embodiments of the present invention, equation (6) for determining the target value is as follows:
sample target value = r_t + γ·R_π    (6)
wherein r_t is the actual reward in the state at time t; R_π is the output of the critic network at time t+1; and γ is the discount parameter.
S403, determining the sample error according to the sample real evaluation result based on the eligibility trace at time t and the sample target value.
In some embodiments of the present invention, the error is determined based on the real evaluation result based on the eligibility trace at time t and the target value. Equation (7) is as follows:
L(θ) = (r_t + γ·R_π − R_e^λ)²    (7)
wherein θ is the shared network parameter, the mean square error is selected as the loss function, and R_e^λ is the multi-step result using the eligibility trace.
It can be appreciated that in some embodiments of the present invention, the field of view of the agent is effectively controlled by adjusting the confidence parameter of the eligibility trace, ensuring that the agent grasps the historical recommendation situation as a whole when making a recommendation and produces a more refined ranking recommendation.
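The sketch below illustrates the critic-side computation of formulas (6) and (7): the target adds the actual reward at time t to the discounted critic output at time t+1, and the loss is the squared error between this target and the eligibility-trace-based evaluation R_e^λ. The numeric values and the discount γ are illustrative assumptions.

```python
def critic_loss(r_t: float, value_next: float, r_lambda: float, gamma: float = 0.99):
    target = r_t + gamma * value_next      # formula (6): sample target value
    error = target - r_lambda              # sample error
    return error, error ** 2               # formula (7): squared-error loss

err, loss = critic_loss(r_t=1.0, value_next=0.8, r_lambda=1.5)
print(err, loss)
```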
In some embodiments of the present invention, referring to fig. 17, fig. 17 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S402 shown in fig. 17 may be implemented by S4021 to S4023, which will be described in connection with the steps.
S4021, determining a sample real evaluation result at the time t+1 according to a sample recommendation result at the time t+1 and sample user state information at the time t+1.
In some embodiments of the invention, it is applicable to determining the scenes of the real evaluation results at different moments.
In some embodiments of the present invention, the real evaluation result at time t+1 is obtained according to the recommendation result at time t+1 and the state information at time t+1, i.e., R_π in formula (6).
S4022, multiplying the real sample evaluation result at the time t+1 by a preset discount parameter to determine a calculation result.
In some embodiments of the invention, it is applicable to determining the scenario of the calculation result.
In some embodiments of the present invention, the real evaluation result at time t+1 and the preset discount parameter are multiplied to determine the calculation result, and the specific operation is shown in formula (6).
S4023, adding the calculation result and the sample actual reward to obtain the sample target value.
In some embodiments of the invention, it is applicable to a scenario where a target value is obtained.
In some embodiments of the present invention, after the calculation result is obtained, the calculation result and the actual reward are added to obtain the final target value; the specific operation is shown in formula (6).
In some embodiments of the present invention, referring to fig. 18, fig. 18 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S2043 shown in fig. 18 may be implemented by S501 to S502, which will be described with reference to the steps.
S501, determining an update gradient of the actor recommendation model parameters by using the sample error and the sample information entropy.
In some embodiments of the invention, it is applicable to scenes in which gradients are determined.
In some embodiments of the present invention, after the error and the information entropy are obtained, the derivative is taken to obtain the update gradient, and the obtained gradient is used to update the actor recommendation model parameters. Calculation formula (8) is as follows:
∇_θ J(θ) = ∇_θ log π(s_t, a_t; θ)·(r_t + γ·R_π − R_e^λ) + β·∇_θ H_π(s_t)    (8)
wherein θ is the shared network parameter and H_π(s_t) is the information entropy.
S502, updating the recommended model parameters of the actor by using the gradient.
In some embodiments of the invention, it is applicable to scenarios where models are updated.
In some embodiments of the invention, after the gradient is obtained in S501, parameters of the actor network are updated according to the gradient.
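The sketch below shows one way the actor update of formula (8) can be realised with a toy PyTorch policy: the policy-gradient term is weighted by the sample error and the entropy term is scaled by a coefficient β. The tiny linear policy, the optimizer choice, the error value and β are all assumptions for illustration, not the implementation of the patent.

```python
import torch

# Tiny illustrative policy: a linear layer over a 4-dim user state, 3 candidate items.
policy = torch.nn.Linear(4, 3)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(4)
logits = policy(state)
log_probs = torch.log_softmax(logits, dim=-1)
probs = log_probs.exp()
entropy = -(probs * log_probs).sum()          # H_pi(s_t), formula (1)

action = int(torch.argmax(probs))             # recommended item (illustrative choice)
sample_error = 0.5                            # would come from the critic step, formula (7)
beta = 0.01                                   # assumed entropy weight

# Minimising this loss ascends the gradient of formula (8).
loss = -(log_probs[action] * sample_error + beta * entropy)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```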
In some embodiments of the present invention, referring to fig. 19, fig. 19 is a schematic flowchart of a recommendation method according to an embodiment of the present invention; S102 shown in fig. 19 may be implemented by S1021 to S1023, which will be described in connection with the steps.
S1021, performing feature extraction and analysis processing on the target user state information by using the preset actor-critic recommendation model, and determining an initial recommendation result.
In some embodiments of the present invention, it is applicable to determining the context of the initial recommendation.
In some embodiments of the present invention, after the trained actor-critic recommendation model is obtained, the model is used to perform feature extraction and analysis processing on the target user state information, and the initial recommendation result is determined; this process is similar to the training process.
S1022, determining an intermediate recommendation result corresponding to the target user state information according to the initial recommendation result, the information entropy and the preset factor.
In some embodiments of the invention, this is applicable to scenarios in which the intermediate recommendation result is determined.
In some embodiments of the present invention, the intermediate recommendation result corresponding to the target user state information is determined based on the initial recommendation result, the information entropy and the preset factor; the process is similar to the training process.
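The exact combination rule of S1022 is given by a formula that is not reproduced in this text, so the sketch below is only a hypothetical illustration: the information entropy of the initial distribution is computed exactly, while the way the preset factor uses it to smooth the scores toward exploration is an assumption.

```python
import numpy as np

def intermediate_recommendation(initial_scores: np.ndarray, preset_factor: float = 0.1):
    """Hypothetical construction of the intermediate recommendation result (assumes >1 candidate)."""
    probs = np.exp(initial_scores - initial_scores.max())
    probs /= probs.sum()                                   # initial recommendation distribution
    entropy = -np.sum(probs * np.log(probs + 1e-12))       # information entropy of the distribution
    max_entropy = np.log(probs.size)
    # Assumed rule: smooth more toward uniform when the policy is over-confident (low entropy).
    weight = np.clip(preset_factor * (1.0 - entropy / max_entropy), 0.0, 1.0)
    uniform = np.full_like(probs, 1.0 / probs.size)
    return (1.0 - weight) * probs + weight * uniform       # intermediate recommendation distribution
```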
S1023, determining a real recommendation result corresponding to the target user state information according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information.
In some embodiments of the present invention, this is applicable to scenarios in which the real recommendation result is determined.
In some embodiments of the present invention, the initial recommendation result and the intermediate recommendation result corresponding to the target user state information are obtained, and the real recommendation result corresponding to the target user state information is determined from them; the process is similar to the training process.
In some embodiments of the present invention, referring to fig. 20, fig. 20 is a schematic flowchart of a recommendation method provided by an embodiment of the present invention; S1023 shown in fig. 20 may be implemented by S10231 to S10233, which are described below in connection with the steps.
S10231, determining a measurement distance according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information.
In some embodiments of the invention, this is applicable to scenarios in which the measurement distance is determined.
In some embodiments of the present invention, the measurement distance is determined from the initial recommendation result and the intermediate recommendation result corresponding to the target user state information, as shown in formula (3).
S10232, if the measurement distance is greater than or equal to a preset threshold value, the intermediate recommendation result corresponding to the target user state information is a real recommendation result corresponding to the target user state information.
In some embodiments of the present invention, this is applicable to scenarios in which the real recommendation result is determined.
In some embodiments of the present invention, the measurement distance is compared with the preset threshold; if the measurement distance is greater than or equal to the preset threshold, the intermediate recommendation result corresponding to the target user state information is the real recommendation result corresponding to the target user state information, and the process is similar to the training process.
S10233, if the measurement distance is smaller than a preset threshold value, the initial recommendation result is a real recommendation result corresponding to the target user state information.
In some embodiments of the present invention, this is applicable to scenarios in which the real recommendation result is determined.
In some embodiments of the present invention, the measurement distance is compared with the preset threshold; if the measurement distance is smaller than the preset threshold, the initial recommendation result is the real recommendation result corresponding to the target user state information, and the process is similar to the training process.
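A minimal sketch of the S10231 to S10233 selection rule follows; treating the measurement distance as a Kullback-Leibler divergence and the threshold value of 0.05 are assumptions standing in for formula (3) and the preset threshold.

```python
import numpy as np

def real_recommendation(initial: np.ndarray, intermediate: np.ndarray,
                        threshold: float = 0.05) -> np.ndarray:
    """Choose between the initial and the intermediate recommendation distributions."""
    # Measurement distance, assumed here to be a KL divergence (stand-in for formula (3)).
    distance = np.sum(initial * np.log((initial + 1e-12) / (intermediate + 1e-12)))
    # S10232 / S10233: accept the intermediate result only if it moved far enough from the initial one.
    return intermediate if distance >= threshold else initial
```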
It can be appreciated that, in some embodiments of the present invention, existing techniques either focus on the current benefit while ignoring the user's potential preferences, or emphasize potential preferences at the expense of the current benefit, and cannot balance the two; nor can they grasp the historical recommendation situation as a whole and produce a proper ranking recommendation for what the user actually needs. According to the invention, by introducing the qualification trace (i.e., eligibility trace) and the information entropy into the model, the historical recommendation situation can be grasped as a whole during recommendation, and the user's current interests and potential preferences are balanced.
The embodiment of the invention provides a recommendation method, and an optional flow chart is shown in fig. 21.
S1, constructing a shared-parameter network model, and initializing the actor network, the critic network, and the agent's field of view N in the shared network;
In some embodiments of the present invention, MDP modeling from reinforcement learning is used; a complete MDP model consists of a state, action, reward, transition probability quadruple. The four elements around which the MDP modeling is built are therefore expanded as follows (a minimal code sketch of the resulting training loop is given after the step list below).
State S: following the Occam's razor principle, the state is designed as a matrix whose rows represent user ids and whose columns represent product ids, the value of each element in the matrix being taken as a feature vector.
Action A: the combination of the TOPN candidate results recalled from the candidate set.
Reward R: for the recommendation result fed back to the user, the reward is 1 if a click action occurs within the TOPN, otherwise it is 0.
Transition probability F: if the user clicks a certain product in the recommended results, a commodity under that product category is selected with a certain probability e, and the ordered results are recommended with probability 1-e.
S2, inputting the user's state at the current time into the network, performing forward propagation, and outputting an actor result and a critic result;
S3, calculating the real action result based on the information entropy;
S4, judging whether the current round of interaction is completed;
S5, calculating the historical accumulated return value based on the qualification trace if the interaction is completed, and continuing the interaction if it is not;
S6, calculating the back-propagation gradient at the current moment, and back-propagating through the shared network by using a gradient descent algorithm;
S7, repeating S2 to S6 until the expected number of episodes is reached or training converges.
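The step flow S1 to S7 can be summarized in the minimal sketch below. The env and net interfaces, the discount gamma, the entropy weight beta and the episode count are assumptions, and the qualification-trace accumulation of S5 is simplified to a one-step temporal-difference target for brevity.

```python
import torch

def train(net, env, episodes=1000, gamma=0.9, beta=0.01, lr=1e-3):
    # Assumed interfaces (illustrative, not from the patent):
    #   net(state) -> (actor logits over candidates, critic value of the state)
    #   env.reset() -> state tensor; env.step(action) -> (next_state, reward, done)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(episodes):                             # S7: until the expected episode count
        state, done = env.reset(), False
        while not done:                                   # S4/S5: interact until the round ends
            logits, value = net(state)                    # S2: forward propagation
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()                        # S3: action from the policy (entropy adjustment omitted)
            next_state, reward, done = env.step(action)   # Reward R: 1 if clicked within TOPN, else 0
            with torch.no_grad():
                _, next_value = net(next_state)
            target = reward + (0.0 if done else gamma * next_value)
            td_error = target - value                     # one-step stand-in for the trace-based return
            critic_loss = td_error.pow(2).sum()
            actor_loss = -(dist.log_prob(action) * td_error.detach()
                           + beta * dist.entropy()).sum()
            opt.zero_grad()
            (critic_loss + actor_loss).backward()         # S6: back-propagate through the shared network
            opt.step()
            state = next_state
```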
It can be understood that the obtained user state information is subjected to feature extraction and analysis by using a preset actor-critic recommendation model, and a corresponding recommendation result is obtained. According to the invention, by introducing the qualification trace and the information entropy into the model, the historical recommendation situation can be grasped as a whole during recommendation, and the user's current interests and potential preferences are balanced.
An embodiment of the present invention provides a structural schematic diagram of a recommendation apparatus; as shown in fig. 22, fig. 22 is a structural schematic diagram of a recommendation apparatus provided by an embodiment of the present invention, where the recommendation apparatus includes: a receiving unit 2201 and a determining unit 2202; wherein:
a receiving unit 2201, configured to receive target user state information; the target user state information includes at least one of viewing state information, purchasing state information, browsing state information and collection state information.
A determining unit 2202, configured to determine a real recommendation result by using a preset actor-critic recommendation model and the target user state information; the preset actor-critic recommendation model is obtained by constructing an actor-critic model by using a shared network model, carrying out strategy improvement on the actor by using information entropy, obtaining an output value of the critic recommendation model by using a confidence parameter for adjusting the qualification trace, and carrying out optimization training.
In some embodiments of the present invention, the determining unit 2202 is further configured to determine a training sample, where the training sample is randomly collected user state information; process the user state information by using an initial actor-critic recommendation model to obtain a sample recommendation result, a sample evaluation result and a sample actual reward; process the sample recommendation result by using a sample information entropy to determine a sample real evaluation result; adjust parameters of a critic recommendation model and an actor recommendation model based on a sample confidence parameter, the sample real evaluation result, the sample information entropy and the sample actual reward; and determine the preset actor-critic recommendation model by respectively adjusting the critic model parameters and the actor model parameters.
In some embodiments of the present invention, the determining unit 2202 is further configured to process the recommendation result by using the information entropy to obtain a real recommendation result based on the information entropy; and determine a real evaluation result by using the real recommendation result based on the information entropy.
In some embodiments of the present invention, the determining unit 2202 is further configured to process the sample recommendation result by using the sample information entropy to obtain a sample real recommendation result based on the information entropy; and determine a sample real evaluation result by using the sample real recommendation result based on the information entropy.
In some embodiments of the present invention, the determining unit 2202 is further configured to determine a sample error according to the sample confidence parameter, the sample real evaluation result and the sample actual reward; adjust critic recommendation model parameters by using the sample error; and adjust actor recommendation model parameters by using the sample error and the sample information entropy.
In some embodiments of the invention, a round of recommendations contains a plurality of moments, and time t and time t+1 are any of the plurality of moments; the determining unit 2202 is further configured to perform feature extraction and analysis processing on the sample user state information at time t by using the initial actor-critic recommendation model, and determine a sample recommendation result at time t, a sample actual reward at time t, and sample user state information at time t+1; the sample actual reward includes at least one of click quantity information, purchase amount information and browsing duration information for the recommended content; determine a sample evaluation result at time t by using the sample recommendation result at time t and the sample user state information at time t; and perform feature extraction and analysis processing on the sample user state information at time t+1 by using the initial actor-critic recommendation model, and determine a sample recommendation result at time t+1 and a sample evaluation result at time t+1.
In some embodiments of the present invention, the determining unit 2202 is further configured to determine, according to the sample user state information at time t and the sample recommendation result at time t, the sample information entropy at time t; determine a sample intermediate recommendation result at time t according to the sample recommendation result at time t, the sample information entropy at time t and a sample preset factor; and obtain a real sample recommendation result based on the information entropy at time t according to the sample recommendation result at time t and the sample intermediate recommendation result at time t.
In some embodiments of the present invention, the determining unit 2202 is further configured to determine a sample real evaluation result at time t according to the sample real recommendation result based on the information entropy at time t and the sample user state information at time t.
In some embodiments of the present invention, the determining unit 2202 is further configured to determine a sample measurement distance according to the sample recommendation result at time t and the sample intermediate recommendation result at time t; if the sample measurement distance is greater than or equal to a sample preset threshold, the sample intermediate recommendation result at time t is the sample real recommendation result based on the information entropy at time t; and if the sample measurement distance is smaller than the sample preset threshold, the sample recommendation result at time t is the sample real recommendation result based on the information entropy at time t.
In some embodiments of the present invention, the determining unit 2202 is further configured to obtain, through the sample preset round number and the sample confidence parameter, a sample real evaluation result based on the qualification trace at time t; determine a sample target value based on the sample real evaluation result based on the qualification trace at time t, the sample real evaluation result at time t+1 and the sample actual reward; and determine the sample error according to the sample real evaluation result based on the qualification trace at time t and the sample target value.
In some embodiments of the present invention, the determining unit 2202 is further configured to determine the sample real evaluation result at time t+1 according to the sample recommendation result at time t+1 and the sample user state information at time t+1; multiply the sample real evaluation result at time t+1 by a preset discount parameter to determine a calculation result; and add the calculation result and the sample actual reward to obtain the sample target value.
In some embodiments of the present invention, the determining unit 2202 is further configured to determine an update gradient of the actor recommendation model parameter using the sample error and the sample information entropy; and updating the actor recommendation model parameters by using the gradient.
In some embodiments of the present invention, the determining unit 2202 is further configured to perform feature extraction and analysis processing on the target user state information by using the preset actor-critic recommendation model to determine an initial recommendation result; determine an intermediate recommendation result corresponding to the target user state information according to the initial recommendation result, the information entropy and a preset factor; and determine a real recommendation result corresponding to the target user state information according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information.
In some embodiments of the present invention, the determining unit 2202 is further configured to determine a measurement distance according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information; if the measurement distance is greater than or equal to a preset threshold, the intermediate recommendation result corresponding to the target user state information is the real recommendation result corresponding to the target user state information; and if the measurement distance is smaller than the preset threshold, the initial recommendation result is the real recommendation result corresponding to the target user state information.
It can be understood that, in the implementation scheme of the apparatus, the obtained user state information is subjected to feature extraction and analysis by using the preset actor-critic recommendation model, and a corresponding recommendation result is obtained. By introducing the qualification trace and the information entropy into the model, the historical recommendation situation can be grasped as a whole during recommendation, the user's current interests and potential preferences are balanced, and the accuracy of the final recommendation result is improved.
Based on the method of the foregoing embodiments, an embodiment of the present invention further provides a structural schematic diagram of a recommendation apparatus; as shown in fig. 23, fig. 23 is a structural schematic diagram of a recommendation apparatus provided by an embodiment of the present invention, where the recommendation apparatus includes: a processor 2301 and a memory 2302; the memory 2302 stores one or more programs executable by the processor 2301, and when the one or more programs are executed by the processor 2301, any of the recommendation methods of the foregoing embodiments is performed.
An embodiment of the present invention provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors; when executed by the processors, the programs implement the recommendation method of the embodiments of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention.

Claims (16)

1. A recommendation method, the method comprising:
receiving target user state information; the target user state information comprises at least one of viewing state information, purchasing state information, browsing state information and collecting state information;
determining a real recommendation result by using a preset actor-critic recommendation model and the target user state information; the preset actor-critic recommendation model is obtained by constructing an actor-critic model by using a shared network model, improving strategies made by the actor by using information entropy, adjusting confidence parameters of qualification traces, obtaining output values of the critic recommendation model, and performing optimization training.
2. The method of claim 1, wherein before the determining a real recommendation result by using a preset actor-critic recommendation model and the target user state information, the method further comprises:
determining a training sample, wherein the training sample is randomly collected user state information;
processing the user state information by using an initial actor-critic recommendation model to obtain a sample recommendation result, a sample evaluation result and a sample actual reward;
processing the sample recommendation result by using a sample information entropy to determine a sample real evaluation result;
adjusting parameters of a critic recommendation model and an actor recommendation model based on a sample confidence parameter, the sample real evaluation result, the sample information entropy and the sample actual reward;
and determining the preset actor-critic recommendation model by respectively adjusting the critic model parameters and the actor model parameters.
3. The method of claim 2, wherein the processing the sample recommendation result by using a sample information entropy to determine a sample real evaluation result comprises:
processing the sample recommendation result by using the sample information entropy to obtain a sample real recommendation result based on the information entropy;
and determining the sample real evaluation result by using the sample real recommendation result based on the information entropy.
4. The method according to claim 2 or 3, wherein the adjusting parameters of a critic recommendation model and an actor recommendation model based on a sample confidence parameter, the sample real evaluation result, the sample information entropy and the sample actual reward comprises:
determining a sample error according to the sample confidence parameter, the sample real evaluation result and the sample actual reward;
adjusting critic recommendation model parameters by using the sample error;
and adjusting actor recommendation model parameters by using the sample error and the sample information entropy.
5. The method of claim 2, wherein a round of recommendations comprises a plurality of moments; time t and time t+1 are any of the plurality of moments;
the processing the user state information by using an initial actor-critic recommendation model to obtain a sample recommendation result, a sample evaluation result and a sample actual reward comprises the following steps:
performing feature extraction and analysis processing on sample user state information at the moment t by using the initial actor-critic recommendation model, and determining a sample recommendation result at the moment t, a sample actual reward at the moment t and sample user state information at the moment t+1; the sample actual rewards comprise at least one of click quantity information, purchase amount information and browsing duration information for recommended content;
Determining a sample evaluation result at the time t by using the sample recommendation result at the time t and the sample user state information at the time t;
and carrying out feature extraction and analysis processing on the sample user state information at the time t+1 by using the initial actor-critic recommendation model, and determining a sample recommendation result at the time t+1 and a sample evaluation result at the time t+1.
6. The method of claim 3, wherein the processing the sample recommendation using the sample information entropy to obtain a sample real recommendation based on the information entropy comprises:
determining the sample information entropy at the time t according to the sample user state information at the time t and the sample recommendation result at the time t;
determining a sample intermediate recommendation result at the time t according to the sample recommendation result at the time t, the sample information entropy at the time t and a sample preset factor;
and obtaining a real sample recommendation result based on the information entropy at the moment t according to the sample recommendation result at the moment t and the sample intermediate recommendation result at the moment t.
7. The method of claim 3, wherein the determining a sample real evaluation result by using the sample real recommendation result based on the information entropy comprises:
determining a sample real evaluation result at time t according to the sample real recommendation result based on the information entropy at time t and the sample user state information at time t.
8. The method of claim 6, wherein the obtaining the real sample recommendation result based on the information entropy at the time t according to the sample recommendation result at the time t and the sample intermediate recommendation result at the time t comprises:
determining a sample measurement distance according to the sample recommendation result at the time t and the sample intermediate recommendation result at the time t;
if the sample measurement distance is greater than or equal to a sample preset threshold, the sample intermediate recommendation result at the time t is a sample real recommendation result based on information entropy at the time t;
and if the sample measurement distance is smaller than the sample preset threshold, the sample recommendation result at the time t is the real sample recommendation result based on the information entropy at the time t.
9. The method of claim 4, wherein the determining a sample error based on the sample confidence parameter, the sample real evaluation result and the sample actual reward comprises:
obtaining a sample real evaluation result based on the qualification trace at time t through the sample preset round number and the sample confidence parameter;
determining a sample target value based on the sample real evaluation result based on the qualification trace at time t, the sample real evaluation result at time t+1 and the sample actual reward;
and determining the sample error according to the sample real evaluation result based on the qualification trace at time t and the sample target value.
10. The method of claim 9, wherein the determining a sample target value based on the sample real evaluation result based on the qualification trace at time t, the sample real evaluation result at time t+1 and the sample actual reward comprises:
determining the sample real evaluation result at time t+1 according to the sample recommendation result at time t+1 and the sample user state information at time t+1;
multiplying the sample real evaluation result at time t+1 by a preset discount parameter to determine a calculation result;
and adding the calculation result and the sample actual reward to obtain the sample target value.
11. The method of claim 4, wherein said adjusting actor recommendation model parameters using said sample error and said sample information entropy comprises:
Determining an update gradient of the actor recommendation model parameters by using the sample error and the sample information entropy;
and updating the actor recommendation model parameters by using the gradient.
12. The method of claim 1, wherein the determining a real recommendation using a preset actor-critic recommendation model and the target user state information comprises:
performing feature extraction and analysis processing on the target user state information by using the preset actor-critic recommendation model, and determining an initial recommendation result;
determining an intermediate recommendation result corresponding to the target user state information according to the initial recommendation result, the information entropy and the preset factor;
and determining a real recommendation result corresponding to the target user state information according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information.
13. The method of claim 12, wherein the determining the actual recommendation corresponding to the target user status information based on the initial recommendation and the intermediate recommendation corresponding to the target user status information comprises:
determining a measurement distance according to the initial recommendation result and the intermediate recommendation result corresponding to the target user state information;
If the measurement distance is greater than or equal to a preset threshold value, the intermediate recommendation result corresponding to the target user state information is a real recommendation result corresponding to the target user state information;
and if the measurement distance is smaller than the preset threshold value, the initial recommendation result is a real recommendation result corresponding to the target user state information.
14. A recommendation device, characterized in that the recommendation device comprises a receiving unit and a determining unit; wherein:
the receiving unit is used for receiving the state information of the target user; the target user state information comprises at least one of viewing state information, purchasing state information, browsing state information and collecting state information;
the determining unit is used for determining a real recommendation result by using a preset actor-critic recommendation model and the target user state information; the preset actor-critic recommendation model is obtained by constructing an actor-critic model by using a shared network model, carrying out strategy improvement on the actor by using information entropy, obtaining an output value of the critic recommendation model by using a confidence parameter for adjusting the qualification trace, and carrying out optimization training.
15. A recommendation device, characterized in that the recommendation device comprises:
A memory for storing executable instructions;
a processor for implementing the recommendation method according to any one of claims 1 to 13 when executing executable instructions stored in said memory.
16. A computer readable storage medium storing executable instructions for causing a processor to perform the recommendation method according to any one of claims 1 to 13.
CN202111630471.7A 2021-12-28 2021-12-28 Recommendation method, recommendation device and computer readable storage medium Pending CN116361537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111630471.7A CN116361537A (en) 2021-12-28 2021-12-28 Recommendation method, recommendation device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111630471.7A CN116361537A (en) 2021-12-28 2021-12-28 Recommendation method, recommendation device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116361537A true CN116361537A (en) 2023-06-30

Family

ID=86925655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111630471.7A Pending CN116361537A (en) 2021-12-28 2021-12-28 Recommendation method, recommendation device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116361537A (en)

Similar Documents

Publication Publication Date Title
CN110263244B (en) Content recommendation method, device, storage medium and computer equipment
CN107515909B (en) Video recommendation method and system
US10504024B2 (en) Normalization of predictive model scores
CN109961142B (en) Neural network optimization method and device based on meta learning
US11423103B2 (en) Content-item recommendations
CN110008409A (en) Based on the sequence of recommendation method, device and equipment from attention mechanism
CN111242310B (en) Feature validity evaluation method and device, electronic equipment and storage medium
WO2021129055A1 (en) Information prediction model training method and apparatus, information prediction method and apparatus, storage medium, and device
CN111406264A (en) Neural architecture search
WO2021007052A1 (en) Custom compilation videos
CN111259222A (en) Article recommendation method, system, electronic device and storage medium
US20190354796A1 (en) Systems and Methods for Slate Optimization with Recurrent Neural Networks
CN111783810A (en) Method and apparatus for determining attribute information of user
CN110263136B (en) Method and device for pushing object to user based on reinforcement learning model
CN110852808A (en) Asynchronous adaptive value evaluation method of electronic product based on deep neural network
CN113610610B (en) Session recommendation method and system based on graph neural network and comment similarity
CN113850654A (en) Training method of item recommendation model, item screening method, device and equipment
CN113688306A (en) Recommendation strategy generation method and device based on reinforcement learning
CN113836388A (en) Information recommendation method and device, server and storage medium
CN116361537A (en) Recommendation method, recommendation device and computer readable storage medium
CN113449176A (en) Recommendation method and device based on knowledge graph
CN104053024A (en) Short-period video-on-demand volume prediction system based on small number of data
CN110418171B (en) Media resource pushing method and device, storage medium and electronic device
CN111160662A (en) Risk prediction method, electronic equipment and storage medium
CN112597391B (en) Dynamic recursion mechanism-based hierarchical reinforcement learning recommendation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination