CN114297470A - Content recommendation method, device, equipment, medium and computer program product

Info

Publication number: CN114297470A
Application number: CN202111249826.8A
Authority: CN (China)
Prior art keywords: weight, target, parameters, historical, network
Other languages: Chinese (zh)
Inventors: 张启华, 刘军宁, 戴渝卓, 郑昆仑, 黄帆, 袁逸凡, 谭显锋, 齐逸岩
Current/Original Assignee: Tencent Technology Shenzhen Co Ltd
Priority: CN202111249826.8A
Legal status: Pending


Abstract

The application discloses a content recommendation method, apparatus, device, medium, and computer program product, relating to the field of computer technology. The method includes: acquiring target data of a target account; training parameters of a weight search model based on the state transition condition corresponding to the target data to obtain candidate model parameters; searching, based on the current state information of the target account, the weights corresponding to at least two evaluation indexes through a target weight search model to obtain a target weight relationship between the at least two evaluation indexes; and fusing the at least two evaluation indexes through the target weight relationship to obtain recommended content pushed to the target account. With the method, maximizing long-term user satisfaction can be taken as the optimization target, and the weight search model used to determine the evaluation-index weights can provide a more accurate weight relationship.

Description

Content recommendation method, device, equipment, medium and computer program product
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, a medium, and a computer program product for recommending content.
Background
The recommendation system is an essential component in the process by which a user obtains information (such as commodities, articles, pictures, videos, and the like). The recommendation system determines the content recommended to the user by ranking the content to be recommended. When the content to be recommended is ranked, the prediction indexes of the user for the content to be recommended are generally used as the ranking standard.
In the related art, if there are a plurality of prediction indexes, the weight relationship of each index when the content to be recommended is ranked needs to be determined, and the weight parameters are generally determined by manual parameter setting, grid search over hyper-parameters, and the like. In manual parameter setting, a plurality of groups of values are manually selected as candidate solution vectors of the fusion weight according to subjective experience, and the optimal solution vector is then selected through an A/B experiment; the grid search method exhaustively iterates over each possible combination in the selectable hyper-parameter sets, and then selects the optimal hyper-parameter combination as the fusion weight through cross validation.
However, manual parameter setting and grid search both face a large search space, the cost of obtaining user feedback is high, and expert knowledge is heavily relied on, so the efficiency of parameter determination is low; moreover, the corresponding recommendation model is limited to the user's preference for the current recommendation, which results in low accuracy when the recommendation system recommends content.
Disclosure of Invention
The embodiment of the application provides a content recommendation method, a content recommendation device, content recommendation equipment, a content recommendation medium and a computer program product, which can improve the recommendation accuracy of content recommended to a user. The technical scheme is as follows:
in one aspect, a method for recommending content is provided, and the method includes:
acquiring target data of a target account;
training parameters of a weight search model based on a state transition condition corresponding to the target data to obtain candidate model parameters, wherein the state transition condition is used for indicating account state changes of the target account under a historical weight relationship;
searching, based on the current state information of the target account, weights corresponding to at least two evaluation indexes through a target weight search model to obtain a target weight relationship between the at least two evaluation indexes, wherein the model parameters of the target weight search model are the candidate model parameters, and an evaluation index is an index used to predict the recommendation condition of recommended content;
and fusing the at least two evaluation indexes through the target weight relationship to obtain recommended content pushed to the target account.
In another aspect, an apparatus for recommending content is provided, the apparatus including:
the acquisition module is used for acquiring target data of a target account;
the training module is used for training parameters of the weight search model based on a state transition condition corresponding to the target data to obtain candidate model parameters, wherein the state transition condition is used for indicating account state change of the target account under a historical weight relationship;
the determining module is used for searching, based on the current state information of the target account, weights corresponding to at least two evaluation indexes through a target weight search model to obtain a target weight relationship between the at least two evaluation indexes, wherein the model parameters of the target weight search model are the candidate model parameters, and an evaluation index is an index used to predict the recommendation condition of recommended content;
and the recommendation module is used for fusing the at least two evaluation indexes through the target weight relationship to obtain recommended content pushed to the target account.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the content recommendation method described in any of the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the program code is loaded and executed by a processor to implement the content recommendation method described in any of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to make the computer device execute the recommendation method of the content described in any of the above embodiments.
The technical scheme provided by the application at least comprises the following beneficial effects:
training to obtain a target weight search model according to a state transition condition corresponding to target data of a target account, determining a target weight relation between evaluation indexes according to current state information of the target account and the target weight search model when content recommendation needs to be performed on the target account, fusing at least two evaluation indexes through the target weight relation to determine recommended content, and pushing the recommended content to a terminal corresponding to the target account. In other words, the weight search model is trained based on the change condition of the account state of the target account indicated by the target data under the historical weight relationship, so that the long-term satisfaction of the user can be maximized as an optimization target, and the weight search model for determining the evaluation index can provide a more accurate weight relationship.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for recommending content provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method for recommending content provided by another exemplary embodiment of the present application;
FIG. 4 is a block diagram of a batch reinforcement learning model framework provided by an exemplary embodiment of the present application;
FIG. 5 is a flow chart of a method for recommending content provided by another exemplary embodiment of the present application;
FIG. 6 is a flow diagram of a security check module provided in an exemplary embodiment of the present application;
FIG. 7 is a block diagram of a recommendation system provided in an exemplary embodiment of the present application;
FIG. 8 is a block diagram of a content recommendation apparatus according to an exemplary embodiment of the present application;
FIG. 9 is a block diagram of a content recommendation apparatus according to another exemplary embodiment of the present application;
FIG. 10 is a schematic structural diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms referred to in the embodiments of the present application are briefly described:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.
Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
In the embodiments of the present application, reinforcement learning is applied to the content recommendation method, solving the problem that traditional recommendation methods cannot model long-term user satisfaction; long-term user satisfaction can thus be maximized as the optimization target, improving the accuracy of content recommendation.
In combination with the above explanations, the application scenarios of the embodiments of the present application are schematically illustrated, and the content recommendation method provided by the present application can be applied to the following scenarios.
First, the content recommendation method can be applied to a short video platform. The short video platform can push short video content to the user according to user preference. Illustratively, the platform server of the short video platform includes an offline model training module and an online recommendation module. The offline model training module trains the model parameters of the weight search model according to the target data of the user account and issues the trained model parameters to the online recommendation module, where the target data includes the user account's historical watch-duration records, historical like records, historical comment records, and the like for short videos. When a user terminal requests short video recommendations from the platform server, the online recommendation module inputs the current state information of the user account into the weight search model and outputs the weight relationship corresponding to each evaluation index, predicts the estimated value corresponding to each evaluation index to generate the corresponding proxy function, ranks the short video contents in the set of short videos to be recommended through the proxy function to obtain a target ranking queue, and pushes the target ranking queue to the user terminal, which then presents the short video contents according to the target ranking queue.
Second, the content recommendation method can be applied to an e-commerce platform. The e-commerce platform can push commodity content to users. Illustratively, the platform server of the e-commerce platform includes an offline model training module and an online recommendation module. The offline model training module trains the model parameters of the weight search model according to the target data of the user account and issues the trained model parameters to the online recommendation module, where the target data includes the user account's historical commodity purchase records, historical commodity browsing records, historical commodity collection records, shopping cart records, and the like. The model parameters from the offline model training module are applied to the weight search model of the online recommendation module and used to determine the recommended commodity content; the platform server feeds the determined recommended commodity content back to the user terminal, and the user terminal displays the recommended commodities.
Third, the content recommendation method can be applied to a social platform. The social platform comprises an interactive social platform for sharing short real-time information. Illustratively, a platform server of the social platform includes an offline model training module and an online recommendation module, the offline model training module trains model parameters of the weight search model according to target data of the user account, and issues the trained model parameters to the online recommendation module, where the target data includes historical information issue records, historical approval records, historical forwarding records, historical comment records, and the like of the user account. The model parameters of the off-line model training module are applied to the weight search model of the on-line recommendation module and are applied to the determination of the recommendation information content, the platform server feeds back the determined recommendation information content to the user terminal, and the user terminal displays the recommendation information.
The content recommendation method may also be applied to other scenarios, such as a random recommendation function of a music platform, a favorite recommendation function of an article platform, and the like, and the three scenarios are only illustrated here, and no limitation is imposed on a specific application scenario.
Referring to fig. 1, a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application is shown. The implementation environment includes: a terminal 110, a server 120 and a communication network 130.
The terminal 110 includes various types of terminal devices such as a mobile phone, a tablet computer, a desktop computer, a laptop computer, and a vehicle-mounted terminal. The target application in the terminal 110 requests the server 120 for a content recommendation service, illustratively, the target application includes a short video application, an e-commerce application, a social application, an article application, and other applications capable of providing content recommendation, and the target application may be a stand-alone application program, a web application, or an applet in a host program, which is not limited herein. The terminal 110 is further configured to record behavior data of the user on the target application, and upload the behavior data to the server 120.
The server 120 is configured to provide the content recommendation service. The server 120 receives the behavior data uploaded by the terminal 110 and stores it in correspondence with the user account for the offline model training module. The offline model training module trains the weight search model according to the stored target data of the target account, and the model parameters obtained by training are applied to the weight search model of the online recommendation module. After the server 120 receives a content recommendation request from the terminal 110, the weight relationship corresponding to each evaluation index is determined through the weight search model, a proxy function is determined through the weight relationship, the contents to be recommended are ranked through the proxy function, the finally obtained target ranking queue is sent to the terminal 110, and the terminal 110 recommends content according to the target ranking queue.
It should be noted that the server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Cloud Technology is a hosting technology that unifies a series of resources such as hardware, software, and network in a wide area network or a local area network to realize computation, storage, processing, and sharing of data. Cloud technology is the general term for the network technologies, information technologies, integration technologies, management platform technologies, application technologies, and the like applied based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of a technical network system require a large amount of computing and storage resources, such as video websites, picture websites, and other web portals. With the rapid development and application of the internet industry, each article may have its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industrial data need strong system background support, which can only be realized through cloud computing.
In some embodiments, the server 120 described above may also be implemented as a node in a blockchain system. The Blockchain (Blockchain) is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The block chain, which is essentially a decentralized database, is a string of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, which is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform can include processing modules such as user management, basic service, smart contract, and operation monitoring. The user management module is responsible for the identity information management of all blockchain participants, including the generation and maintenance of public and private keys (account management), key management, maintenance of the correspondence between a user's real identity and blockchain address (authority management), and, under authorization, supervision and auditing of the transactions of certain real identities as well as rule configuration for risk control (risk control audit). The basic service module is deployed on all blockchain node devices and is used to verify the validity of service requests and record valid requests to storage after consensus is completed; for a new service request, the basic service first performs interface adaptation analysis and authentication processing (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication) after encryption, and performs recording and storage. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; developers can define contract logic through a programming language and issue it to the blockchain (contract registration), the contract is triggered by a key or other events and executed according to the logic of the contract terms, and the module also provides functions for upgrading and canceling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract setting, and cloud adaptation during product release, as well as visual output of the real-time state during product operation.
Illustratively, the terminal 110 and the server 120 are connected via a communication network 130.
Referring to fig. 2, a method for recommending content according to an embodiment of the present application is shown, in the embodiment of the present application, the method is applied to a server shown in fig. 1, and the method includes:
step 201, acquiring target data of a target account.
The content recommendation method in the embodiment of the application is a personalized recommendation method for a target account, and the target account is a user account.
Illustratively, the target data is historical user behavior data, and the historical user behavior data is data recorded by the user terminal according to account behavior of the target account under the condition of user authorization permission. Optionally, taking the application of the method in the present application to a short video platform as an example, the account behavior includes but is not limited to at least one of a praise operation, a collection operation, a video publishing operation, a watching duration, and the like.
And the server correspondingly stores the account behavior uploaded by the terminal and the target account into a user database. In some embodiments, the server is further configured to perform a preprocessing operation on the account behavior, where the preprocessing operation converts the account behavior into sample data for model training, for example, into user portrait data, viewing trajectory data, and the like.
Step 202, training parameters of the weight search model based on the state transition condition corresponding to the target data to obtain candidate model parameters.
The state transition condition is used for indicating the account state change of the target account under the historical weight relationship. The account state change refers to the change of user behavior under different recommended contents, where the different recommended contents are the recommended contents corresponding to different weight relationships. Illustratively, after the target data is processed, the historical state information and the historical weight relationship of the target account can be obtained, and the historical state information and the historical weight relationship correspond to the same historical time period.
Illustratively, the state information includes static information and historical behavior information of the target account, where the static information includes user age, user gender, and geographic location information of the user corresponding to the target account, and the historical behavior information includes a praise behavior, a collection behavior, a release behavior, a viewing duration, and the like of the target account in the target time period.
The historical weight relationship is the weight relationship between evaluation indexes obtained through a historical weight search model when the target account carries out content recommendation in a historical time period, and the evaluation indexes are indexes for predicting the recommendation condition of recommended content.
Illustratively, the target data can generate at least one set of corresponding data of < historical state information, historical weight relation > according to different historical periods.
In some embodiments, the historical state information and the historical weight relationship further correspond to reward information. The reward information (reward) is determined by the account behavior obtained after content recommendation is performed based on the historical weight relationship; that is, after the server determines the historical weight relationship according to the historical state information and determines the recommended content according to the historical weight relationship, the terminal pushes the recommended content, records the user behavior, and sends feedback operation information corresponding to the recommended content to the server. The feedback operation information includes, but is not limited to, viewing duration, like behavior, comment behavior, null behavior, and the like, and the server can determine the reward information corresponding to the historical state information and the historical weight relationship according to the feedback operation information. In some embodiments, the reward information r may be represented by formula one, where s is the historical state information, a is the historical weight relationship, and $m_i$ is the i-th piece of feedback operation information, i being a positive integer. The aggregation method g over the pieces of feedback operation information may be a weighted average, a normalized sum, or another calculation method, and is not specifically limited here.

Formula one:

$$r(s, a) = g(m_1, m_2, \ldots)$$
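As an illustration of formula one, the following minimal Python sketch aggregates feedback signals into a scalar reward by a weighted sum; the field names and weight values are hypothetical, since the application leaves the concrete aggregation g open.

```python
from typing import Dict

# Hypothetical feedback weights; the application allows a weighted average,
# a normalized sum, or any other aggregation g.
FEEDBACK_WEIGHTS: Dict[str, float] = {
    "watch_seconds": 0.01,  # viewing duration
    "like": 1.0,            # like behavior
    "comment": 2.0,         # comment behavior
}

def reward(feedback: Dict[str, float]) -> float:
    """r(s, a) = g(m_1, m_2, ...): weighted sum of the feedback signals."""
    return sum(FEEDBACK_WEIGHTS[k] * v for k, v in feedback.items()
               if k in FEEDBACK_WEIGHTS)

# Example: a view with 30 seconds of watch time and one like.
print(reward({"watch_seconds": 30.0, "like": 1.0, "comment": 0.0}))  # 1.3
```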
In some embodiments, the historical status information and the historical weight relationship further correspond to next status information, where the next status information is used to indicate status information corresponding to the target account after content recommendation is performed by the historical weight relationship.
In some embodiments, the state transition condition of the target account under the historical weight relationship is indicated by the difference condition between the historical state information and the next state information, and the weight search model is trained based on the state transition condition. Alternatively, the model parameters of the trained weight search model may be randomly initialized model parameters, or may be weight parameters corresponding to a historical weight search model, and are not limited herein.
And 203, searching weights corresponding to the at least two evaluation indexes through a target weight searching model based on the current state information of the target account to obtain a target weight relation between the at least two evaluation indexes.
In some embodiments, the model parameters of the target weight search model are candidate model parameters obtained by training.
Illustratively, the server acquires current state information of the target account, inputs the current state information into the target weight search model, and outputs the current state information to obtain a target weight relationship between the evaluation indexes.
Optionally, the evaluation indexes may be preset by the system, or may be set individually for the target account. The evaluation indexes include at least two indexes such as the predicted viewing duration, the predicted play-completion rate, the predicted like condition, and the predicted comment condition of the target account for recommended content.
And 204, fusing at least two evaluation indexes through the target weight relationship to obtain recommended content pushed to the target account.
In the embodiment of the present application, the recommended content may be determined in the following manners:
Firstly, evaluation scores of the content to be recommended under the at least two evaluation indexes are obtained, a fusion score of the content to be recommended is calculated according to the evaluation scores and the target weight relationship between the evaluation indexes, and the content to be recommended is pushed to the terminal as recommended content in response to the fusion score reaching a recommendation score threshold. In some embodiments, each evaluation index corresponds to a score prediction model, and the score prediction model can determine the evaluation score of the content to be recommended under the corresponding evaluation index according to the current state information of the target account.
Secondly, the prediction scores corresponding to the at least two evaluation indexes are obtained, the candidate recommended content set is sorted based on the target weight relationship and the prediction scores corresponding to the at least two evaluation indexes to obtain a target ranking list, the target ranking list is issued to the terminal corresponding to the target account, and the terminal pulls the recommended content from the server according to the target ranking list. An index prediction model can obtain the prediction score corresponding to each evaluation index from the current state information of the target account. The index prediction model may be a Multi-Task Learning (MTL) model or another model, which is not limited here.
In some embodiments, the prediction scores of the evaluation indexes may be combined through a multi-objective ranking fusion method, which performs weighted fusion on the prediction scores to obtain the ranking score finally used for ranking. Illustratively, the multi-objective ranking fusion method is implemented by a proxy function, and the proxy function may adopt linear weighted fusion or weighted multiplicative fusion, which is not limited here. In some embodiments, the proxy function has the form shown in formula two, where score is the ranking score, $\alpha = (\alpha_1, \alpha_2, \ldots)$ is the model parameter to be searched, and $\text{score}_i$ is the estimated value of the i-th target output by the MTL model, i being a positive integer.

Formula two:

$$\text{score} = f_{\alpha}(\text{score}_i)$$
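As a concrete illustration, the following minimal Python sketch implements linear weighted fusion, one possible form of the proxy function $f_{\alpha}$ in formula two (the application also allows weighted multiplicative fusion); all names and values are illustrative.

```python
from typing import List, Sequence

def fuse_scores(alpha: Sequence[float], scores: Sequence[float]) -> float:
    """Linear weighted fusion: score = sum_i alpha_i * score_i."""
    assert len(alpha) == len(scores)
    return sum(a * s for a, s in zip(alpha, scores))

def rank_candidates(alpha: Sequence[float],
                    candidate_scores: List[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted by fused ranking score, best first."""
    fused = [fuse_scores(alpha, s) for s in candidate_scores]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)

# Example: two evaluation indexes (say, predicted viewing duration and
# predicted like condition) with target weight relationship alpha, and the
# MTL-estimated scores of three candidate items.
alpha = [0.7, 0.3]
print(rank_candidates(alpha, [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]))  # [1, 2, 0]
```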
Optionally, the recommended content includes, but is not limited to, content in the form of text, video, voice, and the like.
In some embodiments, the terminal displays the recommended content after receiving the recommended content, records feedback operation information of the target account for the recommended content, and sends the feedback operation information to the server. After receiving the feedback operation information of the target account on the recommended content, the server generates corresponding target data based on the feedback operation information, and correspondingly stores the target data and the target account into the database.
To sum up, according to the content recommendation method provided in the embodiment of the present application, a target weight search model is obtained through training according to a state transition condition corresponding to target data of a target account, when content recommendation needs to be performed on the target account, a target weight relationship between evaluation indexes is determined according to current state information of the target account and the target weight search model, at least two evaluation indexes are fused through the target weight relationship to be used for determining recommended content, and the recommended content can be pushed to a terminal corresponding to the target account. In other words, the weight search model is trained based on the change condition of the account state of the target account indicated by the target data under the historical weight relationship, so that the long-term satisfaction of the user can be maximized as an optimization target, and the weight search model for determining the evaluation index can provide a more accurate weight relationship.
Referring to fig. 3, a method for recommending content according to an embodiment of the present application is shown, in which a training process of a weight search model is schematically illustrated, and the method includes:
step 301, acquiring target data of a target account.
Illustratively, the target data is data recorded by the user terminal according to the account behavior of the target account. Optionally, taking the application of the method in the present application to a short video platform as an example, the account behavior includes but is not limited to at least one of a praise operation, a collection operation, a video publishing operation, a watching duration, and the like.
And the server correspondingly stores the account behavior uploaded by the terminal and the target account into a user database.
Step 302, generating state transition data according to the target data.
The state transition data is used for indicating the change of the historical account state of the target account; the change of the historical account state is determined after content recommendation is performed on the historical state information under the historical weight relationship, and the historical state information and the historical weight relationship correspond to the same historical time period.
In the embodiment of the present application, the state transition data is composed of quadruplets (s, a, r, s'), that is, the state transition data includes historical state information (state), historical weight relationship (action), reward information (reward), and next state information.
The state information comprises static information and historical behavior information of the target account, wherein the static information comprises user age, user gender, geographical location information of the user and the like corresponding to the target account, and the historical behavior information comprises approval behavior, collection behavior, release behavior, watching duration and the like of the target account in a target time period.
The reward information is determined by account state change obtained after content recommendation is performed based on the historical weight relationship, namely, the server determines the historical weight relationship according to the historical state information, and after the recommendation content is determined according to the historical weight relationship, the terminal pushes the recommendation content, records user behaviors, and sends feedback operation information corresponding to the recommendation content to the server, wherein the feedback operation information comprises but is not limited to viewing duration, approval behaviors, comment behaviors, null behaviors and the like, and the server can determine the reward information corresponding to the historical state information and the historical weight relationship according to the feedback operation information.
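A minimal Python sketch of this quadruple as a data structure (the field types are an assumption, treating state and weight vectors as plain float lists):

```python
from typing import List, NamedTuple

class Transition(NamedTuple):
    """One state transition quadruple (s, a, r, s')."""
    state: List[float]       # historical state information s
    action: List[float]      # historical weight relationship a
    reward: float            # reward information r
    next_state: List[float]  # next state information s'
```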
Step 303, training the parameters of the weight search model based on the state transition data to obtain candidate model parameters.
Illustratively, the weight search model may be implemented by different deep learning models, such as a reinforcement learning model, a transfer learning model, an inductive learning model, a learning-from-instruction model, and the like, according to different application scenarios.
In the embodiment of the present application, the weight search model adopts a Batch-Constrained Q-Learning (BCQ) model to reduce the influence of overestimation and extrapolation errors in the reinforcement learning training process. Schematically, the overall framework of the BCQ model is shown in fig. 4: the BCQ model 400 includes a Variational Auto-Encoder (VAE) network 410, a perturbation network 420, and an evaluation network 430. The VAE network 410 includes an encoder 411 and a decoder 412 and is configured to generate n weight relationships that follow the distribution of the training data, where n is a positive integer. The perturbation network 420 is used to determine the weight relationship that maximizes the cumulative reward information, and the evaluation network 430 is used to calculate the cumulative reward corresponding to a weight relationship and to reduce the temporal-difference error of the weight relationships output by the weight search model.
Illustratively, in the training process of the BCQ model, training of a target number of cycles needs to be completed, where the target number of cycles may be preset by the system, or may be determined according to a loss function of the model, for example, after a single cycle of training is completed, a loss value of the model is determined according to the loss function, and in response to the loss value being smaller than a preset threshold, convergence of the model is determined, that is, the entire training process of the model is completed.
In a single-cycle training process of the BCQ model, the VAE network is first trained: the parameters of the VAE network are trained based on the historical state information and the historical weight relationship to obtain first network parameters, where the first network parameters are the network parameters obtained by training the VAE network in the current training cycle. Schematically, the historical state information is input into the variational self-coding network, which outputs a first training weight relationship; the parameters of the variational self-coding network are iteratively trained based on the difference between the first training weight relationship and the historical weight relationship to obtain the first network parameters. In some embodiments, the network parameter ω of the VAE is updated according to formula three, where s and a are the historical state information and the historical weight relationship from the state transition quadruple, a' is the predicted weight relationship output by the VAE network (that is, the first training weight relationship), z is the latent variable produced by the encoder and follows a Gaussian distribution, $\mathcal{N}(0, 1)$ denotes the standard normal distribution, argmin denotes minimizing the formula, and $D_{KL}$ denotes the relative entropy.

Formula three:

$$\omega \leftarrow \arg\min_{\omega} \sum_{(s,a)} (a - a')^2 + D_{KL}\big(E_{\omega_1}(z \mid s, a) \,\|\, \mathcal{N}(0, 1)\big), \quad a' = D_{\omega_2}(s, z),\ z \sim E_{\omega_1}(s, a)$$

where $E_{\omega_1}$ is the encoder in the VAE network and $D_{\omega_2}$ is the decoder in the VAE network.
After the first network parameters are determined, the variational self-coding network under the first network parameters is used to generate n first weight relationships corresponding to the historical state information, where the first weight relationships conform to the data distribution of the state transition data and n is a positive integer. The parameters of the weight search model are then trained based on the n first weight relationships and the historical state information to obtain the final candidate model parameters.
In some embodiments, the subsequent perturbation network and evaluation network are trained through the first weight relationships and the historical state information. Illustratively, the perturbation weight relationships corresponding to the n first weight relationships are generated based on the perturbation initial parameters; the parameters of the evaluation network are trained based on the perturbation weight relationships and the state transition data to obtain second network parameters; third network parameters corresponding to the perturbation network are determined based on the second network parameters and the historical state information; and the parameters of the weight search model are trained based on the second network parameters and the third network parameters to obtain the candidate model parameters.
In some embodiments, the second network parameters are determined by the perturbation weight relationship, the reward information, and the next state information s' in the state transition quadruple. The perturbation weight relationship, the reward information, and the next state information are substituted into an evaluation optimization function to obtain the optimization target of the evaluation network; the cumulative reward information corresponding to the historical weight relationship and the second weight relationship is acquired; and the parameters of the evaluation network are trained based on the difference between the optimization target and the cumulative reward information, specifically by minimizing the Temporal-Difference (TD) error. In one example, the evaluation optimization function is represented by formula four, where r is the reward information in the state transition quadruple, γ is the future reward discount decay parameter with 0 < γ < 1, $a'_{i,p}$ is the perturbation weight relationship, λ is the dual-network balance weight, and the target evaluation networks $Q_{\theta'_1}$ and $Q_{\theta'_2}$ have the same network structure as the current evaluation networks $Q_{\theta_1}$ and $Q_{\theta_2}$.

Formula four:

$$y = r + \gamma \max_{a'_{i,p}} \Big[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a'_{i,p}) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a'_{i,p}) \Big]$$
and the updating formula corresponding to the network parameter theta of the current evaluation network is shown as a formula five, wherein s is a historical stateInformation, a is historical weight relation, y is obtained by formula four, QθTo evaluate the network, B is the current training sample.
The formula five is as follows: θ ← argminθ(s,a)∈B(y-Qθ(s,a))2
In some embodiments, the third network parameters are determined by the second network parameters, the historical state information s, and the weight relationship output under the first network parameters. That is, the historical state information is input into the variational self-coding network under the first network parameters, which outputs a second weight relationship; the perturbation initial parameters of the perturbation network are then trained based on the second network parameters, the historical state information, and the second weight relationship to obtain the third network parameters. When the current training cycle is the first one, the perturbation initial parameters are randomly initialized parameters of the perturbation network; otherwise, they are the parameters of the perturbation network obtained in the previous training cycle.
In one example, the update formula corresponding to the network parameter of the perturbation network is shown as formula six, where $a' = G_{\omega}(s)$ is the second weight relationship output by the variational self-coding network under the first network parameters, $\xi_{\phi}$ is the perturbation network, $Q_{\theta_1}$ is the evaluation network, B is the current training batch, and s is the historical state information.

Formula six:

$$\phi \leftarrow \arg\max_{\phi} \sum_{s \in B} Q_{\theta_1}\big(s,\ a' + \xi_{\phi}(s, a', \rho)\big), \quad a' = G_{\omega}(s)$$
in some embodiments, the BCQ model uses formula seven and formula eight to perform a delayed update on the target network, where τ is the update rate of the target network, φ is a network parameter of the perturbed network, and θ is a network parameter of the evaluated network.
The formula seven: θ '. about.τ θ + (1- τ) θ'
The formula eight: phi '. o.c.. tau.phi + (1-tau) phi'
In conclusion, the update training algorithm of the BCQ model is as follows:

(a) Determine the input data: the training sample set $\mathcal{B}$ (state transition quadruples generated from the target data of the target account), the number of training cycles T, the target network update rate τ, the mini-batch size N, the maximum perturbation amount ρ, the number of sampled actions n, and the dual-network balance weight λ.

(b) Initialize the parameters: initialize the evaluation networks $Q_{\theta_1}$ and $Q_{\theta_2}$, the perturbation network $\xi_{\phi}$, and the VAE model $G_{\omega} = \{E_{\omega_1}, D_{\omega_2}\}$ with random parameters $\theta_1$, $\theta_2$, $\phi$, and ω, and initialize the target networks $Q_{\theta'_1}$, $Q_{\theta'_2}$, and $\xi_{\phi'}$ with $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, and $\phi' \leftarrow \phi$.

(c) Cyclic training: for t = 1 … T:

(c1) sample N state transition quadruples (s, a, r, s') from $\mathcal{B}$;

(c2) encode and reconstruct the actions: $z \sim E_{\omega_1}(s, a)$, $a' = D_{\omega_2}(s, z)$;

(c3) update the VAE network through formula three;

(c4) generate n weight relationships for each sample according to the VAE network: $\{a_i \sim G_{\omega}(s')\}_{i=1}^{n}$;

(c5) generate the perturbation weight relationships: $\{a_{i,p} = a_i + \xi_{\phi}(s', a_i, \rho)\}_{i=1}^{n}$;

(c6) calculate y through formula four;

(c7) update the evaluation networks through formula five;

(c8) update the perturbation network through formula six;

(c9) update the target networks through formula seven and formula eight;

(c10) end.
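To make the procedure concrete, the following is a minimal PyTorch sketch of one BCQ update iteration, mirroring steps (c1) to (c9) above. It is an illustrative sketch, not the application's implementation: the network sizes, hyperparameter values, and the tanh-based clamping of the perturbation to ±ρ are assumptions, and details such as episode-termination masks are omitted.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

S_DIM, A_DIM, Z_DIM = 16, 4, 8  # illustrative state / weight / latent sizes

class VAE(nn.Module):
    """Generative network G_w: encodes (s, a) and reconstructs the action a."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU())
        self.mean, self.log_std = nn.Linear(64, Z_DIM), nn.Linear(64, Z_DIM)
        self.dec = nn.Sequential(nn.Linear(S_DIM + Z_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, A_DIM))

    def forward(self, s, a):
        h = self.enc(torch.cat([s, a], 1))
        mean, log_std = self.mean(h), self.log_std(h).clamp(-4, 4)
        z = mean + log_std.exp() * torch.randn_like(log_std)  # reparameterize
        return self.dec(torch.cat([s, z], 1)), mean, log_std

    def decode(self, s):  # sample a clipped latent, as in standard BCQ
        z = (0.5 * torch.randn(s.shape[0], Z_DIM)).clamp(-0.5, 0.5)
        return self.dec(torch.cat([s, z], 1))

def mlp(i, o):
    return nn.Sequential(nn.Linear(i, 64), nn.ReLU(), nn.Linear(64, o))

class Perturb(nn.Module):
    """Perturbation network xi_phi; output clamped to [-rho, rho] via tanh."""
    def __init__(self, rho=0.05):
        super().__init__()
        self.net, self.rho = mlp(S_DIM + A_DIM, A_DIM), rho

    def forward(self, s, a):
        return self.rho * torch.tanh(self.net(torch.cat([s, a], 1)))

class Q(nn.Module):
    """Evaluation network Q_theta(s, a)."""
    def __init__(self):
        super().__init__()
        self.net = mlp(S_DIM + A_DIM, 1)

    def forward(self, s, a):
        return self.net(torch.cat([s, a], 1))

vae, perturb, q1, q2 = VAE(), Perturb(), Q(), Q()
perturb_t, q1_t, q2_t = map(copy.deepcopy, (perturb, q1, q2))
opt_vae = torch.optim.Adam(vae.parameters())
opt_p = torch.optim.Adam(perturb.parameters())
opt_q = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()))

def bcq_update(s, a, r, s2, gamma=0.99, lam=0.75, n=10, tau=0.005):
    # (c2)-(c3) VAE update, formula three: reconstruction loss + KL to N(0, 1).
    recon, mean, log_std = vae(s, a)
    kl = -0.5 * (1 + 2 * log_std - mean.pow(2) - (2 * log_std).exp()).mean()
    opt_vae.zero_grad(); (F.mse_loss(recon, a) + 0.5 * kl).backward(); opt_vae.step()

    with torch.no_grad():
        # (c4)-(c5) n candidate weight relationships per next state, perturbed.
        s2r = s2.repeat_interleave(n, dim=0)
        a2 = vae.decode(s2r)
        a2 = a2 + perturb_t(s2r, a2)
        # (c6) formula four: lambda-weighted double-Q target, then max over n.
        qt = (lam * torch.min(q1_t(s2r, a2), q2_t(s2r, a2))
              + (1 - lam) * torch.max(q1_t(s2r, a2), q2_t(s2r, a2)))
        y = r + gamma * qt.view(-1, n).max(1, keepdim=True)[0]

    # (c7) formula five: minimize the TD error of both evaluation networks.
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # (c8) formula six: the perturbation network maximizes Q around VAE actions.
    sampled = vae.decode(s).detach()
    p_loss = -q1(s, sampled + perturb(s, sampled)).mean()
    opt_p.zero_grad(); p_loss.backward(); opt_p.step()

    # (c9) formulas seven and eight: delayed soft update of the target networks.
    for net, net_t in ((q1, q1_t), (q2, q2_t), (perturb, perturb_t)):
        for p, pt in zip(net.parameters(), net_t.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)

# One illustrative update on a random mini-batch of N = 32 quadruples (s, a, r, s').
bcq_update(torch.randn(32, S_DIM), torch.randn(32, A_DIM),
           torch.randn(32, 1), torch.randn(32, S_DIM))
```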
and 304, updating the parameters of the weight search model of the (i-1) th time through the candidate model parameters to obtain a target weight search model.
The candidate model parameters are parameters of the candidate ith weight search model, and i is a positive integer. And after the candidate model parameters are determined, updating the parameters of the preposed weight search model, wherein the preposed weight search model is the (i-1) th weight search model, and finally obtaining the target weight search model.
And 305, searching weights corresponding to the at least two evaluation indexes through a target weight searching model based on the current state information of the target account to obtain a target weight relation between the at least two evaluation indexes.
Illustratively, the server acquires current state information of the target account, inputs the current state information into the target weight search model, and outputs the current state information to obtain a target weight relationship between the evaluation indexes.
Optionally, the evaluation indexes may be preset by the system, or may be set individually for the target account. The evaluation indexes include at least two indexes such as the predicted viewing duration, the predicted play-completion rate, the predicted like condition, and the predicted comment condition of the target account for recommended content.
And step 306, fusing the at least two evaluation indexes through the target weight relationship to obtain recommended content pushed to the target account.
In the embodiment of the application, the prediction scores corresponding to the at least two evaluation indexes are obtained, the candidate recommended content sets are sorted based on the target weight relationship and the prediction scores corresponding to the at least two evaluation indexes to obtain a target sorting list, the target sorting list is issued to the terminal corresponding to the target account, and the terminal pulls the recommended content from the server according to the target sorting list. The index prediction model can obtain the prediction score corresponding to each evaluation index through the current state information of the target account. The index prediction model may be an MTL model or another model, and is not limited herein.
To sum up, according to the content recommendation method provided in the embodiment of the present application, a target weight search model is obtained through training according to a state transition condition corresponding to target data of a target account, when content recommendation needs to be performed on the target account, a target weight relationship between evaluation indexes is determined according to current state information of the target account and the target weight search model, at least two evaluation indexes are fused through the target weight relationship to be used for determining recommended content, and the recommended content can be pushed to a terminal corresponding to the target account. In other words, the weight search model is trained based on the change condition of the account state of the target account indicated by the target data under the historical weight relationship, so that the long-term satisfaction of the user can be maximized as an optimization target, and the weight search model for determining the evaluation index can provide a more accurate weight relationship.
In the embodiment of the application, the off-line reinforcement learning method is adopted to perform the weight search of multi-target sequencing fusion, and the training sample is generated through the historical watching behavior of the user, so that the cost for obtaining the feedback of the user is reduced.
Please refer to fig. 5, which illustrates a content recommendation method according to an embodiment of the present application, in the embodiment of the present application, a recommendation system in a server includes an offline model training subsystem and an online recommendation subsystem, where the offline model training subsystem is configured to train parameters of a weight search model according to target data to obtain candidate model parameters. The online recommendation subsystem is used for acquiring a recommendation request of the terminal in real time and recommending contents according to the recommendation request. The method comprises the following steps:
step 501, acquiring target data of a target account.
In the embodiment of the present application, the offline model training subsystem includes three main modules: a data preprocessing and sample generation module, a model training module, and a security check module. The data preprocessing and sample generation module aims to convert the data in the user database into input suitable for the model, where the data in the user database is the feedback operation information of the target account received by the online recommendation subsystem.
Step 502, training parameters of the weight search model based on the state transition condition corresponding to the target data to obtain candidate model parameters.
In the embodiment of the present application, the parameter training of the weight search model is completed in the model training module, the weight search model is implemented by using a BCQ model, and the specific training process is shown in steps 302 to 303 and is not described herein again.
Step 503, obtaining the test weight distribution corresponding to the candidate weight search model.
The candidate weight search model is a weight search model composed of candidate model parameters.
In the embodiment of the present application, the test weight distribution is a distribution of a weight relationship output after the test sample is input to the candidate weight search model.
In the embodiment of the present application, the candidate weight search model obtained by the offline model training subsystem needs to pass the security check module, which aims to constrain the difference between the new model and the previous model and prevent drastic model changes from affecting the online service.
Step 504, obtain historical weight distribution.
The historical weight distribution corresponds to the weight search model of the (i-1) th time, namely, the historical weight distribution is the distribution situation of the weight relation of the test sample obtained by the weight search model of the (i-1) th time.
And 505, in response to the fact that the difference between the test weight distribution and the historical weight distribution meets the constraint requirement, updating the parameters of the weight search model of the (i-1) th time through the candidate model parameters to obtain a target weight search model.
In some embodiments, after each round of model training, the security check module compares the weight relationship distribution of the new model against the historical distribution, requiring that the change in each dimension not exceed a preset threshold e. If the weight relationship distribution meets this requirement within at most R rounds of model training, the new model is issued; otherwise, the original model parameters are kept unchanged.
Step 506, in response to the difference between the test weight distribution and the historical weight distribution failing to meet the constraint requirement, taking the (i-1)-th weight search model as the target weight search model.
Referring to fig. 6, a flow chart of the security check module is shown, including: model training 601; computing the statistical weight relationship distribution 602; judging whether the stability check is met 603, and if so, executing 604, otherwise executing 606; exporting the model 604 and updating the historical weight distribution 605; judging whether the current retry count is less than R 606, and if so, executing 601, otherwise ending; publishing the model 607.
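A minimal Python sketch of this check loop is given below; the threshold value e = 0.05, the retry limit R = 3 and the train_fn callback are illustrative assumptions, not values or interfaces prescribed by the embodiment:

```python
import numpy as np

def passes_stability_check(test_dist: np.ndarray,
                           hist_dist: np.ndarray,
                           e: float) -> bool:
    """The new model is acceptable only if no dimension of its weight
    relationship distribution moved by more than the preset threshold e."""
    return bool(np.all(np.abs(test_dist - hist_dist) <= e))

def security_check(train_fn, hist_dist: np.ndarray, e: float = 0.05, R: int = 3):
    """Retry model training up to R times (fig. 6); on success, publish
    the new parameters and update the historical distribution, otherwise
    keep the original model parameters unchanged."""
    for _ in range(R):
        params, test_dist = train_fn()                # model training 601
        if passes_stability_check(test_dist, hist_dist, e):
            return params, test_dist                  # publish model 607
    return None, hist_dist                            # keep original model
```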
Step 507, searching the weights corresponding to the at least two evaluation indexes through the target weight search model based on the current state information of the target account, to obtain the target weight relationship between the at least two evaluation indexes.
In the embodiment of the application, the online recommendation subsystem comprises a proxy function calculation module, a recommendation item ranking and issuing module, and a user behavior log recording module. After the terminal sends a recommendation request, the online request server sends the current state information corresponding to the target account to the target weight search model, which outputs a weight relationship vector, namely the target weight relationship.
Illustratively, the target weight search model is a BCQ model, which includes a VAE network $G_\omega$ generating the weight relationship (action), a disturbance network $\xi_\phi$, and a dual evaluation network $\{Q_{\theta_1}, Q_{\theta_2}\}$, wherein $G_\omega$ comprises a coding module $E_{\omega_1}$ and a decoding module $D_{\omega_2}$. Given the current state information, the VAE network generates n weight relationships conforming to the distribution of the training data, $\{a_i \sim G_\omega(s)\}_{i=1}^{n}$. The disturbance network $\xi_\phi$ takes as input the current state information and the $a_i$ generated by the VAE network, and outputs the perturbed action $a_{i,p} = a_i + \xi_\phi(s, a_i, \Phi)$; this network aims at choosing the appropriate $a_p$ so that the cumulative reward $Q_\theta(s, a)$ is maximized, where $\theta$ is the evaluation network parameter. This process may be formally expressed by formula nine.

Formula nine:

$$\pi(s) = \underset{a_i + \xi_\phi(s, a_i, \Phi)}{\arg\max}\; Q_\theta\big(s,\, a_i + \xi_\phi(s, a_i, \Phi)\big), \qquad \{a_i \sim G_\omega(s)\}_{i=1}^{n}$$
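For intuition, the following PyTorch sketch shows how formula nine selects a weight relationship; vae_decode, perturb and q_net are illustrative placeholders standing in for $G_\omega$, $\xi_\phi$ and $Q_\theta$, and n = 10 is an arbitrary sample count:

```python
import torch

def select_weights(state: torch.Tensor, vae_decode, perturb, q_net, n: int = 10):
    """BCQ-style action selection (formula nine): sample n candidate
    weight vectors from the VAE decoder, perturb each one, and keep the
    candidate with the highest estimated cumulative reward."""
    states = state.unsqueeze(0).repeat(n, 1)              # (n, state_dim)
    candidates = vae_decode(states)                       # a_i ~ G_w(s)
    perturbed = candidates + perturb(states, candidates)  # a_i + xi_phi(s, a_i)
    q_values = q_net(states, perturbed).squeeze(-1)       # Q_theta(s, a_{i,p})
    return perturbed[torch.argmax(q_values)]              # target weight relation
```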
Step 508, acquiring the prediction scores corresponding to the at least two evaluation indexes.
In the embodiment of the application, the online request server inputs the current state information of the target account into the MTL ranking model, and can output the prediction score corresponding to each evaluation index.
Step 509, determining the function parameters corresponding to the target proxy function based on the target weight relationship and the prediction scores corresponding to the at least two evaluation indexes.
The proxy function is used for determining content recommendation scores corresponding to the candidate recommended contents.
In this embodiment of the present application, the online request server inputs the target weight relationship output by the target weight search model and the prediction scores corresponding to the evaluation indexes output by the MTL ranking model into the proxy function calculation module, so as to determine the function parameters corresponding to the proxy function. In one example, the proxy function is shown as formula ten, where $\alpha_i$ is the target weight relationship, $\text{score}_i$ is the estimate output for the i-th evaluation index by the MTL ranking model, and $\beta_i$ is a bias constant specified according to a priori knowledge.

Formula ten:

$$\text{score} = \sum_i \alpha_i \log(\text{score}_i + \beta_i)$$
Step 510, determining the recommendation scores of the candidate recommended contents in the candidate recommended content set based on the target proxy function.
In the embodiment of the application, the recommendation score of the candidate recommended content in the candidate recommended content set can be determined through the proxy function.
Step 511, generating a target ranking list based on the recommendation scores of the candidate recommended contents.
Illustratively, the recommendation item ranking and issuing module generates the target ranking list according to the recommendation scores obtained from the proxy function, and the contents to be recommended in the target ranking list are sorted by recommendation score. Optionally, the target ranking list may include all of the contents in the set of contents to be recommended, or only part of them, which is not limited herein.

Step 512, issuing the target ranking list to the terminal corresponding to the target account.

Illustratively, the recommendation item ranking and issuing module issues the generated target ranking list to the terminal corresponding to the target account. In some embodiments, the target ranking list includes content identifiers corresponding to the recommended contents, and the terminal pulls the contents from the server according to these identifiers, in the recommendation order indicated by the target ranking list.
In some embodiments, the terminal displays the recommended content after receiving it, records the feedback operation information of the target account for the recommended content, and sends the feedback operation information to the server. After receiving the feedback operation information of the target account on the recommended content, the user operation recording module in the online recommendation subsystem generates corresponding target data based on the feedback operation information and stores the target data into the user database in the offline model training subsystem.
Schematically, a structure of a recommendation system 700 provided in the embodiment of the present application is shown in fig. 7, and the recommendation system 700 includes an offline model training subsystem 710 and an online recommendation subsystem 720, where the offline model training subsystem 710 includes a user database 711, a data preprocessing module 712, a sample database 713, a model training module 714, and a security check module 715. The online recommendation subsystem 720 includes a prediction model module 721, a request server 722, a proxy function calculation module 723, a recommended item ranking and issuing module 724, and an operation recording module 725, wherein the prediction model module 721 includes a reinforcement learning model and an MTL model.
The feedback operation information corresponding to the target account in the terminal 730 is sent to the operation recording module 725, which forwards the data to the user database 711; the user database 711 stores the feedback operation information as the target data corresponding to the target account. The data preprocessing module 712 acquires the target data from the user database 711, preprocesses it into training sample data, and stores the samples in the sample database 713. The model training module 714 reads the training sample data from the sample database 713, trains the reinforcement learning model to obtain candidate model parameters, and sends them to the security check module 715; if the candidate model parameters meet the constraint requirements, the security check module 715 updates them into the reinforcement learning model of the prediction model module 721 in the online recommendation subsystem 720. After receiving a recommendation request sent by the terminal 730, the request server 722 sends the current state information corresponding to the target account to the prediction model module 721; the reinforcement learning model in the prediction model module 721 outputs the target weight relationship corresponding to the evaluation indexes, and the MTL model outputs the prediction scores corresponding to the evaluation indexes. The request server 722 sends the target weight relationship and the prediction scores to the proxy function calculation module 723, which determines the function parameters corresponding to the proxy function. The recommendation item ranking and issuing module 724 determines, according to the function parameters, the recommendation scores corresponding to the recommended contents in the content set to be recommended, generates the target ranking list, and sends it to the terminal 730; the terminal 730 then sends the feedback operation information on the recommended contents in the target ranking list back to the operation recording module 725.
In some embodiments, the training frequency of the offline model training module may be preset, for example, training the model parameters once a day; or it may be determined according to the activity of the target account, that is, in response to the activity of the target account meeting a preset activity requirement, parameter training is performed at a first preset training frequency, and in response to the activity of the target account not meeting the preset activity requirement, parameter training is performed at a second preset training frequency, the first preset training frequency being higher than the second. Optionally, the activity may be determined from information such as the application online duration and application use frequency of the target account.
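The following Python sketch illustrates such an activity-dependent training cadence; the activity thresholds and both cadences are illustrative assumptions:

```python
def pick_training_frequency(online_hours_per_day: float,
                            uses_per_day: int) -> str:
    """Return a training cadence from account activity; the activity
    test and the two cadences are assumptions, not prescribed values."""
    is_active = online_hours_per_day >= 1.0 or uses_per_day >= 5
    # first preset frequency (active accounts) > second preset frequency
    return "every 6 hours" if is_active else "once a day"
```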
In summary, with the content recommendation method provided in the embodiments of the present application, a target weight search model is trained according to the state transition condition corresponding to the target data of a target account. When content recommendation needs to be performed for the target account, the target weight relationship between evaluation indexes is determined from the current state information of the target account through the target weight search model, at least two evaluation indexes are fused through the target weight relationship to determine the recommended content, and the recommended content is pushed to the terminal corresponding to the target account. In other words, because the weight search model is trained on the changes of the account state of the target account under the historical weight relationship, as indicated by the target data, maximizing the long-term satisfaction of the user can be taken as the optimization target, and the weight search model used for determining the evaluation-index weights can provide a more accurate weight relationship.
In the embodiment of the application, the training samples are generated from the historical viewing behavior of users, which reduces the cost of obtaining user feedback, and the weight search based on the reinforcement learning model can increase the usage duration of the application and the number of account interaction behaviors. Meanwhile, a closed loop of offline training, online real-time recommendation and data acquisition is adopted, so that the model can be tuned according to the latest preferences of the user, guaranteeing the real-time performance of the recommendation system.
Referring to fig. 8, a block diagram of a device for recommending content according to an exemplary embodiment of the present application is shown, where the device includes the following modules:
an obtaining module 810, configured to obtain target data of a target account;
a training module 820, configured to train parameters of a weight search model based on a state transition condition corresponding to the target data, to obtain candidate model parameters, where the state transition condition is used to indicate account state change of the target account under a historical weight relationship;
a determining module 830, configured to search, based on the current state information of the target account, weights corresponding to at least two evaluation indexes through a target weight search model to obtain a target weight relationship between the at least two evaluation indexes, where a model parameter of the target weight search model is the candidate model parameter, and the evaluation index is an index indicating that a recommendation condition of recommended content is predicted;
the recommending module 840 is configured to fuse the at least two evaluation indexes according to the target weight relationship to obtain recommended content to be pushed to the target account.
In some optional embodiments, as shown in fig. 9, the training module 820 further includes:
a generating unit 821, configured to generate state transition data according to the target data, where the state transition data is used to indicate the change condition of the historical account state of the target account, the change condition of the historical account state is determined after content recommendation is performed on the historical state information under the historical weight relationship, and the historical state information and the historical weight relationship correspond to the same historical period;
a training unit 822, configured to train parameters of the weight search model based on the state transition data, so as to obtain the candidate model parameters.
In some optional embodiments, the weight search model comprises a variational self-coding network, the state transition data comprises the historical state information and the historical weight relationship;
the training unit 822 is further configured to train parameters of the variational self-coding network based on the historical state information and the historical weight relationship to obtain a first network parameter;
the generating unit 821 is further configured to generate n first weight relationships corresponding to the historical state information based on a variational self-encoding network under the first network parameter, where the first weight relationships conform to data distribution of the state transition data, and n is a positive integer;
the training unit 822 is further configured to train parameters of the weight search model based on the n first weight relationships and the historical state information, so as to obtain the candidate model parameters.
In some optional embodiments, the generating unit 821 is further configured to input the historical state information to the variational self-coding network, and output a first training weight relationship;
the training unit 822 is further configured to perform iterative training on the parameter of the variational self-coding network based on a difference between the first training weight relationship and the historical weight relationship, so as to obtain the first network parameter.
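As a sketch of this iterative training, the loss below combines the reconstruction difference named in the embodiment with a KL regularizer, which is a conventional choice for a variational self-coding network rather than a detail spelled out here:

```python
import torch
import torch.nn.functional as F

def vae_loss(predicted_weights: torch.Tensor,
             historical_weights: torch.Tensor,
             mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Difference between the first training weight relationship and the
    historical weight relationship, plus the usual KL term keeping the
    latent code close to a standard normal distribution (an assumption)."""
    recon = F.mse_loss(predicted_weights, historical_weights)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```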
In some optional embodiments, the weight search model further comprises a disturbance network and an evaluation network, the disturbance network is used for determining the weight relationship with the largest accumulated reward information, the evaluation network is used for reducing the time difference error of the weight relationship output by the weight search model, and the disturbance network corresponds to the disturbance initial parameter;
the generating unit 821 is further configured to generate a disturbance weight relationship corresponding to the n first weight relationships based on the disturbance initial parameter;
the training unit 822 is further configured to train parameters of the evaluation network based on the disturbance weight relationship and the state transition data to obtain second network parameters;
the generating unit 821 is further configured to determine a third network parameter corresponding to the disturbance network based on the second network parameter and the historical state information;
the training unit 822 is further configured to train parameters of the weight search model based on the second network parameter and the third network parameter, so as to obtain the candidate model parameters.
In some optional embodiments, the generating unit 821 is further configured to input the historical status information to a variational self-coding network under the first network parameter, and output a second weight relationship;
the training unit 822 is further configured to train the disturbance initial parameter of the disturbance network based on the second network parameter, the historical state information, and the second weight relationship, so as to obtain a third network parameter.
In some optional embodiments, the state transition data further includes reward information determined by account state changes obtained by content recommendation based on the historical weight relationship, and next state information used for indicating state information corresponding to the target account after content recommendation is performed by the historical weight relationship;
the generating unit 821 is further configured to substitute the disturbance weight relationship, the reward information, and the next state information into an evaluation optimization function to obtain an optimization parameter of the evaluation network;
the generating unit 821 is further configured to obtain cumulative reward information corresponding to the historical weight relationship and the second weight relationship;
the training unit 822 is further configured to train the parameters of the evaluation network based on the difference between the optimized parameter and the cumulative reward information, so as to obtain the second network parameter.
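A hedged PyTorch sketch of this evaluation-network update follows; the soft-min combination over the dual networks, the coefficients gamma and lam, and the convention that next_action comes from the perturbed generator are common BCQ choices assumed here, not values given by the embodiment:

```python
import torch
import torch.nn.functional as F

def evaluation_loss(q1, q2, q1_tgt, q2_tgt,
                    state, action, reward, next_state, next_action,
                    gamma: float = 0.99, lam: float = 0.75) -> torch.Tensor:
    """Substitute the perturbed next action, reward and next-state
    information into the evaluation optimization function to get the
    optimization target, then train both evaluation networks on the
    difference from their current cumulative-reward estimates."""
    with torch.no_grad():
        t1 = q1_tgt(next_state, next_action)
        t2 = q2_tgt(next_state, next_action)
        target = reward + gamma * (lam * torch.min(t1, t2)
                                   + (1.0 - lam) * torch.max(t1, t2))
    return (F.mse_loss(q1(state, action), target)
            + F.mse_loss(q2(state, action), target))
```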
In some optional embodiments, the candidate model parameters are parameters of the candidate i-th weight search model, i being a positive integer;
the device further comprises: and an updating module 850, configured to update parameters of the weight search model of the i-1 st time according to the candidate model parameters, so as to obtain the target weight search model.
In some optional embodiments, the update module 850 further includes:
a first obtaining unit 851, configured to obtain test weight distribution corresponding to a candidate weight search model, where the candidate weight search model is a weight search model formed by candidate model parameters;
the first obtaining unit 851 is further configured to obtain historical weight distribution, where the historical weight distribution corresponds to the weight search model of the i-1 st time;
a determining unit 852, configured to update parameters of the weight search model of the i-1 th time through the candidate model parameters in response to that a difference between the test weight distribution and the historical weight distribution meets a constraint requirement, so as to obtain the target weight search model.
In some optional embodiments, the determining unit 852 is further configured to take the i-1 st weight search model as the target weight search model in response to a failure in matching the difference between the test weight distribution and the historical weight distribution with the constraint requirement.
In some optional embodiments, the recommending module 840 further includes:
a second obtaining unit 841, configured to obtain the prediction scores corresponding to the at least two evaluation indexes;
a sorting unit 842, configured to sort the candidate recommended content sets based on the target weight relationships and the prediction scores corresponding to the at least two evaluation indexes, so as to obtain a target sorted list;
a pushing unit 843, configured to send the target sorting queue to a terminal corresponding to the target account.
In some optional embodiments, the sorting unit 842 is further configured to determine a function parameter corresponding to a target proxy function based on the target weight relationship and the predicted score corresponding to the at least two evaluation indexes, where the proxy function is configured to determine a content recommendation score corresponding to the candidate recommended content;
the sorting unit 842 is further configured to determine recommendation scores of candidate recommended contents in the candidate recommended content set based on the target proxy function;
the sorting unit 842 is further configured to generate the target sorted list based on the recommendation scores of the candidate recommended contents.
In some optional embodiments, the apparatus further comprises:
a feedback module 860, configured to receive feedback operation information of the recommended content from the target account;
the feedback module 860 is further configured to generate the target data based on the feedback operation information;
the feedback module 860 is further configured to store the target data and the target account into a database correspondingly.
In summary, with the content recommendation device provided in this embodiment of the present application, a target weight search model is trained according to the state transition condition corresponding to the target data of a target account. When content recommendation needs to be performed for the target account, the target weight relationship between evaluation indexes is determined from the current state information of the target account through the target weight search model, at least two evaluation indexes are fused through the target weight relationship to determine the recommended content, and the recommended content is pushed to the terminal corresponding to the target account. In other words, because the weight search model is trained on the changes of the account state of the target account under the historical weight relationship, as indicated by the target data, maximizing the long-term satisfaction of the user can be taken as the optimization target, and the weight search model used for determining the evaluation-index weights can provide a more accurate weight relationship.
It should be noted that: the content recommendation apparatus provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the content recommendation device and the content recommendation method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments, and are not described herein again.
Fig. 10 shows a schematic structural diagram of a server provided in an exemplary embodiment of the present application. Specifically, the structure includes the following.
The server 1000 includes a Central Processing Unit (CPU) 1001, a system Memory 1004 including a Random Access Memory (RAM) 1002 and a Read Only Memory (ROM) 1003, and a system bus 1005 connecting the system Memory 1004 and the Central Processing Unit 1001. The server 1000 also includes a mass storage device 1006 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The mass storage device 1006 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1006 and its associated computer-readable media provide non-volatile storage for the server 1000. That is, the mass storage device 1006 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1006 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1000 may also operate by connecting, through a network such as the Internet, to a remote computer on the network. That is, the server 1000 may be connected to the network 1012 through the network interface unit 1011 connected to the system bus 1005, or the network interface unit 1011 may be used to connect to another type of network or a remote computer system (not shown).
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
Embodiments of the present application further provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored in the memory and is loaded and executed by the processor to implement the content recommendation method provided by the above method embodiments. Optionally, the computer device may be a terminal or a server.
Embodiments of the present application further provide a computer-readable storage medium having at least one instruction, at least one program, a code set, or an instruction set stored thereon, loaded and executed by a processor to implement the content recommendation method provided by the above method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the content recommendation method described in any of the above embodiments.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (17)

1. A method for recommending content, the method comprising:
acquiring target data of a target account;
training parameters of a weight search model based on a state transition condition corresponding to the target data to obtain candidate model parameters, wherein the state transition condition is used for indicating account state changes of the target account under a historical weight relationship;
searching weights corresponding to at least two evaluation indexes through a target weight search model based on the current state information of the target account to obtain a target weight relation between the at least two evaluation indexes, wherein model parameters of the target weight search model are candidate model parameters, and the evaluation indexes are indexes indicating that the recommendation condition of recommended content is predicted;
and fusing the at least two evaluation indexes through the target weight relationship to obtain recommended content pushed to the target account.
2. The method of claim 1, wherein the training parameters of the weight search model based on the state transition condition corresponding to the target data to obtain candidate model parameters comprises:
generating state transition data according to the target data, wherein the state transition data is used for indicating the change condition of the historical account state of the target account, the change condition of the historical account state is determined after content recommendation is carried out on the historical state information under the historical weight relationship, and the historical state information and the historical weight relationship correspond to the same historical time period;
and training the parameters of the weight search model based on the state transition data to obtain the candidate model parameters.
3. The method of claim 2, wherein the weight search model comprises a variational self-coding network, and wherein the state transition data comprises the historical state information and the historical weight relationship;
training the parameters of the weight search model based on the state transition data to obtain the candidate model parameters, including:
training parameters of the variational self-coding network based on the historical state information and the historical weight relationship to obtain first network parameters;
generating n first weight relations corresponding to the historical state information based on the variational self-coding network under the first network parameter, wherein the first weight relations accord with the data distribution of the state transition data, and n is a positive integer;
and training parameters of the weight search model based on the n first weight relationships and the historical state information to obtain the candidate model parameters.
4. The method of claim 3, wherein training the parameters of the variational self-coding network based on the historical state information and the historical weight relationship to obtain first network parameters comprises:
inputting the historical state information into the variational self-coding network, and outputting to obtain a first training weight relationship;
and performing iterative training on the parameters of the variational self-coding network based on the difference between the first training weight relationship and the historical weight relationship to obtain the first network parameters.
5. The method according to claim 3, wherein the weight search model further comprises a perturbation network and an evaluation network, the perturbation network is used for determining the weight relationship with the maximum cumulative reward information, the evaluation network is used for reducing the time difference error of the weight relationship output by the weight search model, and the perturbation network corresponds to perturbation initial parameters;
the training the parameters of the weight search model based on the n first weight relationships and the historical state information to obtain the candidate model parameters includes:
generating disturbance weight relations corresponding to the n first weight relations based on the disturbance initial parameters;
training the parameters of the evaluation network based on the disturbance weight relationship and the state transition data to obtain second network parameters;
determining a third network parameter corresponding to the disturbance network based on the second network parameter and the historical state information;
and training the parameters of the weight search model based on the second network parameters and the third network parameters to obtain the candidate model parameters.
6. The method of claim 5, wherein determining a third network parameter corresponding to the perturbation network based on the second network parameter and the historical state information comprises:
inputting the historical state information into a variation self-coding network under the first network parameter, and outputting a second weight relation;
training the disturbance initial parameter of the disturbance network based on the second network parameter, the historical state information and the second weight relation to obtain a third network parameter.
7. The method according to claim 6, wherein the state transition data further includes reward information determined by account state change obtained by content recommendation based on the historical weight relationship and next state information indicating state information corresponding to the target account after content recommendation based on the historical weight relationship;
training the parameters of the evaluation network based on the disturbance weight relationship and the state transition data to obtain second network parameters, including:
substituting the disturbance weight relationship, the reward information and the next state information into an evaluation optimization function to obtain an optimization parameter of the evaluation network;
acquiring accumulated reward information corresponding to the historical weight relationship and the second weight relationship;
training the parameters of the evaluation network based on the difference between the optimized parameters and the cumulative reward information to obtain the second network parameters.
8. The method according to any one of claims 1 to 7, wherein the candidate model parameters are parameters of an ith weight search model of the candidate, i is a positive integer;
before the searching for the weights corresponding to the at least two evaluation indexes through the target weight search model based on the current state information of the target account to obtain the target weight relationship between the at least two evaluation indexes, the method further includes:
and updating the parameters of the (i-1) th weight search model through the candidate model parameters to obtain the target weight search model.
9. The method according to claim 8, wherein the updating parameters of the i-1 st weight search model by the candidate model parameters to obtain the target weight search model comprises:
obtaining test weight distribution corresponding to a candidate weight search model, wherein the candidate weight search model is a weight search model formed by candidate model parameters;
obtaining historical weight distribution, wherein the historical weight distribution corresponds to the weight search model of the (i-1) th time;
and in response to that the difference between the test weight distribution and the historical weight distribution meets the constraint requirement, updating the parameters of the weight search model for the (i-1) th time through the candidate model parameters to obtain the target weight search model.
10. The method of claim 9, further comprising:
in response to a failure of a match between the difference between the test weight distribution and the historical weight distribution and the constraint requirement, taking the i-1 st weight search model as the target weight search model.
11. The method according to any one of claims 1 to 7, wherein the fusing the at least two evaluation indexes through the target weight relationship to obtain the recommended content to be pushed to the target account includes:
acquiring the prediction scores corresponding to the at least two evaluation indexes;
based on the target weight relation and the prediction score corresponding to the at least two evaluation indexes, sorting the candidate recommended content set to obtain a target sorting list;
and issuing the target sorting queue to a terminal corresponding to the target account.
12. The method of claim 11, wherein the ranking the set of candidate recommended content based on the target weight relationship and the prediction score corresponding to the at least two evaluation indicators to obtain a target ranking list comprises:
determining a function parameter corresponding to a target proxy function based on the target weight relationship and the prediction score corresponding to the at least two evaluation indexes, wherein the proxy function is used for determining a content recommendation score corresponding to candidate recommended content;
determining recommendation scores for candidate recommended content in the set of candidate recommended content based on the target proxy function;
generating the target ranking list based on the recommendation scores of the candidate recommended content.
13. The method of any of claims 1 to 7, further comprising:
receiving feedback operation information of the target account on the recommended content;
generating the target data based on the feedback operation information;
and correspondingly storing the target data and the target account into a database.
14. An apparatus for recommending contents, said apparatus comprising:
the acquisition module is used for acquiring target data of a target account;
the training module is used for training parameters of the weight search model based on a state transition condition corresponding to the target data to obtain candidate model parameters, wherein the state transition condition is used for indicating account state change of the target account under a historical weight relationship;
the determining module is used for searching weights corresponding to at least two evaluation indexes through a target weight searching model based on the current state information of the target account to obtain a target weight relation between the at least two evaluation indexes, wherein model parameters of the target weight searching model are candidate model parameters, and the evaluation indexes are indexes indicating that the recommendation condition of recommended content is predicted;
and the recommendation module is used for fusing the at least two evaluation indexes through the target weight relationship to obtain recommended content pushed to the target account.
15. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of recommending content according to any of claims 1 to 13.
16. A computer-readable storage medium having at least one program code stored therein, the program code being loaded and executed by a processor to implement the method for recommending contents according to any one of claims 1 to 13.
17. A computer program product comprising a computer program/instructions stored in a computer-readable storage medium, wherein the computer program/instructions are read by a processor of a computer device from the computer-readable storage medium, and the processor executes the computer program/instructions to cause the computer device to execute to implement the recommendation method for content according to any one of claims 1 to 13.