CN113449176A - Recommendation method and device based on knowledge graph

Publication number: CN113449176A
Application number: CN202010216920.2A
Authority: CN (China)
Legal status: Pending
Inventors: 刘青, 唐睿明, 何秀强, 周思锦, 张伟楠, 俞勇
Original and current assignee: Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Original language: Chinese (zh)

Classifications

    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/35 Clustering; Classification
    • G06F16/367 Ontology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The application discloses a knowledge-graph-based recommendation method in the field of artificial intelligence, which comprises the following steps: acquiring a plurality of historical click objects and a neighbor object of each of the plurality of historical click objects; acquiring current state parameters according to the plurality of historical click objects and the neighbor object of each historical click object; determining a candidate recommendation object set according to the knowledge graph, all objects and the plurality of historical click objects; and calculating according to the current state parameters to obtain an expected value of each candidate recommendation object in the candidate recommendation object set, and determining the candidate recommendation object corresponding to the maximum expected value as the target recommendation object. The recommendation method is applicable to various recommendation-related application scenarios, such as APP recommendation in an APP application market, audio/video recommendation on audio/video websites, and information recommendation on information platforms. With the method and device, recommendation efficiency and accuracy are improved.

Description

Recommendation method and device based on knowledge graph
Technical Field
The application relates to the field of artificial intelligence, in particular to a recommendation method and device based on a knowledge graph.
Background
Recommendation and search are among the important research directions in the field of artificial intelligence. In building a personalized recommendation system, the most important goal is to accurately predict a user's demand for, or degree of preference for, a specific item and to make the corresponding recommendation based on that judgment; this affects not only the user experience but also directly the revenue of the related enterprise products, e.g., usage frequency, downloads or click volume. Predicting user behavior demands or preferences is therefore of great significance. At present, the basic and mainstream prediction methods are recommendation system models based on supervised learning. The main problems of recommendation systems modeled on supervised learning are as follows: (1) supervised learning treats recommendation as a static prediction process in which the user's interests do not change over time, but recommendation should in fact be a dynamic sequential decision process in which the user's interests may change over time; (2) supervised learning maximizes the instant reward of the recommendation result, such as click-through rate, whereas in many cases items with a small instant reward but a large future reward should be considered.
In recent years, reinforcement learning has made great breakthroughs in many dynamic-interaction and long-term-planning scenarios, such as unmanned driving and games. Conventional reinforcement learning methods include value-based methods and policy-based methods. In a value-based method, a Q function is first obtained through training; the Q values of all objects to be recommended are then calculated according to the current state; and finally the action object with the maximum Q value is selected for recommendation. In a policy-based method, a policy function is first obtained through training; the policy then determines, according to the current state, the optimal action object to recommend. When a recommendation system learned with either a value-based or a policy-based reinforcement learning method performs recommendation, all action recommendation objects need to be traversed and the relevant probability value of each object to be recommended must be calculated, which is time-consuming and inefficient.
Disclosure of Invention
The embodiments of the application provide a recommendation method and device based on a knowledge graph. Current state parameters are obtained from the historical click objects and the neighbor objects of the historical click objects acquired based on the knowledge graph, so that the current state parameters contain not only the user's historical preferences but also the information of neighbor nodes; this enlarges the representation range and capability of the model, captures the user's interests more accurately, and improves the accuracy of the recommendation result. The candidate recommendation object set is obtained from all objects based on the knowledge graph and the historical click objects, so that traversing all objects during recommendation is avoided and recommendation efficiency is improved.
In a first aspect, an embodiment of the present application provides a knowledge graph-based recommendation method, including:
acquiring a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects, wherein the first neighbor object of each historical click object comprises the objects adjacent to the historical click object within k-th order on a knowledge graph, k being an integer greater than 0; the order k of a neighbor object within k-th order of the historical click object is used for representing the association degree between that neighbor object and the historical click object, and the smaller the value of k, the higher the corresponding association degree; acquiring current state parameters according to the plurality of historical click objects and the first neighbor object of each historical click object; determining a candidate recommendation object set according to the knowledge graph, all objects and the plurality of historical click objects, wherein the candidate recommendation object set comprises the plurality of historical click objects and second neighbor objects, the second neighbor objects comprise the objects in the knowledge graph adjacent to each of the plurality of historical click objects within m-th order, m being an integer greater than 0, and all objects comprise the plurality of historical click objects and the first neighbor object of each historical click object; and calculating according to the current state parameters to obtain an expected value of each candidate recommendation object in the candidate recommendation object set, and determining the candidate recommendation object corresponding to the maximum expected value as the target recommendation object.
The historical click object is an object which is recommended to the user before and clicked by the user. For example, the object recommended to the user is an application, and the user clicks to download the application or clicks to view details of the application, then the application is a history clicked object.
Optionally, the expected value may be an expected profit, a Q value, or a probability.
The current state parameters, obtained based on the historical click objects and their first neighbor objects acquired according to the knowledge graph, can better capture the user's interests and preferences, improving the accuracy of the recommendation result; and an effective candidate set of recommendation objects is screened out based on the knowledge graph, so that the whole action space does not need to be traversed, improving recommendation efficiency.
In one possible embodiment, obtaining the current state parameter according to a plurality of historical click objects and the first neighbor object of each historical click object thereof includes:
calculating a target feature vector of each historical click object according to each of the plurality of historical click objects and the first neighbor object of that historical click object; and acquiring the current state parameters according to the target feature vectors of the plurality of historical click objects.
Further, the obtaining of the target feature vector of the historical click object by calculation according to each historical click object of the plurality of historical click objects and the first neighbor object of the historical click object includes:
mapping each historical click object and a first neighbor object thereof in a plurality of historical click objects into a vector respectively to obtain a feature vector of the historical click object and a feature vector of the first neighbor object thereof; calculating according to the feature vector of the first neighbor object of each historical click object in a plurality of historical click objects to obtain the neighbor feature vector of the historical click object; and calculating a target characteristic vector of each historical click object according to the characteristic vector and the neighbor characteristic vector of each historical click object in the plurality of historical click objects.
The neighbor feature vector of any historical click object v among the plurality of historical click objects is:

$$e_{N(v)} = \frac{1}{|N(v)|} \sum_{u \in N(v)} e_u$$

wherein N(v) represents the objects adjacent to the historical click object v within k-th order on the knowledge graph, namely the first neighbor objects of the historical click object v; |N(v)| is the number of objects adjacent to the historical click object v within k-th order on the knowledge graph; and $e_u$ is the feature vector of any neighbor object u among those objects.

After the neighbor feature vector of the historical click object v is obtained, the target feature vector of the historical click object v is obtained according to the feature vector $e_v$ of the historical click object v and its neighbor feature vector $e_{N(v)}$. The target feature vector of the historical click object v is:

$$e_v' = \sigma\!\left(W_k \cdot e_v + B_k \cdot e_{N(v)}\right)$$

wherein $e_v$ is the feature vector of the historical click object v, $W_k$ and $B_k$ are weights, and $\sigma$ is the activation function.
In a possible embodiment, obtaining the current state parameter according to a plurality of historical click objects and the first neighbor object of each historical click object thereof includes:
inputting each historical click object and a first neighbor object thereof in a plurality of historical click objects into a state generation model for calculation to obtain a current state parameter;
the method for calculating the current state parameter includes the steps that a state generation model comprises a graph convolution network GCN and a gating circulation unit GRU, a plurality of historical click objects, each historical click object in the plurality of historical click objects and a first neighbor object of each historical click object are input into the state generation model for calculation, and the current state parameter is obtained through the steps that:
inputting each historical click object in a plurality of historical click objects and a first neighbor object of the historical click object into a GCN for calculation to obtain a target feature vector of the historical click object; and inputting the target characteristic vectors of the plurality of historical click objects into the GRU for calculation so as to obtain the current state parameters.
In a possible embodiment, the calculating the expected value of each candidate recommendation object in the candidate recommendation object set according to the current state parameter includes:
inputting the current state parameters and each candidate recommended object in the candidate recommended object set into a selection strategy model for calculation to obtain an expected value of each candidate recommended object,
wherein the selection policy model comprises a Value network or an Advantage network.
When the expected value is a Q value, the Q value of the candidate recommendation object can be expressed as:

$$Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$$

wherein $Q^*(s_{t+1}, a_{t+1})$ represents the long-term benefit available to the candidate recommendation object in the future, $\gamma$ represents the discount rate applied to that future long-term benefit, $r_t$ indicates the immediate benefit of the candidate recommendation object, $s_t$ is the current state parameter, $s_{t+1}$ is the state parameter at the next moment, $a_t$ is the current candidate recommendation object, and $a_{t+1}$ is the candidate recommendation object at the next moment.
In a possible embodiment, before obtaining the plurality of historical click objects and the first neighbor object of each historical click object in the plurality of historical click objects, the method of the present application further includes:
and acquiring a state generation model and a selection strategy model, wherein the state generation model and the selection strategy model are obtained through joint training.
In a possible embodiment, the state generating model and the selection policy model are obtained by joint training, which specifically includes:
inputting a plurality of object samples into the state generation model for calculation to obtain training state parameters; obtaining a plurality of candidate object samples according to the plurality of object samples; inputting the plurality of candidate object samples and the training state parameters into the selection policy model for calculation to obtain an expected value of each of the candidate object samples, and determining the candidate object sample corresponding to the maximum expected value as the target object sample; and determining a loss value according to the recommended objects corresponding to the plurality of object samples, the target object sample and a loss function, and adjusting parameters in the state generation model and the selection policy model according to the loss value.
The state generation model and the selection strategy model are obtained through the joint training, and the model training efficiency and the model precision are improved.
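As a concrete illustration, the following is a minimal, hedged sketch of such a joint training step. It assumes PyTorch, reduces the state generation model to a plain GRU standing in for the GCN + GRU combination, reduces the selection policy model to a linear Q head, and uses a cross-entropy loss that pushes the expected value of the actually clicked sample to be the largest; all names and dimensions are illustrative, not the patented implementation.

```python
import torch

# Assumed stand-ins: GRU for the state generation model (GCN + GRU),
# a linear Q head for the selection policy model.
FEAT, HIDDEN = 16, 32
state_model = torch.nn.GRU(input_size=FEAT, hidden_size=HIDDEN, batch_first=True)
policy_model = torch.nn.Linear(HIDDEN + FEAT, 1)
optimizer = torch.optim.Adam(
    list(state_model.parameters()) + list(policy_model.parameters()), lr=1e-3)

def train_step(object_samples, candidate_samples, clicked_index):
    # object_samples: (1, n, FEAT); candidate_samples: list of (FEAT,) tensors
    _, h = state_model(object_samples)              # training state parameter
    state = h[-1]                                   # (1, HIDDEN)
    q = torch.cat([policy_model(torch.cat([state, c.view(1, -1)], dim=1))
                   for c in candidate_samples], dim=1)  # (1, num_candidates)
    loss = torch.nn.functional.cross_entropy(
        q, torch.tensor([clicked_index]))           # loss from the clicked object
    optimizer.zero_grad()
    loss.backward()                                 # adjusts both models jointly
    optimizer.step()
    return q.argmax(dim=1).item(), loss.item()      # target object sample, loss
```

One optimizer over the parameters of both models is what makes the training joint: a single loss value back-propagates through the selection policy model into the state generation model.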
In a second aspect, an embodiment of the present application provides a knowledge-graph-based recommendation apparatus, including:
the system comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects, and the first neighbor object of each historical click object comprises an object which is adjacent to the historical click object in the k order on a knowledge graph; k is an integer greater than 0; acquiring current state parameters according to a plurality of historical click objects and a first neighbor object of each historical click object;
a determining unit, configured to determine a set of candidate recommended objects according to the knowledge graph, all the objects, and the plurality of historical click objects, where the set of candidate recommended objects includes the plurality of historical click objects and a second neighbor object that includes an object in the knowledge graph that is adjacent to each of the plurality of historical click objects within m-th order; m is an integer greater than 0; all the objects comprise a plurality of historical click objects and a first neighbor object of each historical click object;
a calculating unit, configured to calculate an expected value of each candidate recommended object in the candidate recommended object set according to the current state parameter,
and the determining unit is also used for determining the candidate recommended object corresponding to the maximum expected value as the target recommended object.
Optionally, the expected value may be an expected profit, a Q value, or a probability.
In a possible embodiment, in terms of obtaining the current state parameter according to the plurality of historical click objects and the first neighbor object of each historical click object thereof, the obtaining unit is specifically configured to:
calculating a target feature vector of each historical click object according to each of the plurality of historical click objects and the first neighbor object of that historical click object; and acquiring the current state parameters according to the target feature vectors of the plurality of historical click objects.
In a possible embodiment, in the aspect of calculating the target feature vector of the historical click object according to each of the plurality of historical click objects and the first neighbor object of that historical click object, the obtaining unit is specifically configured to:
mapping each historical click object and a first neighbor object thereof in a plurality of historical click objects into a vector respectively to obtain a feature vector of the historical click object and a feature vector of the first neighbor object thereof; calculating according to the feature vector of the first neighbor object of each historical click object in a plurality of historical click objects to obtain the neighbor feature vector of the historical click object; and calculating a target characteristic vector of each historical click object according to the characteristic vector and the neighbor characteristic vector of each historical click object in the plurality of historical click objects.
In a possible embodiment, in obtaining the current state parameter according to the plurality of historical click objects and the first neighbor object of each historical click object in the plurality of historical click objects, the obtaining unit is configured to:
inputting a plurality of historical click objects and a first neighbor object of each historical click object into a state generation model for calculation to obtain a current state parameter;
the method for calculating the current state parameter includes the steps that a state generation model comprises a graph convolution network GCN and a gating circulation unit GRU, a plurality of historical click objects, each historical click object in the plurality of historical click objects and a first neighbor object of each historical click object are input into the state generation model for calculation, and the current state parameter is obtained through the steps that:
inputting each historical click object in a plurality of historical click objects and a first neighbor object of the historical click object into a GCN for calculation to obtain a target feature vector of the historical click object; and inputting the target characteristic vectors of the plurality of historical click objects into the GRU for calculation so as to obtain the current state parameters.
In a possible embodiment, the computing unit is specifically configured to:
and inputting the current state parameters and each candidate recommendation object in the candidate recommendation object set into a selection policy model for calculation to obtain the expected value of the candidate recommendation object, wherein the selection policy model comprises a Value network or an Advantage network.
In a possible embodiment, the obtaining unit is further configured to:
before obtaining a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects, obtaining a state generation model and a selection strategy model, wherein the state generation model and the selection strategy model are obtained through joint training.
In a possible embodiment, the state generating model and the selection policy model are obtained by joint training, which specifically includes:
inputting a plurality of object samples into the state generation model for calculation to obtain training state parameters; obtaining a plurality of candidate object samples according to the plurality of object samples; inputting the plurality of candidate object samples and the training state parameters into the selection policy model for calculation to obtain an expected value of each of the candidate object samples, and determining the candidate object sample corresponding to the maximum expected value as the target object sample; and determining a loss value according to the recommended objects corresponding to the plurality of object samples, the target object sample and a loss function, and adjusting parameters in the state generation model and the selection policy model according to the loss value.
In a third aspect, an embodiment of the present application provides another recommendation apparatus, including:
a memory to store instructions; and
a processor coupled with the memory;
wherein the instructions, when executed by the processor, cause the processor to perform some or all of the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a chip system, where the chip system is applied to an electronic device; the chip system comprises one or more interface circuits, and one or more processors; the interface circuit and the processor are interconnected through a line; the interface circuit is to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device performs part or all of the method according to the first aspect.
In a fifth aspect, the present application provides a computer storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform some or all of the method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, which includes computer instructions that, when executed on a recommendation apparatus, cause the recommendation apparatus to perform part or all of the method according to the first aspect.
It can be seen that in the scheme of the embodiments of the application, the current state parameters are obtained based on the historical click objects and their neighbor objects, so that the current state parameters contain not only the user's historical behavior preferences but also the information of the neighbor objects, which enlarges the representation range and capability of the model and captures the user's interests more accurately. An effective candidate recommendation object set is screened out based on the knowledge graph, so that the whole action space does not need to be traversed, which improves both training efficiency and prediction efficiency. Sampling efficiency is also improved: by increasing the utilization of samples, the model can be trained to the same effect as other models with fewer samples.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1a is a schematic view of a knowledge graph;
FIG. 1b is a schematic diagram of a reinforcement learning based recommendation system framework according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a recommendation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of first and second order objects under a knowledge-graph;
fig. 4 is a schematic diagram of a GRU network according to an embodiment of the present application;
FIG. 5 is a schematic illustration of another knowledge-graph provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a recommendation device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another recommendation device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings.
First, the working principle of the recommendation method based on reinforcement learning is introduced. When a request triggered by a user is received, the reinforcement-learning-based recommendation system generates a recommendation system state parameter $s_t$ according to the request and the corresponding information, determines a recommendation object (such as an article to recommend) according to the state parameter, and sends the selected recommendation object to the user. After receiving the recommendation object, the user gives a certain behavior with respect to it (such as clicking or downloading), and the recommendation system generates a numerical value based on the behavior given by the user; this value is called the system reward value. Based on the reward value and the recommendation object, the next recommendation system state parameter $s_{t+1}$ is generated, and the system jumps from the current state parameter $s_t$ to the next state parameter $s_{t+1}$. This process repeats, so that the system's recommendation results finally fit the user's requirements better and better.
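To make this loop concrete, the following schematic sketch expresses it in Python. Every name here (`system`, `user` and their methods) is hypothetical; the point is only the $s_t$ → action → reward → $s_{t+1}$ cycle described above.

```python
# A schematic sketch of the reinforcement-learning recommendation loop.
# All objects and methods are hypothetical placeholders.
def run_episode(system, user, rounds):
    state = system.state_from_request(user.request())   # s_t
    for t in range(rounds):
        obj = system.recommend(state)                    # pick a recommendation object
        behavior = user.react(obj)                       # click, download, ignore...
        reward = system.reward_from(behavior)            # system reward value
        state = system.next_state(state, obj, reward)    # jump to s_{t+1}
```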
A common knowledge graph is a multi-relation graph, comprising multiple types of nodes and multiple types of edges; the nodes in a knowledge graph usually represent "entities" and the edges usually represent "relationships". FIG. 1a is a simple schematic diagram of a knowledge graph. As shown in FIG. 1a, the entities in the knowledge graph include "people" and "companies", connected by relationships such as "friend" and "position".
The recommendation method in the embodiments of the application can be applied to various application scenarios, such as mobile phone application markets, content recommendation on content platforms, unmanned driving, games and the like. APP recommendation in the mobile phone application market is taken as an example first. When a user opens the mobile phone application market, the application market is triggered to recommend applications to the user: it recommends one application or a group of applications (namely, the target recommendation objects) according to the user's historical actions such as downloading and clicking and the feature information of the user and the applications (namely, the recommendation system state parameters), where the features of an application include its type, developer information, development duration and the like. The user gives certain behaviors toward the applications recommended by the application market, and the reward value is obtained according to the user's behavior. The definition of the reward depends on the specific application scenario; for example, in the mobile application market, the reward value may be defined as the download volume, the click volume, or the amount paid by the user within the application. The goal of the application market is that, through reinforcement learning, the applications recommended by the system fit the users' requirements better and better, while the revenue of the application market is improved.
Referring to FIG. 1b, the present embodiment provides a recommendation system architecture 100. The data collection device 160 is configured to collect a plurality of training sample data from the network and store them in the database 130, and the training device 120 generates the state generation model/selection policy model 101 based on the training sample data maintained in the database 130. The calculation module 111 determines the current state parameters according to the state generation model in the state generation model/selection policy model 101, the plurality of historical click objects and the first neighbor object of each historical click object, obtains a plurality of candidate recommendation objects based on the plurality of historical click objects, determines a target recommendation object from the plurality of candidate recommendation objects according to the selection policy model and the current state parameters, and feeds back the target recommendation object to the user. How the training device 120 derives the state generation model/selection policy model 101 based on training sample data will be described in more detail below.
The models (including the state generation model/selection policy model 101) in the implementation of the present application may be implemented by a neural network, such as a fully-connected neural network or a deep neural network. The operation of each layer in a deep neural network can be described mathematically by

$$\vec{y} = a(W\vec{x} + b)$$

wherein W is the weight, $\vec{x}$ is the input vector (i.e., the input neurons), b is the bias data, $\vec{y}$ is the output vector (i.e., the output neurons), and a() is the activation function. Viewed at the physical level, the work of each layer in a deep neural network can be understood as completing the transformation of the input space into the output space (i.e., the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. ascending/descending dimensions; 2. zooming in/out; 3. rotating; 4. translating; 5. "bending". Operations 1, 2 and 3 are performed by $W\vec{x}$, operation 4 is completed by +b, and operation 5 is realized by a(). The expression "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of such things. W is a weight vector, each value in the vector representing the weight of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, i.e., the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a deep neural network is essentially learning how to control the space transformation, or more specifically, learning the weight matrices.
Because it is desirable that the output of the deep neural network be as close as possible to the value actually desired to be predicted, the weight vector of each layer can be updated by comparing the current network's predicted value with the desired target value and adjusting the weights according to the difference between them (of course, an initialization process usually precedes the first update, i.e., parameters are pre-configured for each layer of the deep neural network). It is therefore necessary to define in advance how to compare the difference between the predicted value and the target value; this is the role of loss functions or objective functions, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
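As a toy illustration of this paragraph (an assumed setup, not the patented models), the following compares a network's prediction with the desired value via a loss function and updates the weights by gradient descent to reduce that loss:

```python
import torch

# Illustrative network, optimizer and loss; all dimensions are assumptions.
net = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                          torch.nn.Linear(16, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)   # pre-initialized weights
loss_fn = torch.nn.MSELoss()                             # measures the difference

x, desired = torch.randn(4, 8), torch.randn(4, 1)
loss = loss_fn(net(x), desired)       # higher loss = larger difference
optimizer.zero_grad()
loss.backward()                       # gradients of the loss w.r.t. W and b
optimizer.step()                      # adjust weights to shrink the loss
```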
It should be noted that the state generation model includes a graph convolutional network (GCN) and a GRU network, and the state generation model and the selection policy model (i.e., the GCN, the GRU network and the selection policy model) are jointly trained. For example, the training device 120 inputs a plurality of historical click objects into the state generation model to obtain the user state parameters, and then acquires a target recommendation object from a plurality of candidate recommendation objects through the selection policy model and the user state parameters, where the candidate recommendation objects are obtained based on the plurality of historical click objects. The training device 120 computes a loss value according to a loss function from the target recommendation object and the click objects corresponding to the plurality of historical click objects, and adjusts the parameters in the GCN, the GRU network and the selection policy model based on the loss value. The above operations are repeated to train the state generation model and the selection policy model.
The state generative model/selection policy model 101 obtained by the training device 120 may be applied in different systems or devices. In fig. 1b, the execution device 110 is configured with an I/O interface 112, and performs data interaction with an external device, such as sending a target recommendation object to the user device 140, and the "user" can output a click object of the user to the I/O interface 112 through the user device 140.
The execution device 110 may call a plurality of historical click objects stored in the data storage system 150 to determine a target recommendation object, or may store the target recommendation object in the data storage system 150.
The calculation module 111 makes recommendations using the state generation model/selection policy model 101. Specifically, after obtaining a plurality of historical click objects, the calculation module 111 determines the current state parameters for the plurality of historical click objects through the state generation model, then inputs the current state parameters and the plurality of candidate recommendation objects into the selection policy model for calculation to obtain the Q value of each of the candidate recommendation objects, and determines the candidate recommendation object corresponding to the maximum Q value as the target recommendation object.
Finally, the I/O interface 112 returns the target recommendation object to the user device 140 for presentation to the user.
Further, the training device 120 may generate corresponding state generation models/selection policy models 101 based on different data for different goals to provide better results to the user.
In the case shown in fig. 1b, the user can view the target recommendation object output by the execution device 110 at the user device 140, and the specific presentation form may be a display, a sound, an action, and the like. The user device 140 may also be used as a data acquisition end to store the acquired training sample data in the database 130.
It should be noted that fig. 1b is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, etc. shown in the diagram does not constitute any limitation, for example, in fig. 1b, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
The training device 120 obtains samples from the database 130 to obtain recommendation information of one or more rounds, where the recommendation information includes target recommendation objects and user click objects, and trains the state generation model/selection policy model 101 according to the recommendation information of the one or more rounds.
In one possible embodiment, the training of the state generation model and the selection policy model described above is performed off-line, i.e., the training device 120 and the database are independent of the user device 140 and the execution device 110; for example, the training device 120 is a third-party server, and the state-generating model and the selection policy model are obtained from the third-party server before the execution device 110 works.
In one possible embodiment, the training device 120 is integrated with the performance device 110, and the performance device 110 is disposed in the user device 140.
After obtaining the state generation model and the selection policy model, the execution device 110 obtains a plurality of historical click objects from the data storage system 150, processes the plurality of historical click objects through the state generation model to generate the current state parameters, and then processes the current state parameters and a plurality of candidate recommendation objects through the selection policy model to obtain the target recommendation object recommended to the user. The user gives feedback on the target recommendation object, namely whether to click it; the object clicked by the user (i.e., the click object) is stored in the database 130, and may also be stored in the data storage system 150 by the execution device 110 for making the next recommendation.
In one possible embodiment, the recommendation system architecture described above includes only the database 130 and no data storage system 150. After the user device 140 receives the target recommendation object output by the execution device 110 through the I/O interface 112, the user device 140 stores the target recommendation object and the click object in the database 130 to train the state generating model/selection policy model 101.
Referring to fig. 2, fig. 2 is a schematic flowchart of a knowledge-graph-based recommendation method according to an embodiment of the present application.
As shown in fig. 2, the method includes:
s201, obtaining a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects.
Wherein the first neighbor object of each historical click object comprises objects that are neighbors on the knowledge graph within k-th order of the historical click object. The historical click object is an object which is recommended to the user before and clicked by the user. For example, the object recommended to the user is an application, and the user clicks to download the application or clicks to view details of the application, then the application is a history clicked object.
For example, the left diagram in FIG. 3 shows a knowledge graph of historical click object 0, and the right diagram in FIG. 3 shows the objects that are neighbors of historical click object 0 within second order. As shown in the right diagram of FIG. 3, the objects adjacent to historical click object 0 within first order include objects 1, 2, 3, 4 and 6, and the objects adjacent to historical click object 0 within second order include objects 6, 0, 7, 3, 2, 4, 12, 11, 9, 5, 1 and 8. The first-order neighbors of object 1 include objects 6 and 0; the first-order neighbors of object 2 include objects 0, 7 and 3; the first-order neighbors of object 3 include objects 2, 4 and 12; the first-order neighbors of object 4 include objects 3, 11, 9 and 5; and the first-order neighbors of object 6 include objects 1 and 8. As can be seen from the above description, a first-order neighbor of a first-order neighbor of historical click object 0 is a second-order neighbor of historical click object 0. The objects adjacent to historical click object 0 within second order in the knowledge graph include both the first-order and the second-order neighbors of historical click object 0.
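The neighborhood gathering described above can be sketched as a breadth-first traversal; the adjacency dictionary below reproduces the FIG. 3 example, while the function itself is only an assumed illustration, not the patented implementation.

```python
from collections import deque

# Gather all objects adjacent to object v within k-th order on the
# knowledge graph. `graph` maps each object to its first-order neighbors.
def neighbors_within_k(graph, v, k):
    visited, result = {v}, set()
    queue = deque([(v, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue                       # do not expand beyond order k
        for u in graph.get(node, ()):
            if u not in visited:
                visited.add(u)
                result.add(u)
                queue.append((u, depth + 1))
    return result

graph = {0: [1, 2, 3, 4, 6], 1: [6, 0], 2: [0, 7, 3],
         3: [2, 4, 12], 4: [3, 11, 9, 5], 6: [1, 8]}
print(neighbors_within_k(graph, 0, 1))  # {1, 2, 3, 4, 6}
print(neighbors_within_k(graph, 0, 2))  # additionally 7, 12, 11, 9, 5, 8
```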
The order k of a neighbor object within k-th order of the historical click object v is used for representing the association degree between that neighbor object and the historical click object v: the higher the value of k, the lower the corresponding association degree. For example, if the historical click object v is "Hacksaw Ridge", the first-order neighbor objects of v may include war films such as "Pearl Harbor" and "Flags of Our Fathers", and the second-order neighbor objects of v may include movies directed by Mel Gibson (Mel Columcille Gerard Gibson), the director of "Hacksaw Ridge", such as "Braveheart", "Apocalypto" and "The Passion of the Christ".
It should be noted here that the value of k may be set manually according to different situations. For example, the value of k may be reduced to reduce the number of neighbor objects, thereby reducing the time required for calculating the feature vectors of the neighbor objects and further improving recommendation efficiency.
S202, obtaining current state parameters according to each historical click object and the first neighbor object thereof in the plurality of historical click objects.
Optionally, a feature vector and a neighbor feature vector of each historical click object in a plurality of historical click objects are obtained, wherein the neighbor feature vector of each historical click object is obtained according to the feature vector of a first neighbor object of the historical click object; calculating a target characteristic vector of each historical click object according to the characteristic vector and the neighbor characteristic vector of each historical click object in a plurality of historical click objects; and calculating to obtain the current state parameters according to the target characteristic vectors of the plurality of historical click objects.
The neighbor feature vector of any historical click object v among the plurality of historical click objects is:

$$e_{N(v)} = \frac{1}{|N(v)|} \sum_{u \in N(v)} e_u$$

wherein N(v) represents the objects adjacent to the historical click object v within k-th order on the knowledge graph, namely the first neighbor objects of the historical click object v; |N(v)| is the number of objects adjacent to the historical click object v within k-th order on the knowledge graph; and $e_u$ is the feature vector of any neighbor object u among those objects. As can be seen, $e_{N(v)}$ is a vector fusing the information of the objects adjacent to the historical click object v within k-th order on the knowledge graph.

After the neighbor feature vector of the historical click object v is obtained, the target feature vector of the historical click object v is obtained according to the feature vector of v and its neighbor feature vector. Specifically, the feature vector of the historical click object v and its neighbor feature vector are input into a fully connected network for calculation to obtain the target feature vector of the historical click object v:

$$e_v' = \sigma\!\left(W_k \cdot e_v + B_k \cdot e_{N(v)}\right)$$

wherein $e_v$ is the feature vector of the historical click object v, $W_k$ and $B_k$ are weights in the fully connected network, and $\sigma$ is the activation function of the fully connected network.
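A minimal numeric sketch of these two formulas follows; the mean pooling and the $\sigma(W_k \cdot e_v + B_k \cdot e_{N(v)})$ combination mirror the text above, while the ReLU activation and the dimensions are illustrative assumptions.

```python
import numpy as np

def target_feature_vector(e_v, neighbor_feats, W_k, B_k):
    e_N = np.mean(neighbor_feats, axis=0)     # (1/|N(v)|) * sum over e_u
    z = W_k @ e_v + B_k @ e_N                 # fuse self and neighbor information
    return np.maximum(z, 0.0)                 # sigma assumed to be ReLU

rng = np.random.default_rng(0)
e_v = rng.normal(size=8)                      # feature vector of object v
neighbors = rng.normal(size=(5, 8))           # feature vectors of N(v), |N(v)| = 5
W_k, B_k = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
print(target_feature_vector(e_v, neighbors, W_k, B_k).shape)  # (8,)
```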
In one possible embodiment, obtaining the current state parameter according to a plurality of historical click objects and the first neighbor object of each historical click object thereof includes:
calculating a target feature vector of each historical click object according to each of the plurality of historical click objects and the first neighbor object of that historical click object; and acquiring the current state parameters according to the target feature vectors of the plurality of historical click objects.
Further, the obtaining of the target feature vector of the historical click object by calculation according to each historical click object of the plurality of historical click objects and the first neighbor object of the historical click object includes:
mapping each historical click object and a first neighbor object thereof in a plurality of historical click objects into a vector respectively to obtain a feature vector of the historical click object and a feature vector of the first neighbor object thereof; calculating according to the feature vector of the first neighbor object of each historical click object in a plurality of historical click objects to obtain the neighbor feature vector of the historical click object; and calculating a target characteristic vector of each historical click object according to the characteristic vector and the neighbor characteristic vector of each historical click object in the plurality of historical click objects.
In a possible embodiment, obtaining the current state parameter according to a plurality of historical click objects and the first neighbor object of each historical click object thereof includes:
inputting each historical click object and a first neighbor object thereof in a plurality of historical click objects into a state generation model for calculation to obtain a current state parameter;
the method for calculating the current state parameter includes the following steps:
inputting a plurality of historical click objects, each historical click object and a first neighbor object thereof into the GCN for calculation to obtain a target feature vector of each historical click object in the plurality of historical click objects;
and inputting the target characteristic vectors of the plurality of historical click objects into the GRU for calculation so as to obtain the current state parameters.
After the target characteristic vector of each historical click object in a plurality of historical click objects is obtained, the target characteristic vectors of the historical click objects are input into a GRU network for calculation, so that the current state parameter is obtained.
Specifically, as shown in FIG. 4, assuming that the number of historical click objects is n: the target feature vector of historical click object 1 is input into the GRU network for calculation to obtain a first intermediate vector, and the first intermediate vector is then spliced with the target feature vector of historical click object 2 to obtain a first spliced vector; the first spliced vector is input into the GRU network for calculation to obtain a second intermediate vector; and so on. After the (n-1)-th intermediate vector is obtained, it is spliced with the target feature vector of historical click object n to obtain the (n-1)-th spliced vector, and the (n-1)-th spliced vector is input into the GRU network for calculation to obtain the current state parameters.

The splicing of a feature vector and an intermediate vector is illustrated as follows. Assuming that the feature vector of a historical click object is (0,0,1) and the intermediate vector is (3,0), splicing them yields the spliced vector (0,0,1,3,0).
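The following is a hedged sketch of this sequential computation using PyTorch's GRUCell as an assumed stand-in: carrying the previous intermediate vector as the hidden state plays the role of the splicing step described above, and the final hidden state is taken as the current state parameter. All dimensions are illustrative.

```python
import torch

FEAT, HIDDEN = 16, 32
gru = torch.nn.GRUCell(input_size=FEAT, hidden_size=HIDDEN)

def current_state_parameter(target_feature_vectors):
    h = torch.zeros(1, HIDDEN)                 # initial intermediate vector
    for v in target_feature_vectors:           # historical click objects 1..n
        h = gru(v.view(1, -1), h)              # fuse previous state with vector v
    return h                                   # current state parameter s_t
```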
The term "vector" as referred to herein corresponds to any information having two or more elements associated with a respective vector dimension.
S203, determining a candidate recommendation object set according to the knowledge graph, all the objects and a plurality of historical click objects.
The candidate recommendation object set includes the plurality of historical click objects and the objects adjacent to each of the historical click objects within m-th order in the knowledge graph. The value of m may be set manually according to different situations, for example according to different application scenarios or requirements; specifically, to improve recommendation efficiency, the value of m may be reduced so as to reduce the number of recommendation objects in the candidate recommendation object set. All objects are all elements of a certain category or a certain field; for example, if the object is an application, all objects include all the applications in the application store. All objects may of course also be all elements under other categories, which is not specifically limited in this application.
Taking recommending a movie to a user as an example, the knowledge graph is shown in FIG. 5, where the knowledge graph includes nodes such as "movie", "classification", "country" and "language", and relationships such as "belongs to country" and "belongs to type". For example, the movie "The Hanoi Hilton" belongs to the categories of drama and war, and is an English-language movie from the United States. Suppose the user has seen the two movies "The Five-Year Engagement" and "The Metal Party". If the candidate recommendation object set includes the objects that are neighbors of the historical click objects within 2nd order on the knowledge graph, the candidate recommendation object set includes "The Five-Year Engagement", "The Metal Party" and "The Hanoi Hilton". Further, suppose the user has seen the two movies "Decision Before Dawn" and "Millionaire for Christy", so that the plurality of historical click objects include "Decision Before Dawn" and "Millionaire for Christy". If the candidate recommendation object set includes the objects that are neighbors of the historical click objects within 2nd order on the knowledge graph, the candidate recommendation object set includes "Decision Before Dawn", "Millionaire for Christy", "Heuchun" and "Chu Chin Chow". A sketch of this candidate set construction follows.
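Under the same assumptions as the earlier neighbors_within_k sketch (a hypothetical adjacency mapping, not the patented implementation), step S203 can be pictured as the union of the historical click objects and their m-th-order neighborhoods:

```python
# Candidate recommendation object set: the historical click objects plus
# their neighbors within m-th order. Reuses the hypothetical
# neighbors_within_k from the earlier sketch.
def candidate_recommendation_set(graph, historical_clicks, m):
    candidates = set(historical_clicks)
    for v in historical_clicks:
        candidates |= neighbors_within_k(graph, v, m)
    return candidates
```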
S204, calculating according to the current state parameters to obtain the expected value of each candidate recommended object in the candidate recommended object set, and determining the candidate recommended object corresponding to the maximum expected value as the target recommended object.
Optionally, the expected value includes an expected profit, a Q value, or a probability.
Specifically, the current state parameter and each candidate recommended object in the candidate recommended object set are input into a selection policy model for calculation, so as to obtain an expected value of each candidate recommended object. In one possible implementation, when the expected value is a Q value, the Q value is used to represent the long-term benefit of the candidate recommendation object. Wherein, the Q value of the candidate recommendation object is as follows:
Figure BDA0002424042320000111
wherein Q is*(st+1,at+1) Representing the long-term benefit available to the candidate recommendation object in the future, gamma representing the rate of discount on the long-term benefit in the future, rtIndicating the immediate benefit, s, of the candidate recommendation objecttAs current state parameter, st+1Is a state parameter of the next moment, atRecommending objects for the current candidate, at+1And recommending the object as the candidate at the next moment.
After the Q value of each candidate recommendation object in the candidate recommendation object set is obtained, the candidate recommendation object corresponding to the maximum Q value is determined as the target recommendation object.
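For illustration, a minimal sketch of this greedy selection step in Python; the q_network callable and the candidate embeddings are assumptions for illustration and are not defined by this application:

```python
import torch

def select_target(q_network, state, candidates):
    """Score every candidate under the current state and return the
    candidate with the largest Q value (greedy selection).

    q_network, state, and the candidate embeddings are illustrative
    assumptions rather than components fixed by the patent text.
    """
    with torch.no_grad():
        # One Q value per candidate: q_network takes the state vector
        # and a candidate embedding and returns a scalar score.
        q_values = torch.stack([q_network(state, c) for c in candidates])
    best = int(torch.argmax(q_values))   # index of the maximum Q value
    return candidates[best], q_values[best]
```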
The selection policy model includes a Value network or an Advantage network. The Value network is used to evaluate an object or action output by the front end and to give an evaluation value, such as the expected value described above. The Advantage network is used to select an optimal object or action from the objects or actions output by the front end.
Further, after the target recommendation object is determined, it is pushed to the user. When the user clicks the target recommendation object, the target recommendation object becomes a new historical click object and serves as a basis for the next recommendation.
It should be noted that the scheme of the present application is applicable to various recommendation scenarios, such as video recommendation in a video application, music recommendation in a music application, and merchandise recommendation in a shopping application.
Taking movie recommendation to a user as an example again, with the knowledge graph shown in fig. 5: suppose the user has watched the two movies "The Five-Year Engagement" and "Metal Party". These two movies are input into the state generation model for calculation to obtain the current state parameter. Specifically, the GCN maps "The Five-Year Engagement" and "Metal Party" each into a vector, called a target feature vector, and the two target feature vectors are input into the GRU network for calculation to obtain the current state parameter.
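A minimal sketch of such a state generation model, mirroring the GCN-then-GRU flow just described; the layer sizes, the mean-pooling aggregation, and the single-layer fusion are illustrative assumptions rather than the fixed design of this application:

```python
import torch
import torch.nn as nn

class StateGenerator(nn.Module):
    """Sketch: GCN-style neighbor aggregation followed by a GRU over the
    sequence of historical click objects. Dimensions are illustrative."""

    def __init__(self, emb_dim: int = 32, state_dim: int = 64):
        super().__init__()
        self.gcn = nn.Linear(2 * emb_dim, emb_dim)  # fuses self + neighbor features
        self.gru = nn.GRU(emb_dim, state_dim, batch_first=True)

    def target_vector(self, item_emb, neighbor_embs):
        # Aggregate the neighbor features (here by mean pooling),
        # then fuse with the item's own feature vector.
        neigh = neighbor_embs.mean(dim=0)
        return torch.relu(self.gcn(torch.cat([item_emb, neigh])))

    def forward(self, items, neighbors):
        # items: list of item embeddings; neighbors: list of neighbor matrices
        seq = torch.stack([self.target_vector(i, n) for i, n in zip(items, neighbors)])
        _, state = self.gru(seq.unsqueeze(0))   # run the GRU over the click sequence
        return state.squeeze(0).squeeze(0)      # current state parameter s_t
```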
If the candidate recommendation object set includes objects that are within 2nd-order adjacency of the historical click objects on the knowledge graph, the candidate recommendation object set includes "The Five-Year Engagement", "Metal Party", and "The Hanoi Hilton".
The three movies "The Five-Year Engagement", "Metal Party", and "The Hanoi Hilton" are evaluated using the selection policy model and the current state parameter to obtain three Q values, for example 0.1, 0.2, and 0.3 respectively; the movie "The Hanoi Hilton", corresponding to the maximum Q value, is determined as the target recommendation object and is recommended to the user.
In one possible embodiment, the state generation model described above is implemented based on a GCN and a GRU. The GRU (gated recurrent unit) is a variant of the recurrent neural network (RNN).
An RNN is used to process sequence data. In a traditional neural network model, the layers from the input layer through the hidden layer to the output layer are fully connected, while the nodes within each layer are unconnected. Although ordinary neural networks solve many problems, they are still inadequate for many others. For example, to predict the next word in a sentence, one generally needs the preceding words, because the words in a sentence are not independent of one another. An RNN is called a recurrent neural network because the current output of a sequence also depends on the previous outputs. Concretely, the network memorizes earlier information and applies it to the calculation of the current output; that is, the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. In theory, an RNN can process sequence data of any length. RNN training uses the error back-propagation algorithm, with one difference: when the RNN is unrolled over time, its parameters, such as the weights W, are shared across all time steps, which is not the case in the traditional neural networks exemplified above. Moreover, when using the gradient descent algorithm, the output of each step depends not only on the network at the current step but also on the network states of the preceding steps. This learning algorithm is called back propagation through time (BPTT).
The aim of an RNN is to give a machine the ability to memorize as humans do. The output of an RNN therefore needs to depend both on the current input information and on historical memory information.
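For illustration, a minimal plain RNN in Python showing the recurrence described above; the same parameters are reused at every time step, which is the property BPTT exploits. The names and sizes are illustrative assumptions:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    """Run a plain RNN over a sequence of input vectors xs.

    The parameters (W_xh, W_hh, b) are shared across all time steps,
    and each hidden state depends on the current input and the previous
    hidden state: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b).
    """
    h = np.zeros(W_hh.shape[0])       # initial hidden state (the "memory")
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        states.append(h)
    return states                     # one hidden state per time step
```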
Further, the recurrent neural network includes a simple recurrent unit (SRU) network. The SRU network has the advantages of being simple, fast, and more interpretable.
It should be noted that the recurrent neural network may also adopt other specific implementation forms.
In an optional embodiment, before the recommendation, a state generation model and a selection strategy model are obtained, wherein the state generation model and the selection strategy model are obtained through joint training.
It should be noted that the state generation model and the selection strategy model being obtained by joint training means that one loss function is used to train the state generation model and the selection strategy model together.
Specifically, training sample data is obtained, where the training sample data includes a plurality of object samples and their corresponding recommended objects. The plurality of object samples are input into the state generation model for calculation to obtain training state parameters; a plurality of candidate object samples are obtained based on the plurality of object samples; the plurality of candidate object samples and the training state parameters are input into the selection strategy model for calculation to obtain an expected value of each candidate object sample, and the candidate object sample corresponding to the maximum expected value is determined as the target object sample. Finally, a loss value is determined according to the loss function, the recommended objects corresponding to the plurality of object samples, and the target object sample; the parameters in the state generation model and the selection strategy model are adjusted based on the loss value, and the above steps are repeated, thereby training the state generation model and the selection strategy model.
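A schematic sketch of one such joint training step; the cross-entropy loss against the logged recommended object, and all tensor names, are assumptions for illustration, since this application does not fix a specific loss function:

```python
import torch
import torch.nn.functional as F

def joint_train_step(state_model, policy_model, optimizer, batch):
    """One joint update of the state generation and selection policy models.

    batch is assumed to provide object samples (with neighbors), the
    candidate samples derived from them, and the index of the logged
    recommended object among the candidates; all names are illustrative.
    """
    state = state_model(batch["items"], batch["neighbors"])  # training state parameter
    scores = torch.stack([policy_model(state, c) for c in batch["candidates"]])
    # The loss compares the scores against the logged recommendation
    # (batch["label"]: index of the recommended object, shape (1,)),
    # so gradients flow into both models at once (joint training).
    loss = F.cross_entropy(scores.unsqueeze(0), batch["label"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```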
The scheme of the present application was tested on the Book-Crossing dataset and the MovieLens-20M dataset, with the following results:
Table 1: statistical information of the datasets (presented as an image in the original filing).
The test set-up was as follows:
The dataset is split into a training set and a test set by user: 80% of the users are used as the training set and 20% of the users as the test set.
For prediction accuracy, we choose the industry-accepted metrics Precision (higher is better) and Recall (higher is better). We also compare the reward obtained by the various models. When measuring Precision and Recall, the length of the recommendation list is set to 32, and an item is considered relevant when its score is above 3.0 and irrelevant otherwise.
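A sketch of the two metrics as configured here (list length 32, relevance threshold 3.0); the function name and data layout are illustrative assumptions:

```python
def precision_recall_at_k(recommended, ratings, k=32, threshold=3.0):
    """Precision@k and Recall@k with score-based relevance.

    recommended: ranked list of item ids; ratings: dict item id -> score.
    An item is relevant when its score exceeds the threshold.
    """
    relevant = {i for i, r in ratings.items() if r > threshold}
    hits = sum(1 for i in recommended[:k] if i in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```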
The reward is defined as a weighting of the user's current score on the item and the user's historical scoring behavior: $r(s, a) = r_{ij} + \alpha \times (c_p - c_n)$, where $r_{ij}$ is the score of user i on item j (defaulting to 0 if absent from the dataset), $c_p$ and $c_n$ are the numbers of consecutive positive scores and consecutive negative scores respectively, and $\alpha$ is a weight. The length of each recommendation episode is set to 32. In the experiments, at most 2nd-order neighbor objects are considered.
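The reward computation itself is then a one-liner; the default value of $\alpha$ below is an illustrative placeholder, as the excerpt does not state the weight actually used:

```python
def reward(score, consec_pos, consec_neg, alpha=0.1):
    """r(s, a) = r_ij + alpha * (c_p - c_n).

    score: the user's rating of the item (0.0 if missing from the dataset);
    consec_pos / consec_neg: counts of consecutive positive / negative
    scores so far; alpha is a weight (0.1 is an illustrative value).
    """
    return score + alpha * (consec_pos - consec_neg)
```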
Through testing, the following conclusions can be obtained:
In terms of prediction precision and reward, the scheme of the present application achieves the best experimental results on both real datasets (p value below 0.05, indicating that the results are significant). In the model training and decision-making processes, fewer samples are needed to achieve the same benefit; compared with existing work, the method offers higher sample-collection efficiency and lower time complexity for training and prediction.
It can be seen that in the scheme of this embodiment, the current state parameter is obtained from the historical click objects and their neighbor objects, so the current state parameter captures not only the user's historical behavior preference but also information from the neighbor objects. This improves the representation range and capability of the model, allows the user's interests to be captured more accurately, and therefore improves recommendation accuracy. Because an effective candidate recommendation object set is screened out based on the knowledge graph, the whole action space does not need to be traversed, which improves both training efficiency and prediction efficiency. Sampling efficiency is also improved: by raising sample utilization, the model can be trained to the same effect as other models with fewer samples.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a knowledge-graph-based recommendation apparatus according to an embodiment of the present application. As shown in fig. 6, the recommendation apparatus 600 includes:
an obtaining unit 601, configured to obtain a plurality of historical click objects and a first neighbor object of each of the plurality of historical click objects, where the first neighbor object of each of the historical click objects includes an object that is adjacent to the historical click object in k-th order on a knowledge graph; k is an integer greater than 0; acquiring current state parameters according to a plurality of historical click objects and a first neighbor object of each historical click object;
a determining unit 602, configured to determine a set of candidate recommended objects according to the knowledge graph, all the objects, and the plurality of historical click objects, where the set of candidate recommended objects includes the plurality of historical click objects and a second neighbor object that includes an object in the knowledge graph that is adjacent to each of the plurality of historical click objects within m-th order; m is an integer greater than 0; all the objects comprise a plurality of historical click objects and a first neighbor object of each historical click object;
a calculating unit 603, configured to calculate, according to the current state parameter, an expected value of each candidate recommended object in the candidate recommended object set,
the determining unit 602 is further configured to determine the candidate recommended object corresponding to the maximum expected value as the target recommended object.
Optionally, the expected value may be an expected profit, a Q value, or a probability.
In a possible embodiment, in terms of obtaining the current state parameter according to a plurality of historical click objects and the first neighbor object of each historical click object thereof, the obtaining unit 601 is specifically configured to:
calculating to obtain a target characteristic vector of each historical click object in a plurality of historical click objects and a first neighbor object of the historical click object; and acquiring current state parameters according to the target characteristic vectors of the plurality of historical click objects.
In a possible embodiment, in terms of calculating a target feature vector of a historical click object according to each historical click object of the plurality of historical click objects and the first neighbor object of the historical click object, the obtaining unit 601 is specifically configured to:
mapping each historical click object and a first neighbor object thereof in a plurality of historical click objects into a vector respectively to obtain a feature vector of each historical click object and a feature vector of the first neighbor object thereof; calculating according to the feature vector of the first neighbor object of each historical click object in a plurality of historical click objects to obtain the neighbor feature vector of the historical click object; and calculating a target characteristic vector of each historical click object according to the characteristic vector and the neighbor characteristic vector of each historical click object in the plurality of historical click objects.
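As an illustration of this three-step computation (map objects to feature vectors, aggregate the neighbor feature vectors, fuse into a target feature vector), a minimal sketch; the mean aggregation and the learned fusion matrix W are assumptions for illustration:

```python
import numpy as np

def target_feature_vector(item_vec, neighbor_vecs, W):
    """Fuse an item's own feature vector with its neighbor feature vector.

    item_vec: feature vector of the historical click object;
    neighbor_vecs: feature vectors of its first neighbor objects;
    W: a learned fusion matrix of shape (out_dim, 2 * in_dim), an
    assumption here, as the patent leaves the exact combination open.
    Mean pooling stands in for the neighbor aggregation step.
    """
    neighbor_feature = np.mean(neighbor_vecs, axis=0)  # aggregate neighbors
    fused = np.concatenate([item_vec, neighbor_feature])
    return np.tanh(W @ fused)                          # target feature vector
```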
In a possible embodiment, in terms of obtaining the current state parameter according to a plurality of historical click objects and the first neighbor object of each historical click object thereof, the obtaining unit 601 is configured to:
inputting a plurality of historical click objects, each historical click object in the plurality of historical click objects and a first neighbor object thereof into a state generation model for calculation to obtain a current state parameter;
where the state generation model includes a graph convolutional network (GCN) and a gated recurrent unit (GRU), and the calculation of the current state parameter includes the following steps:
inputting each historical click object in a plurality of historical click objects and a first neighbor object of the historical click object into a GCN for calculation to obtain a target feature vector of the historical click object; and inputting the target characteristic vectors of the plurality of historical click objects into the GRU for calculation so as to obtain the current state parameters.
In a possible embodiment, the computing unit 603 is specifically configured to:
and inputting the current state parameter and each candidate recommendation object in the candidate recommendation object set into a selection strategy model for calculation to obtain the expected value of the candidate recommendation object, where the selection strategy model includes a Value network or an Advantage network.
In a possible embodiment, the obtaining unit 601 is further configured to:
before obtaining a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects, obtaining a state generation model and a selection strategy model, wherein the state generation model and the selection strategy model are obtained through joint training.
In a possible embodiment, the state generating model and the selection policy model are obtained by joint training, which specifically includes:
inputting a plurality of object samples into the state generation model for calculation to obtain training state parameters; obtaining a plurality of candidate object samples according to the plurality of object samples; inputting the plurality of candidate object samples and the training state parameters into a selection strategy model for calculation to obtain an expected value of each candidate object sample, and determining the candidate object sample corresponding to the maximum expected value as a target object sample; and determining a loss value according to the loss function, the recommended objects corresponding to the plurality of object samples, and the target object sample, and adjusting parameters in the state generation model and the selection strategy model according to the loss value.
It should be noted that the above units (the acquiring unit 601, the determining unit 602, and the calculating unit 603) are configured to execute relevant contents of the methods shown in the above steps S201 to S204.
In the present embodiment, the recommendation device 600 is presented in the form of units. Here, a unit may refer to an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the described functionality. Furthermore, the obtaining unit 601, the determining unit 602, and the calculating unit 603 may be implemented by the processor 700 of the recommendation device shown in fig. 7.
The recommendation device may be implemented with the structure shown in fig. 7. The recommendation device 700 includes at least one processor 701, at least one memory 702, and at least one communication interface 703, which are connected by a communication bus and communicate with each other.
The communication interface 703 is used for communicating with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 702 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), optical disc storage (including a compact disc, a laser disc, a digital versatile disc, a Blu-ray disc, etc., and including CD-ROM), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may exist independently and be coupled to the processor via the bus, or may be integrated with the processor.
The memory 702 is used for storing application program codes for executing the above schemes, and the processor 701 controls the execution. The processor 701 is configured to execute application program code stored in the memory 702.
The memory 702 stores code that may perform one of the knowledge-graph based recommendation methods provided above.
The processor 701 may also employ one or more integrated circuits for executing related programs, so as to implement the recommendation method or the model training method of the embodiments of the present application.
The processor 701 may also be an integrated circuit chip having signal processing capability. In implementation, the steps of the recommendation method of the present application, and of the training method for the state generation model and the selection strategy model, may be completed by integrated logic circuits of hardware in the processor 701 or by instructions in the form of software. The processor 701 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and module block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or EPROM, or a register. The storage medium is located in the memory 702; the processor 701 reads the information in the memory 702 and, in combination with its hardware, completes the knowledge-graph-based recommendation method of the embodiments of the present application.
The communication interface 703 enables communication between the recommendation device and other devices or communication networks using a transceiver apparatus such as, but not limited to, a transceiver. For example, the historical click objects and their neighbor objects may be obtained through the communication interface 703.
A bus may include a pathway to transfer information between various components of the device (e.g., memory 702, processor 701, communication interface 703). In one possible embodiment, the processor 701 specifically performs the following steps:
acquiring a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects, wherein the first neighbor object of each historical click object comprises an object which is adjacent to the historical click object in the k order on a knowledge graph; k is an integer greater than 0; acquiring current state parameters according to a plurality of historical click objects and a first neighbor object of each historical click object; determining a candidate set of recommended objects according to the knowledge graph, all objects and the plurality of historical click objects, wherein the candidate set of recommended objects comprises the plurality of historical click objects and a second neighbor object, and the second neighbor object comprises an object which is adjacent to each historical click object in the plurality of historical click objects in the knowledge graph within m order; m is an integer greater than 0; all the objects comprise a plurality of historical click objects and a first neighbor object of each historical click object; and calculating according to the current state parameters to obtain the expected value of each candidate recommended object in the candidate recommended object set, and determining the candidate recommended object corresponding to the maximum expected value as the target recommended object.
Optionally, the expected value may be an expected profit, a Q value, or a probability.
In one possible embodiment, in obtaining the current state parameter according to a plurality of historical click objects and the first neighbor object of each historical click object, the processor 701 specifically performs the following steps:
calculating to obtain a target characteristic vector of each historical click object in a plurality of historical click objects and a first neighbor object of the historical click object; and acquiring current state parameters according to the target characteristic vectors of the plurality of historical click objects.
In one possible embodiment, in terms of calculating a target feature vector of a historical click object according to each historical click object of a plurality of historical click objects and a first neighbor object of the historical click object, the processor 701 specifically performs the following steps:
mapping each historical click object and a first neighbor object thereof in a plurality of historical click objects into a vector respectively to obtain a feature vector of the historical click object and a feature vector of the first neighbor object thereof; calculating according to the feature vector of the first neighbor object of each historical click object in a plurality of historical click objects to obtain the neighbor feature vector of the historical click object; and calculating a target characteristic vector of each historical click object according to the characteristic vector and the neighbor characteristic vector of each historical click object in the plurality of historical click objects.
In one possible embodiment, in obtaining the current state parameter according to a plurality of historical click objects and the first neighbor object of each historical click object, the processor 701 specifically performs the following steps:
inputting a plurality of historical click objects and a first neighbor object of each historical click object into a state generation model for calculation to obtain a current state parameter;
where the state generation model includes a graph convolutional network (GCN) and a gated recurrent unit (GRU), and inputting the plurality of historical click objects and the first neighbor object of each historical click object into the state generation model for calculation to obtain the current state parameter includes:
inputting each historical click object in a plurality of historical click objects and a first neighbor object of the historical click object into a GCN for calculation to obtain a target feature vector of the historical click object; and inputting the target characteristic vectors of the plurality of historical click objects into the GRU for calculation so as to obtain the current state parameters.
In one possible embodiment, the processor 701 specifically performs the following steps:
and inputting the current state parameter and each candidate recommendation object in the candidate recommendation object set into a selection strategy model for calculation to obtain the expected value of the candidate recommendation object, where the selection strategy model includes a Value network or an Advantage network.
In one possible embodiment, the processor 701 further specifically performs the following steps:
before obtaining a plurality of historical click objects and neighbor objects of each historical click object in the plurality of historical click objects, obtaining a state generation model and a selection strategy model, wherein the state generation model and the selection strategy model are obtained through joint training.
In a possible embodiment, the state generating model and the selection policy model are obtained by joint training, which specifically includes:
inputting a plurality of object samples into the state generation model for calculation to obtain training state parameters; obtaining a plurality of candidate object samples according to the plurality of object samples; inputting the plurality of candidate object samples and the training state parameters into a selection strategy model for calculation to obtain an expected value of each candidate object sample, and determining the candidate object sample corresponding to the maximum expected value as a target object sample; and determining a loss value according to the loss function, the recommended objects corresponding to the plurality of object samples, and the target object sample, and adjusting parameters in the state generation model and the selection strategy model according to the loss value.
The present application provides a computer storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform some or all of the steps of any of the recommendation methods described in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into units is only one kind of logical-function division, and other divisions are possible in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part of it contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disc.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in view of the above, the content of the present specification should not be construed as a limitation to the present application.

Claims (20)

1. A knowledge graph-based recommendation method is characterized by comprising the following steps:
obtaining a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects, wherein the first neighbor object of each historical click object comprises an object adjacent to the historical click object in the k-order on a knowledge graph; k is an integer greater than 0;
obtaining current state parameters according to the plurality of historical click objects and the first neighbor object of each historical click object;
determining a set of candidate recommended objects from the knowledge-graph, all objects, and the plurality of historical click objects, the set of candidate recommended objects including the plurality of historical click objects and a second neighbor object including objects in the knowledge-graph that are adjacent to each of the plurality of historical click objects within m-th order; m is an integer greater than 0; the all objects comprise the plurality of historical click objects and a first neighbor object of each historical click object;
and calculating an expected value of each candidate recommended object in the candidate recommended object set according to the current state parameters, and determining the candidate recommended object corresponding to the maximum expected value as a target recommended object.
2. The method of claim 1, wherein the expected value comprises a Q value, an expected benefit, or a probability.
3. The method according to claim 1 or 2, wherein the obtaining the current state parameter according to the plurality of historical click objects and the first neighbor object of each historical click object comprises:
calculating to obtain a target characteristic vector of each historical click object in the plurality of historical click objects according to the historical click object and a first neighbor object of the historical click object;
and acquiring the current state parameters according to the target characteristic vectors of the plurality of historical click objects.
4. The method of claim 3, wherein computing the target feature vector of each of the plurality of historical click objects according to the historical click object and the first neighboring object of the historical click object comprises:
mapping each historical click object and the first neighbor object thereof into a vector respectively to obtain a characteristic vector of each historical click object and a characteristic vector of the first neighbor object thereof;
calculating according to the feature vector of the first neighbor object of each historical click object to obtain the neighbor feature vector of the historical click object;
and calculating to obtain a target characteristic vector of each historical click object according to the characteristic vector of each historical click object and the neighbor characteristic vector.
5. The method according to any one of claims 1 to 4, wherein the obtaining the current state parameter according to the plurality of historical click objects and the first neighbor object of each historical click object comprises:
inputting each historical click object and a first neighbor object thereof in the plurality of historical click objects into a state generation model for calculation to obtain the current state parameter;
wherein the state generation model comprises a graph convolutional network (GCN) and a gated recurrent unit (GRU), and the step of inputting the plurality of historical click objects, each historical click object in the plurality of historical click objects and a first neighbor object thereof into the state generation model for calculation to obtain the current state parameter comprises:
inputting each historical click object in the plurality of historical click objects and a first neighbor object of the historical click object into the GCN for calculation to obtain a target feature vector of the historical click object;
and inputting the target characteristic vectors of the plurality of historical click objects into the GRU for calculation so as to obtain the current state parameter.
6. The method according to any one of claims 1 to 5, wherein the calculating an expected value of each candidate recommendation object in the set of candidate recommendation objects according to the current state parameter comprises:
inputting the current state parameters and each candidate recommended object in the candidate recommended object set into a selection strategy model for calculation to obtain expected values of the candidate recommended objects,
wherein the selection strategy model comprises a Value network or an Advantage network.
7. The method of claim 6, wherein prior to obtaining the plurality of historical click objects and the first neighbor object of each of the plurality of historical click objects, the method further comprises:
and acquiring the state generating model and the selection strategy model, wherein the state generating model and the selection strategy model are obtained by joint training.
8. The method according to claim 7, wherein the state generation model and the selection strategy model are jointly trained, specifically comprising:
inputting a plurality of object samples into the state generation model for calculation to obtain training state parameters;
obtaining a plurality of candidate object samples according to the plurality of object samples;
inputting the candidate object samples and the training state parameters into a selection strategy model for calculation to obtain an expected value of each candidate object sample in the candidate object samples, and determining the candidate object sample corresponding to the maximum expected value as a target object sample;
and determining a loss value according to the loss function, the recommended objects corresponding to the plurality of object samples, and the target object sample, and adjusting parameters in the state generation model and the selection strategy model according to the loss value.
9. A knowledge-graph-based recommendation device, comprising:
the system comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects, and the first neighbor object of each historical click object comprises an object which is adjacent to the historical click object in the k order on a knowledge graph; k is an integer greater than 0; obtaining current state parameters according to the plurality of historical click objects and the first neighbor object of each historical click object;
a determining unit, configured to determine a set of candidate recommended objects according to the knowledge graph, all objects, and the plurality of historical click objects, where the set of candidate recommended objects includes the plurality of historical click objects and a second neighbor object that includes an object in the knowledge graph that is adjacent to each of the plurality of historical click objects within m-th order; m is an integer greater than 0; the all objects comprise the plurality of historical click objects and a first neighbor object of each historical click object;
and the calculation unit is used for calculating an expected value of each candidate recommended object in the candidate recommended object set according to the current state parameter and determining the candidate recommended object corresponding to the maximum expected value as the target recommended object.
10. The apparatus of claim 9, wherein the expected value comprises a Q value, an expected benefit, or a probability.
11. The apparatus according to claim 9 or 10, wherein, in the aspect of obtaining the current state parameter according to the plurality of historical click objects and the first neighbor object, the obtaining unit is specifically configured to:
calculating to obtain a target characteristic vector of each historical click object in the plurality of historical click objects according to the historical click object and a first neighbor object of the historical click object;
and acquiring the current state parameters according to the target characteristic vectors of the plurality of historical click objects.
12. The apparatus according to claim 11, wherein in the aspect that the target feature vector of the historical click object is obtained by calculating according to each historical click object of the plurality of historical click objects and the first neighbor object of the historical click object, the obtaining unit is specifically configured to:
mapping each historical click object and the first neighbor object thereof into a vector respectively to obtain a characteristic vector of each historical click object and a characteristic vector of the first neighbor object thereof;
calculating according to the feature vector of the first neighbor object of each historical click object to obtain the neighbor feature vector of the historical click object;
and calculating to obtain a target characteristic vector of each historical click object according to the characteristic vector of each historical click object and the neighbor characteristic vector.
13. The apparatus according to any one of claims 9-12, wherein in the aspect of obtaining the current state parameter according to the plurality of historical click objects and the first neighbor object of each historical click object, the obtaining unit is configured to:
inputting each historical click object and a first neighbor object thereof in the plurality of historical click objects into a state generation model for calculation to obtain the current state parameter;
wherein the state generation model comprises a graph convolutional network (GCN) and a gated recurrent unit (GRU), and the step of inputting the plurality of historical click objects, each historical click object in the plurality of historical click objects and a first neighbor object thereof into the state generation model for calculation to obtain the current state parameter comprises:
inputting each historical click object in the plurality of historical click objects and a first neighbor object of the historical click object into the GCN for calculation to obtain a target feature vector of the historical click object;
and inputting the target characteristic vectors of the plurality of historical click objects into the GRU for calculation so as to obtain the current state parameter.
14. The apparatus according to any one of claims 9 to 13, wherein the computing unit is specifically configured to:
inputting the current state parameters and each candidate recommended object in the candidate recommended object set into a selection strategy model for calculation to obtain expected values of the candidate recommended objects,
wherein the selection strategy model comprises a Value network or an Advantage network.
15. The apparatus of claim 14, wherein the obtaining unit is further configured to:
before obtaining a plurality of historical click objects and a first neighbor object of each historical click object in the plurality of historical click objects, obtaining the state generation model and the selection strategy model, wherein the state generation model and the selection strategy model are obtained through joint training.
16. The apparatus according to claim 15, wherein the state generation model and the selection strategy model are jointly trained, specifically comprising:
inputting a plurality of object samples into the state generation model for calculation to obtain training state parameters;
obtaining a plurality of candidate object samples according to the plurality of object samples;
inputting the candidate object samples and the training state parameters into a selection strategy model for calculation to obtain an expected value of each candidate object sample in the candidate object samples, and determining the candidate object sample corresponding to the maximum expected value as a target object sample;
and determining a loss value according to the loss function, the recommended objects corresponding to the plurality of object samples, and the target object sample, and adjusting parameters in the state generation model and the selection strategy model according to the loss value.
17. A recommendation device, comprising:
a memory to store instructions; and
a processor coupled with the memory;
wherein the processor, when executing the instructions, performs the method of any one of claims 1-8.
18. A chip system, wherein the chip system is applied to an electronic device; the chip system comprises one or more interface circuits, and one or more processors; the interface circuit and the processor are interconnected through a line; the interface circuit is to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; the electronic device performs the method of any one of claims 1-8 when the processor executes the computer instructions.
19. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
20. A computer program product comprising computer instructions which, when run on a recommendation device, cause the recommendation device to perform the method of any one of claims 1-8.
CN202010216920.2A 2020-03-24 2020-03-24 Recommendation method and device based on knowledge graph Pending CN113449176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010216920.2A CN113449176A (en) 2020-03-24 2020-03-24 Recommendation method and device based on knowledge graph

Publications (1)

Publication Number Publication Date
CN113449176A true CN113449176A (en) 2021-09-28


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165350A (en) * 2018-08-23 2019-01-08 成都品果科技有限公司 A kind of information recommendation method and system based on deep knowledge perception
CN109902706A (en) * 2018-11-09 2019-06-18 华为技术有限公司 Recommended method and device
CN110287335A (en) * 2019-06-17 2019-09-27 桂林电子科技大学 The personalized recommending scenery spot method and device of knowledge based map and user's shot and long term preference
CN110458663A (en) * 2019-08-06 2019-11-15 上海新共赢信息科技有限公司 A kind of vehicle recommended method, device, equipment and storage medium
CN110674394A (en) * 2019-08-20 2020-01-10 腾讯科技(深圳)有限公司 Knowledge graph-based information recommendation method and device and storage medium
CN110796313A (en) * 2019-11-01 2020-02-14 北京理工大学 Session recommendation method based on weighted graph volume and item attraction model

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023184185A1 (en) * 2022-03-29 2023-10-05 西门子股份公司 Application orchestration method and apparatus
CN115330357A (en) * 2022-10-09 2022-11-11 深圳市奇见科技有限公司 Intelligent stereo garage data management method, device, equipment and storage medium
CN115330357B (en) * 2022-10-09 2022-12-23 深圳市奇见科技有限公司 Intelligent stereo garage data management method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination