CN117112834B - Video recommendation method and device, storage medium and electronic device


Info

Publication number
CN117112834B
CN117112834B (application CN202311384218.7A)
Authority
CN
China
Prior art keywords
video
fusion
feature vector
target
network
Prior art date
Legal status
Active
Application number
CN202311384218.7A
Other languages
Chinese (zh)
Other versions
CN117112834A
Inventor
胡克坤
董刚
曹其春
杨宏斌
Current Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202311384218.7A
Publication of CN117112834A
Application granted
Publication of CN117112834B


Classifications

    • G06F 16/735 Information retrieval of video data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F 16/75 Information retrieval of video data; clustering; classification
    • G06F 16/7834 Retrieval characterised by metadata automatically derived from the content, using audio features
    • G06F 16/7844 Retrieval characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F 16/7847 Retrieval characterised by metadata automatically derived from the content, using low-level visual features of the video content
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a video recommendation method and device, a storage medium and an electronic device. The video recommendation method includes: acquiring a video feature information set, wherein the video feature information set comprises video feature information corresponding to each video in a video set, the video feature information represents the multimodal fusion features of the corresponding video and the relationship features between the corresponding video and the other videos in the video set, the relationship features comprise features of the videos in multiple video viewing dimensions, and the multimodal fusion features comprise the video's own features in multiple modalities; and, when a video is to be recommended to a target user in a user set, determining the target video to be recommended to the target user from the video set according to the target user's user feature information and the video feature information set. This technical solution solves, among other problems, the low matching degree between recommended videos and users in the related art.

Description

Video recommendation method and device, storage medium and electronic device
Technical Field
The embodiment of the application relates to the field of computers, in particular to a video recommendation method and device, a storage medium and an electronic device.
Background
With the rapid spread of Internet technology, the fast development of multimedia technology and the constant evolution of social networks, "video social networking" is spreading quickly as a new social form. Unlike traditional social networks, interaction in video social networks is no longer restricted to text and pictures; users can also communicate by posting videos. Users can watch, comment on and share videos on a video platform and interact with video creators, which greatly enriches their cultural life. However, the increasingly diverse video types and the growing number of videos, while giving users more choices, also create a serious information-overload problem. How to help users find the content they like in a vast sea of videos, so as to meet their personalized needs, is a major challenge for the recommendation system of a video social platform.
Conventional video recommendation methods mainly rely on interaction data between users and videos, and typical approaches include collaborative-filtering-based, content-based and hybrid methods. They usually extract embedded representations of users and/or videos from auxiliary data through manual feature engineering and then feed them into models such as factorization machines or gradient boosting machines to predict a user's preference for a video. Deep-learning-based video recommendation methods exploit the strong representation-learning ability of neural networks to learn user and/or item representations from item auxiliary information and then make predictions based on the similarity between users and videos. However, most of these methods consider only specific types of video auxiliary information; they make full use of neither the complete multimodal auxiliary information of a video nor the semantic relationships among videos, so the recommendation results are unsatisfactory.
For the problems in the related art such as the low matching degree between recommended videos and users, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the application provide a video recommendation method and device, a storage medium and an electronic device, so as to at least solve the problems in the related art such as the low matching degree between recommended videos and users.
According to an embodiment of the present application, there is provided a video recommendation method, including:
acquiring a video feature information set, wherein the video feature information set comprises video feature information corresponding to each video in the video set, the video feature information is used for representing multi-modal fusion features of the corresponding video and relationship features between the corresponding video and other videos in the video set, the relationship features comprise features of the videos in multiple video watching dimensions, and the multi-modal fusion features comprise features of the video itself in multiple modalities;
under the condition that video is recommended to a target user in a user set, determining a target video to be recommended to the target user in the video set according to user characteristic information of the target user and the video characteristic information set;
Recommending the target video to the target user.
Optionally, the acquiring the video feature information set includes:
extracting features of semantic edges from a video multi-modal semantic graph of the video set as the relationship features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for displaying the relationship features between videos in the video set in the form of video vertices and the semantic edges, each video vertex represents one video, and each semantic edge represents one relationship feature;
and adding the relation features to the fusion feature information set to obtain the video feature information set.
Optionally, the extracting features of semantic edges from the multi-modal semantic graphs of videos in the video set as the relationship features, and obtaining the multi-modal fusion features of each video in the video set to obtain a fusion feature information set includes:
converting the video multi-mode semantic graph into a target video adjacency matrix, and obtaining the relation features according to the similarity degree of features between any two video vertexes in the video multi-mode semantic graph represented by the target video adjacency matrix in a plurality of video watching dimensions;
And fusing the characteristics of each video in the video set on a plurality of modes into fused characteristic information to obtain the fused characteristic information set, wherein the fused characteristic information is used for representing the multi-mode fused characteristics of the corresponding video.
Optionally, the converting the video multimodal semantic graph into the target video adjacency matrix includes:
acquiring a transition probability matrix corresponding to the video multi-mode semantic graph, wherein the transition probability matrix is used for indicating the transition probability of a random walker from one video vertex to each adjacent video vertex in the process of using a random walk construction algorithm to walk the video multi-mode semantic graph;
taking the i-th video vertex among the M video vertices in the video multi-mode semantic graph as a root vertex, and, with the root vertex as the starting point of a random walk construction algorithm, performing P random walks of a preset path length according to a meta-path set, a preset restart probability and the transition probability matrix, to obtain Q walk paths corresponding to the i-th video vertex, wherein the Q walk paths form the context of the i-th video vertex, the meta-path set comprises meta-paths used for representing the relationship features between each video and the other videos in the video set, the restart probability indicates the probability that each step of a random walk jumps back to the starting point during each random walk, i is a positive integer greater than or equal to 1 and less than or equal to M, and Q is a positive integer less than or equal to P;
sampling the Q walk paths in sequence with a window of a preset size to obtain the i-th vertex pair list of the i-th video vertex, wherein each sampling records in the i-th vertex pair list the pair of video vertices at the two ends of the sampled window, and the preset window size is greater than 2 and smaller than the preset path length;
and generating the target video adjacency matrix according to M vertex pair lists corresponding to the M video vertices.
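For illustration only, the following Python sketch (not part of the patent; all function and parameter names are hypothetical) shows one way the meta-path-guided random walks with restart and the window sampling described above could be realised, assuming each relation type of the video multi-mode semantic graph is given as a row-stochastic transition matrix:

    import numpy as np

    def random_walks_with_restart(adj_by_relation, meta_paths, root, walk_len=10,
                                  walks_per_path=2, restart_p=0.15, rng=None):
        # adj_by_relation: relation name -> row-stochastic (M x M) transition matrix
        # meta_paths: e.g. [['same_type'], ['same_label'], ['co_view'], ['friend_view']]
        rng = rng if rng is not None else np.random.default_rng(0)
        walks = []
        for path in meta_paths:                       # every meta-path participates once
            for _ in range(walks_per_path):
                walk, cur = [root], root
                for step in range(walk_len - 1):
                    if rng.random() < restart_p:      # restart: jump back to the root vertex
                        cur = root
                    else:
                        rel = path[step % len(path)]  # relation prescribed by the meta-path
                        probs = np.asarray(adj_by_relation[rel][cur], dtype=float)
                        if probs.sum() == 0:          # dead end: stay in place
                            probs = np.zeros(len(probs)); probs[cur] = 1.0
                        cur = int(rng.choice(len(probs), p=probs / probs.sum()))
                    walk.append(cur)
                walks.append(walk)
        return walks                                  # the Q walk paths forming the context

    def window_pairs(walks, window=3):
        # slide a preset-size window over each walk and record the vertex pair at its two ends
        pairs = []
        for walk in walks:
            for s in range(len(walk) - window + 1):
                pairs.append((walk[s], walk[s + window - 1]))
        return pairs                                  # the i-th vertex pair list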
Optionally, the performing, with the root vertex as the starting point of a random walk construction algorithm, P random walks of a preset path length according to a meta-path set, a preset restart probability and the transition probability matrix includes:
randomly taking from the meta-path set a meta-path that has not yet participated in the random walks as the walk meta-path, and, with the root vertex as the starting point of the random walk construction algorithm, performing P random walks of the preset path length according to the walk meta-path taken from the meta-path set, the preset restart probability and the transition probability matrix, until all meta-paths in the meta-path set have participated in the random walks, wherein the meta-paths in the meta-path set include: a first meta-path, a second meta-path, a third meta-path and a fourth meta-path, which respectively represent the same-type relation, the same-label relation, the co-viewing relation and the friend-viewing relation in the video multi-mode semantic graph, wherein the same-type relation indicates that the types of 2 videos are the same, the same-label relation indicates that the labels of 2 videos are the same, the co-viewing relation indicates that 2 videos have been watched by the same user in the user set, and the friend-viewing relation indicates that 2 videos have been watched by a pair of friends in the user set.
Optionally, the generating the target video adjacency matrix according to M vertex pair lists corresponding to M video vertices includes:
updating an initial context co-occurrence matrix according to the M vertex pair lists to obtain a target context co-occurrence matrix, wherein the element O_mq in the m-th row and q-th column of the target context co-occurrence matrix represents the number of times the m-th video and the q-th video co-occur in the same context, the m-th video is the video corresponding to the m-th video vertex, the q-th video is the video corresponding to the q-th video vertex, the target context co-occurrence matrix is a symmetric square matrix of size M×M, and m and q are positive integers greater than or equal to 1 and less than or equal to M;
and generating the target video adjacency matrix according to the target context co-occurrence matrix.
Optionally, the updating the initial context co-occurrence matrix according to the M vertex pair lists to obtain the target context co-occurrence matrix includes:
obtaining, from the M vertex pair lists, the number N_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex, wherein r and t are positive integers greater than or equal to 1 and less than or equal to M, and r is not equal to t;
and increasing the values of the elements O_rt and O_tr of the initial context co-occurrence matrix by N_rt respectively to obtain the target context co-occurrence matrix, wherein all elements of the initial context co-occurrence matrix are 0.
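A minimal sketch of the co-occurrence update described above, under the assumption that the vertex pair lists store unordered index pairs; the normalisation of the counts into adjacency weights is an illustrative choice, not specified by the claims:

    import numpy as np

    def build_cooccurrence(pair_lists, num_videos):
        # accumulate the symmetric M x M context co-occurrence matrix O (all zeros initially)
        O = np.zeros((num_videos, num_videos))
        for pairs in pair_lists:              # one vertex pair list per root video vertex
            for r, t in pairs:
                if r != t:
                    O[r, t] += 1.0            # O_rt and O_tr grow by the same amount,
                    O[t, r] += 1.0            # so O stays a symmetric square matrix
        return O

    def cooccurrence_to_adjacency(O):
        # illustrative normalisation of co-occurrence counts into edge weights
        row_sums = O.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0
        return O / row_sums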
Optionally, the fusing the features of each video in the video set on multiple modalities into fused feature information includes:
the c fusion characteristic information of a c-th video in M videos included in the video set is obtained through the following steps, wherein c is a positive integer greater than or equal to 1 and less than or equal to M:
extracting a visual feature vector of the c-th video, wherein the visual feature vector represents the features of the c-th video in its own visual modality;
extracting an audio feature vector of the c-th video, wherein the audio feature vector represents the features of the c-th video in its own audio modality;
extracting a text feature vector of the c-th video, wherein the text feature vector represents the features of the c-th video in its own text modality;
and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information.
Optionally, the extracting the visual feature vector of the c-th video includes:
sampling the c-th video at a first preset time interval to obtain k_e frame pictures corresponding to the c-th video;
inputting each of the k_e frame pictures into an image feature extraction model to obtain k_e picture feature vectors output by the image feature extraction model;
and generating the visual feature vector of the c-th video according to the k_e picture feature vectors.
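As an illustration only, the visual branch above might be sketched as follows; the names sample_frame_indices and image_model, and the mean pooling used to aggregate the k_e picture feature vectors, are assumptions introduced here, since the claims leave the sampling interval and the aggregation step open:

    import numpy as np

    def sample_frame_indices(num_frames, fps, interval_s=1.0):
        # indices of the k_e frames taken once every interval_s seconds (the first preset interval)
        step = max(int(round(fps * interval_s)), 1)
        return list(range(0, num_frames, step))

    def visual_feature_vector(sampled_frames, image_model):
        # each sampled frame passes through an image feature extraction model (any per-frame
        # extractor, e.g. a pretrained CNN); the k_e picture feature vectors are then pooled
        per_frame = np.stack([image_model(frame) for frame in sampled_frames])  # (k_e, d_e)
        return per_frame.mean(axis=0)                                           # (d_e,)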
Optionally, the extracting the audio feature vector of the c-th video includes:
extracting the audio modality data of the c-th video;
dividing the audio modality data along the time dimension into k_a segments of sub-audio modality data according to a second preset time interval;
inputting each of the k_a segments of sub-audio modality data into an audio feature extraction model to obtain k_a audio segment feature vectors output by the audio feature extraction model;
and generating the audio feature vector of the c-th video according to the k_a audio segment feature vectors.
Optionally, the extracting the text feature vector of the c-th video includes:
extracting, from the text associated with the c-th video, the k_t video texts corresponding to the c-th video;
inputting each of the k_t video texts into a text feature extraction model to obtain k_t text segment feature vectors output by the text feature extraction model;
and generating the text feature vector of the c-th video according to the k_t text segment feature vectors.
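The audio and text branches parallel the visual one; a hedged sketch follows, in which the segment length, the audio_model and text_model extractors, and the mean pooling are all assumptions standing in for the unspecified feature extraction models and aggregation:

    import numpy as np

    def audio_feature_vector(waveform, sample_rate, audio_model, segment_s=2.0):
        # split the audio track along the time dimension into k_a segments of segment_s seconds
        seg_len = max(int(sample_rate * segment_s), 1)
        segments = [waveform[i:i + seg_len] for i in range(0, len(waveform), seg_len)]
        per_segment = np.stack([audio_model(seg) for seg in segments])   # (k_a, d_a)
        return per_segment.mean(axis=0)

    def text_feature_vector(video_texts, text_model):
        # the k_t texts associated with the video are embedded one by one and pooled
        per_text = np.stack([text_model(txt) for txt in video_texts])    # (k_t, d_t)
        return per_text.mean(axis=0)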
Optionally, the fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector includes:
performing D rounds of adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video, to obtain a target capsule network fusion model, wherein the weight parameters include a visual weight parameter, an audio weight parameter and a text weight parameter, the visual weight parameter indicates the weight given to the features of the video in its visual modality when the capsule network fusion model fuses features, the audio weight parameter indicates the weight given to the features of the video in its audio modality when the capsule network fusion model fuses features, and the text weight parameter indicates the weight given to the features of the video in its text modality when the capsule network fusion model fuses features; the capsule network fusion model used for the d-th round is the capsule network fusion model obtained after the (d-1)-th round of weight-parameter adjustment is completed, D is a preset positive integer greater than or equal to 1, d is a positive integer less than or equal to D, and when d takes the value 1, the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not been adjusted;
And fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector.
Optionally, the adjusting the weight parameter of the capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video includes:
the d-th round of adjustment of the weight parameters of the capsule network fusion model is carried out, using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video, through the following steps:
determining a capsule network fusion model obtained after the d-1 th round of adjustment of the weight parameters is completed as a capsule network fusion model used for the d round of fusion;
inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the capsule network fusion model used for the d-th round of fusion, to obtain a reference fusion feature vector output by the capsule network fusion model used for the d-th round of fusion;
and adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the d+1-th round of fusion.
Optionally, the inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the capsule network fusion model used for the d-th round of fusion to obtain a reference fusion feature vector output by the capsule network fusion model used for the d-th round of fusion includes:
the capsule network fusion model used for the d-th round of fusion fuses the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video through the following steps to obtain the reference fusion feature vector output by the capsule network fusion model used for the d-th round of fusion:
respectively carrying out linear transformation on the visual feature vector, the audio feature vector and the text feature vector to obtain corresponding linear visual feature vector, linear audio feature vector and linear text feature vector;
summing a weighted visual feature vector, a weighted audio feature vector and a weighted text feature vector to obtain a weighted fusion feature vector, wherein the weighted visual feature vector is obtained by performing an outer product operation on the visual weight parameter of the capsule network fusion model used for the d-th round of fusion and the linear visual feature vector, the weighted audio feature vector is obtained by performing an outer product operation on the audio weight parameter of the capsule network fusion model used for the d-th round of fusion and the linear audio feature vector, and the weighted text feature vector is obtained by performing an outer product operation on the text weight parameter of the capsule network fusion model used for the d-th round of fusion and the linear text feature vector;
And converting the weighted fusion feature vector into the reference fusion feature vector by adopting a nonlinear activation function.
Optionally, the adjusting the weight parameter of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain a capsule network fusion model to be used in the d+1-th round of fusion includes:
obtaining adjustment parameters, wherein the adjustment parameters comprise visual adjustment parameters, audio adjustment parameters and text adjustment parameters, the visual adjustment parameters are obtained by performing outer product operation on the linear visual feature vector and the weighted fusion feature vector, the audio adjustment parameters are obtained by performing outer product operation on the linear audio feature vector and the weighted fusion feature vector, and the text adjustment parameters are obtained by performing outer product operation on the linear text feature vector and the weighted fusion feature vector;
and performing addition operation on the visual weight parameters of the capsule network fusion model used in the d-th round of fusion and the visual adjustment parameters to obtain the visual weight parameters of the capsule network fusion model used in the d+1-th round of fusion, performing addition operation on the audio weight parameters of the capsule network fusion model used in the d-th round of fusion and the audio adjustment parameters to obtain the audio weight parameters of the capsule network fusion model used in the d+1-th round of fusion, and performing addition operation on the text weight parameters of the capsule network fusion model used in the d-th round of fusion and the text adjustment parameters to obtain the text weight parameters of the capsule network fusion model used in the d+1-th round of fusion.
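The D-round adjustment described above resembles dynamic routing in capsule networks. The sketch below is one possible reading, in which scalar routing weights b stand in for the modality weight parameters, W_e, W_a and W_t are the linear transformations, and the squash function plays the role of the nonlinear activation; these concrete choices (including the softmax normalisation of the weights) are assumptions, not taken from the claims:

    import numpy as np

    def squash(v, eps=1e-9):
        # capsule-style nonlinear activation (one common choice for the claimed nonlinearity)
        n2 = float(np.sum(v ** 2))
        return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

    def capsule_fuse(x_e, x_a, x_t, W_e, W_a, W_t, rounds=3):
        linear = {
            "visual": W_e @ x_e,   # linear visual feature vector
            "audio":  W_a @ x_a,   # linear audio feature vector
            "text":   W_t @ x_t,   # linear text feature vector
        }
        b = {m: 0.0 for m in linear}             # initial, not-yet-adjusted modality weights
        fused = None
        for _ in range(rounds):                  # the D rounds of adjustment
            logits = np.array([b[m] for m in linear])
            c = np.exp(logits - logits.max()); c = c / c.sum()   # normalised weights
            weighted = sum(ci * linear[m] for ci, m in zip(c, linear))
            fused = squash(weighted)             # reference fusion feature vector of this round
            for m in linear:                     # agreement-based update for the next round
                b[m] = b[m] + float(linear[m] @ fused)
        return fused                             # the c-th target fusion feature vector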
Optionally, the determining, in the video set, the target video to be recommended to the target user according to the user feature information of the target user and the video feature information set includes:
determining the similarity between each piece of video characteristic information in the video characteristic information set and the user characteristic information;
and determining the video corresponding to the video characteristic information with the similarity larger than the target similarity threshold as the target video.
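One straightforward realisation of this similarity test, assuming cosine similarity and an additional top-k cut that the claims do not require:

    import numpy as np

    def recommend(user_vec, video_vecs, watched, sim_threshold=0.5, top_k=10):
        # cosine similarity between the user feature vector and every video feature vector
        u = user_vec / (np.linalg.norm(user_vec) + 1e-9)
        V = video_vecs / (np.linalg.norm(video_vecs, axis=1, keepdims=True) + 1e-9)
        sims = V @ u
        candidates = [(float(s), i) for i, s in enumerate(sims)
                      if i not in watched and s > sim_threshold]   # target similarity threshold
        candidates.sort(reverse=True)
        return [i for _, i in candidates[:top_k]]                  # target videos to recommend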
Optionally, before determining the target video to be recommended to the target user in the video set according to the user feature information of the target user and the video feature information set, the method further includes:
acquiring nth user characteristic information in the user characteristic information set, wherein the nth user characteristic information is used for indicating the characteristics of videos preferred by the nth user in the user set;
acquiring an n-th video viewing sequence corresponding to the n-th user, wherein a video viewing sequence records the playing order of the videos already played by the corresponding user, the user set comprises N users, and n is a positive integer greater than or equal to 1 and less than or equal to N;
Acquiring the video characteristic information corresponding to each video in the nth video watching sequence from the video characteristic information set to obtain an nth reference video characteristic information set;
and merging all the reference video feature information in the nth reference video feature information set into a feature vector to obtain the nth user feature information.
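A minimal sketch of merging the reference video feature information into user feature information, assuming simple mean pooling (any other merge, e.g. a sequence model, would equally satisfy the claim):

    import numpy as np

    def user_feature_vector(viewing_sequence, video_feature_set):
        # gather the reference video feature information of every video the user has played
        refs = np.stack([video_feature_set[v] for v in viewing_sequence])
        return refs.mean(axis=0)   # merged into a single user feature vector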
Optionally, the adding the relationship feature to the fused feature information set to obtain the video feature information set includes:
inputting a target video adjacency matrix and the fusion characteristic information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relationship characteristics among video vertices on video types, video labels and watching users and the association degree among videos connected by each relationship characteristic, and the target fusion network is used for updating each fusion characteristic information in the input fusion characteristic information set into corresponding video characteristic information according to the relationship characteristics represented by the target video adjacency matrix to obtain the video characteristic information set.
Optionally, the target fusion network includes:
an input layer and L graph capsule convolution layers, wherein the first of the L graph capsule convolution layers comprises basic video vertex capsules, the remaining graph capsule convolution layers comprise advanced video vertex capsules, and the L-th graph capsule convolution layer further comprises a final video vertex capsule; the basic video vertex capsules are used for performing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix, to obtain the convolution feature vector set output by the first graph capsule convolution layer; the advanced video vertex capsules in the l-th graph capsule convolution layer are used for performing a second convolution operation on the received convolution feature vectors according to the target video adjacency matrix, to obtain the convolution feature vectors input to the advanced video vertex capsules in the (l+1)-th graph capsule convolution layer; and the final video vertex capsule is used for performing a third convolution operation on the convolution feature vectors output by the advanced video vertex capsules in the L-th graph capsule convolution layer, the third convolution operation aggregating the convolution feature vectors into the output video feature vectors.
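A highly simplified sketch of how such a stack of graph capsule convolution layers could propagate the fused features over the target video adjacency matrix; the symmetric normalisation, the per-layer weight matrices and the squash activation are assumptions standing in for the unspecified first, second and third convolution operations:

    import numpy as np

    def squash_rows(X, eps=1e-9):
        n2 = np.sum(X ** 2, axis=-1, keepdims=True)
        return (n2 / (1.0 + n2)) * X / np.sqrt(n2 + eps)

    def graph_capsule_forward(A, X, layer_weights):
        # A: target video adjacency matrix (M x M); X: fusion feature information set (M x d)
        A_tilde = A + np.eye(len(A))                      # add self-loops
        deg = A_tilde.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-9)))
        A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt         # normalised propagation matrix
        H = X
        for W in layer_weights:                           # the L graph capsule convolution layers
            H = squash_rows(A_hat @ H @ W)                # propagate then re-activate capsule-style
        return H                                          # the video feature information set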
Optionally, before the target video adjacency matrix and the fusion feature information set are input to a target fusion network to obtain the video feature information set output by the target fusion network, the method further includes:
acquiring an initial fusion network;
performing X-round video classification training on the initial fusion network to obtain a target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, and the accuracy of the target pre-training fusion network on video classification is greater than the target accuracy;
and performing Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network, wherein Y is a positive integer greater than or equal to 1.
Optionally, the performing the X-round video classification training on the initial fusion network to obtain a target pre-training fusion network includes:
performing the x-th round of video classification training among the X rounds of video classification training on the initial fusion network through the following steps:
in the x-th round of video classification training, classifying the video samples marked with video type labels by using the pre-training fusion network obtained from the (x-1)-th round of video classification training, to obtain classification results;
Generating a first target loss value according to the classification result and the video type label;
and, when the first target loss value does not meet a first preset convergence condition, adjusting the network parameters of the pre-training fusion network used in the x-th round and determining the adjusted pre-training fusion network as the pre-training fusion network to be used in the (x+1)-th round; when the first target loss value meets the first preset convergence condition, determining the pre-training fusion network used in the x-th round as the target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, x is a positive integer greater than or equal to 1 and less than or equal to X, and when x takes the value 1, the pre-training fusion network used in the x-th round is the initial fusion network.
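For illustration, one round of the classification pre-training could be evaluated as below; the cross-entropy loss and the loss-difference convergence test are assumptions for the unspecified "first target loss value" and "first preset convergence condition":

    import numpy as np

    def classification_round_loss(network, video_samples, labels, prev_loss, tol=1e-4):
        # classify the labelled video samples with the pre-training fusion network of this round
        logits = np.stack([network(v) for v in video_samples])          # (B, C)
        shifted = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
        loss = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-9))  # first target loss
        converged = abs(prev_loss - loss) < tol                         # first convergence condition
        return loss, converged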
Optionally, the performing the Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network includes:
performing the y-th round of video recommendation training among the Y rounds of video recommendation training on the target pre-training fusion network through the following steps:
in the y-th round of video recommendation training, using the reference fusion network obtained from the (y-1)-th round of training to generate the (S+1)-th predicted video of a video viewing sequence sample based on the first S videos of that sample, wherein the video viewing sequence sample is a known video viewing sequence recording the playing order of the videos in the video set already played by the corresponding user, the video viewing sequence sample comprises W videos, S is a positive integer greater than or equal to 1 and less than or equal to W, W is a positive integer greater than or equal to 1, and when y takes the value 1, the reference fusion network used in the y-th round is the target pre-training fusion network;
generating a second target loss value according to the (S+1)-th predicted video and the (S+1)-th real video of the video viewing sequence sample;
and, when the second target loss value does not meet a second preset convergence condition, adjusting the network parameters of the reference fusion network used in the y-th round of video recommendation training and determining the adjusted reference fusion network as the reference fusion network to be used in the (y+1)-th round of video recommendation training; when the second target loss value meets the second preset convergence condition, determining the reference fusion network obtained from the y-th round of training as the target fusion network.
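Similarly, a hedged sketch of the "second target loss value" for one viewing-sequence sample, assuming dot-product scoring over all videos and a cross-entropy objective on the (S+1)-th real video:

    import numpy as np

    def next_video_loss(user_history_vec, true_next_index, video_vecs):
        # score every video in the set against the summary of the first S videos of the sample
        scores = video_vecs @ user_history_vec            # (M,)
        scores = scores - scores.max()
        log_probs = scores - np.log(np.exp(scores).sum())
        return -float(log_probs[true_next_index])         # second target loss value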
According to another embodiment of the present application, there is also provided a video recommendation apparatus, including:
an acquisition module, used for acquiring a video characteristic information set, wherein the video characteristic information set comprises video characteristic information corresponding to each video in a video set, the video characteristic information is used for representing multi-mode fusion characteristics of the corresponding video and relationship characteristics between the corresponding video and other videos in the video set, the relationship characteristics comprise characteristics of the videos in multiple video watching dimensions, and the multi-mode fusion characteristics comprise the videos' own characteristics in multiple modes;
The determining module is used for determining target videos to be recommended to the target users in the video set according to the user characteristic information of the target users and the video characteristic information set under the condition that the videos are recommended to the target users in the user set;
and the recommending module is used for recommending the target video to the target user.
According to yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the video recommendation method described above when run.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the video recommendation method described above through the computer program.
In the embodiments of the application, when a video needs to be recommended to a target user in a user set, a video feature information set is acquired, wherein the video feature information set comprises video feature information corresponding to each video in the video set; each piece of video feature information can represent the multimodal fusion features of the corresponding video and the relationship features between that video and the other videos in the video set, the relationship features comprise features of the videos in multiple video viewing dimensions, and the multimodal fusion features comprise the video's own features in multiple modalities. The target video to be recommended to the target user is then determined from the video set according to the target user's user feature information and the video feature information set, and the target video is recommended to the target user. Because the recommended target video is determined with reference to its multimodal fusion features and to the relationship features between it and the other videos in the video set, the recommended target video matches the target user better. This technical solution thus solves, among others, the problem in the related art that recommended videos match users poorly, and achieves the technical effect of improving the matching degree between recommended videos and users.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of a video recommendation method according to an embodiment of the present application;
FIG. 2 is a flow chart of a video recommendation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a heterogeneous information network according to the related art;
FIG. 4 is a schematic diagram of a video recommendation system according to an embodiment of the present application;
FIG. 5 is a schematic diagram of feature vector fusion into fused feature information according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a target fusion network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a video recommendation process according to an embodiment of the present application;
Fig. 8 is a block diagram of a video recommendation device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method embodiments provided in the embodiments of the present application may be executed on a computer terminal, a device terminal or a similar computing apparatus. Taking a computer terminal as an example, fig. 1 is a schematic diagram of the hardware environment of a video recommendation method according to an embodiment of the present application. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a microprocessor MCU, a programmable logic device FPGA or another processing device) and a memory 104 for storing data, and in an exemplary embodiment may also include a transmission device 106 for communication functions and an input/output device 108. It will be appreciated by those skilled in the art that the configuration shown in fig. 1 is merely illustrative and does not limit the configuration of the computer terminal described above. For example, the computer terminal may include more or fewer components than shown in fig. 1, or have a different configuration with functions equivalent to or beyond those shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a video recommendation method in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (NIC) that may be connected to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
In this embodiment, a video recommendation method is provided and applied to the computer terminal, and fig. 2 is a flowchart of a video recommendation method according to an embodiment of the present application, as shown in fig. 2, where the flowchart includes the following steps:
step S202, a video feature information set is obtained, wherein the video feature information set comprises video feature information corresponding to each video in the video set, the video feature information is used for representing multi-mode fusion features of the corresponding video and relationship features between the corresponding video and other videos in the video set, the relationship features comprise features of the videos in multiple video watching dimensions, and the multi-mode fusion features comprise features of the video itself in multiple modes;
Step S204, under the condition that video is recommended to a target user in a user set, determining a target video to be recommended to the target user in the video set according to user characteristic information of the target user and the video characteristic information set;
step S206, recommending the target video to the target user.
Through the above steps, when a video needs to be recommended to a target user in a user set, a video feature information set is acquired, wherein the video feature information set comprises video feature information corresponding to each video in the video set; each piece of video feature information can represent the multimodal fusion features of the corresponding video and the relationship features between that video and the other videos in the video set, the relationship features comprise features of the videos in multiple video viewing dimensions, and the multimodal fusion features comprise the video's own features in multiple modalities. The target video to be recommended to the target user is then determined from the video set according to the target user's user feature information and the video feature information set, and the target video is recommended to the target user. Because the recommended target video is determined with reference to its multimodal fusion features and to the relationship features between it and the other videos in the video set, the recommended target video matches the target user better. This technical solution thus solves, among others, the problem in the related art that recommended videos match users poorly, and achieves the technical effect of improving the matching degree between recommended videos and users.
Before describing the video recommendation method proposed in this application in detail, the basic notation used in this application and the problem to be solved are first introduced. Specifically, vectors are denoted by bold lower-case letters (e.g., e) and scalars by lower-case letters (e.g., u); capital letters denote matrices (e.g., W) and calligraphic letters denote sets.
Let U and V denote the user set and the video set, respectively. Every user u ∈ U is associated with a video viewing sequence S_u = (v_1, v_2, ..., v_{n_u}) drawn from V, where n_u denotes the number of videos user u has viewed. Each video v_j ∈ V contains information in three modalities M = {e, a, t}: a visual modality e, an audio modality a and a text modality t (in this application, "modality" can be understood as "media form"; for example, the visual modality is the visual media form, the audio modality the audio media form, and the text modality the text media form). From this information, different feature extraction methods (see below) produce a visual feature vector x_j^e, an audio feature vector x_j^a and a text feature vector x_j^t, where d_m denotes the feature dimension under a particular modality m ∈ M. In addition, part of the videos carry a predefined category label, encoded as a one-hot label vector y ∈ {0,1}^C, where C denotes the total number of categories.
Formally, the technical problem addressed by this application comprises two closely related sub-problems: (1) the video multimodal semantic fusion problem; (2) the video recommendation problem. The former asks how to fuse the visual feature vector x_j^e, the audio feature vector x_j^a and the text feature vector x_j^t of a video, together with the rich semantic relationships among different videos, into a single feature vector x_j. The latter is to predict the video the user will watch at the next time step, i.e., the next-video recommendation problem. The inputs of the problem are the user set U, the video set V, and the video viewing sequence S_u of every user u ∈ U; the output is the video v most likely to be accessed by user u at the next time step, where v is a video that user u has not accessed before, i.e., v is not contained in S_u.
In the solution provided in step S202, take as an example a video set containing 100 videos and a user set containing 3 users. The video feature information set then contains 100 pieces of video feature information, each representing the multimodal fusion features of the corresponding video and the semantic relationships between that video and the other videos in the video set. Similarly, the user feature information set contains 3 pieces of user feature information, each of which uses the video feature information of the videos the corresponding user has played to represent the features of the videos that user prefers. For example, if user A among the 3 users has played 20 of the 100 videos in the video set, those 20 videos (which can be understood as user A's video viewing sequence) correspond to 20 pieces of video feature information in the video feature information set, and these 20 pieces of video feature information can be used to generate user A's user feature information, representing the features of the videos user A prefers.
Optionally, in this embodiment, the multimodal fusion feature is a fusion of a video's features across multiple modalities. For example, video A has visual features in the visual modality, audio features in the audio modality and text features in the text modality; fusing the visual, audio and text features of video A yields the multimodal fusion feature of video A. Features of video A in modalities other than these three may also be included in the fusion to obtain the multimodal fusion feature.
In one exemplary embodiment, the set of video feature information may be acquired, but is not limited to, by: extracting features of semantic edges from a video multi-modal semantic graph of the video set as the relationship features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for displaying the relationship features between videos in the video set in the form of video vertices and the semantic edges, each video vertex represents one video, and each semantic edge represents one relationship feature; and adding the relation features to the fusion feature information set to obtain the video feature information set.
Optionally, in this embodiment, before introducing the concept of the video multimodal semantic graph, the related concepts and definitions involved in this application need to be described first:
definition 1: heterogeneous Information Networks (HIN) are a network structure that represents and handles different types of vertices and edges. Unlike conventional homography where vertices and edges are homomorphic, vertices and edges in HIN can be of different types and attributes, often used to describe multiple entities and entities in the real worldComplex association relationship between them. Generally, HIN is modeled as a quadWherein (1)>And->Representing a set of vertices and edges, respectively; phi: />And ψ: />Representing vertex type mapping functions and edge type mapping functions, respectively. Here, the->And->Respectively representing a vertex type set and a side type set, and satisfying +.>
Definition 2: Meta-Path. For a given HIN G, a meta-path π of length l has the form A_1 -R_1-> A_2 -R_2-> ... -R_l-> A_{l+1} (which can be abbreviated as A_1 A_2 ... A_{l+1}), where A_i and R_i denote a specific vertex type and a specific edge type in the HIN, respectively.
For a given HIN, one meta-path π may correspond to several concrete paths, called path instances.
Definition 3: a Multimodal Heterogeneous Information Network (MHIN) is a heterogeneous information network that represents the different types of objects in a video recommendation system and the semantic relationships among them. It is typically modeled as G = (V, E, φ, ψ), where the vertex set V includes all users, videos, categories and tags, and the edge set E consists of the different kinds of semantic edges connecting these four types of objects.
The core aim of this application is to learn multimodal-semantics-enhanced video embeddings by exploiting the multimodal features of videos and the semantic relationships among them, so as to improve video recommendation accuracy. To this end, a set of meta-paths over the MHIN is designed to mine the rich semantic relationships among videos, and the MHIN is converted into a homogeneous video multimodal semantic graph by removing the non-video vertices and taking the mined semantic relationships among videos as edges.
Definition 4: a Video Multimodal Semantic Graph (VMG) is a homogeneous information network that represents video objects and the semantic relationships among them. It is typically modeled as G_v = (V, E, W), where V denotes the video set, E denotes the meta-path-based similarity relations between videos, and W is the set of edge weights.
Based on the above definitions, the Heterogeneous Information Network (HIN) of the related art is introduced as follows. For a video set and a user set, the relationships mentioned above between videos, between videos and users, and between users are complex. When recommending videos to users, interaction data between users and videos is usually taken as the recommendation basis; however, a video recommendation system also contains abundant auxiliary data, such as the social relationships on the user side, the multimodal information (visual, audio and text) on the video side, and the categories and tags of videos. This auxiliary information is heterogeneous and complex, and can generally be characterized by a heterogeneous information network that models the different types of entities and the associations between them.
Fig. 3 is a schematic diagram of a heterogeneous information network according to the related art. As shown in Fig. 3, the network includes four entity types, user (u), video (v), category (c) and tag (t), and four relationships between entities: social relationship, viewing relationship, category attribution and tag marking. More hidden relationships between videos can be mined from the existing ones. For example, if users u_i and u_j have both watched video v_k, the second-order connectivity u_i ← v_k ← u_j explicitly captures the behavioral similarity between the two users. The third-order connectivity u_i ← v_k ← u_j ← v_l indicates that user u_i may access video v_l, because the similar user u_j has watched v_l before. Thus, the high-order connectivity contained in the user-item bipartite graph encodes rich collaborative-signal semantics. According to the meta-path "video-tag-video" (e.g., v_4 - t_1 - v_5), a same-label relationship between two videos can be inferred; according to the meta-path "video-user-user-video" (e.g., v_1 - u_1 - u_5 - v_8), a friend-viewing relationship between two videos can be inferred. Therefore, by designing meta-paths that represent different semantics, more hidden relationships among videos can be mined from the HIN, and videos can be recommended to users based on the similarity among videos, improving recommendation accuracy and user satisfaction. Such approaches are called HIN-based recommendation. However, (1) these methods rely on meta-paths alone and ignore the multimodal information, such as visual, audio and text content, contained in the videos; and (2) the different modal information of a video contributes differently to the interest preferences of different users. These two shortcomings mean that HIN-based recommendation cannot accurately measure either the similarity between videos or a user's preference for a video, so the recommendation effect is not ideal. For example, in Fig. 3, according to the meta-path "video-user-user-video" (e.g., v_1 - u_1 - u_2 - v_9 and v_1 - u_1 - u_2 - v_5), videos v_1 and v_9 are watched together by a pair of friends and are thus likely similar; but v_1 and v_5 belong to a romance film and a science-fiction action film respectively, and differ markedly in visual, audio, text and other features, so their similarity is low. Video v_9 is watched by both users u_2 and u_5, but u_2 is drawn to the natural, sincere and profound emotional exchange (text) between the male and female protagonists in v_9, whereas u_5 prefers the visual impact (vision) brought by the natural scenery and rich cultural atmosphere of the Peloponnese region of Greece.
Unlike HIN-based recommendation methods, which focus only on the semantic information between videos, the present application advocates that, when solving video recommendation, the multi-modal information of videos and the rich semantic relationships between videos are complementary, and organically combining the two can effectively improve the accuracy of video recommendation.
Therefore, the application further provides a video recommendation system based on a multi-modal semantic enhanced graph capsule neural network (A Video Recommendation system based on the Multi-modal Semantic enhanced Graph Capsule neural Network, MSGCN), referred to as the video recommendation system for short. The video recommendation system may use the video recommendation method proposed in the present application to recommend videos to users. Fig. 4 is a schematic diagram of a video recommendation system according to an embodiment of the present application. As shown in Fig. 4, the video recommendation system is composed of a multi-modal information preprocessing module, a multi-modal heterogeneous information network construction module, a meta-path module, a video multi-modal semantic graph construction module, a graph capsule neural network module, a user embedding extraction module, and a recommendation module. The multi-modal information preprocessing module is responsible for extracting visual and audio features from the video and cleaning the text information, and then extracting the features of the three modalities of the video, namely vision, audio and text, by means of popular deep learning networks. The multi-modal heterogeneous information network construction module is responsible for extracting the various entities and the semantic relationships among them in the video recommendation system to construct a multi-modal heterogeneous information network. The meta-path module designs four meta-paths for representing four semantic relationships among videos, namely the same type (i.e., the same-type relationship), the same label (i.e., the same-label relationship), the same viewing (i.e., the same-viewing relationship), and common viewing by friends (i.e., the friend-viewing relationship), and a meta-path-based random walk algorithm is executed on the multi-modal heterogeneous information network to extract rich semantic relationships among videos, so as to construct a video multi-modal semantic graph. The graph capsule neural network module is responsible for extracting multi-modal semantic enhanced video embeddings from the video multi-modal semantic graph; these embeddings fuse not only the features of the three modalities of a video, namely vision, audio and text, but also the rich semantic relationships among different videos. The user embedding extraction module extracts multi-modal semantic enhanced user embeddings (i.e., user feature information) from the video viewing sequence of a user. The recommendation module calculates, according to the learned multi-modal semantic enhanced video embeddings (i.e., video feature information) and user embeddings, the probability that the user watches each video that has not been accessed, and returns the video with the highest probability to the user as the recommendation result. By comprehensively considering the visual, audio and text three-modality features of videos and the rich semantic relationships among videos, the recommendation method can greatly improve the accuracy of the video multi-modal embeddings and further improve the accuracy of video recommendation. In addition, the proposed network adopts a pre-training plus fine-tuning training scheme, which can greatly reduce the dependence on the number of labeled samples and improve the network training efficiency.
In an exemplary embodiment, the features of the semantic edges may be extracted from the video multi-modal semantic graph of the video set as the relationship features, and the multi-modal fusion features of each video in the video set may be obtained to form a fusion feature information set, in the following manner: converting the video multi-modal semantic graph into a target video adjacency matrix, and obtaining the relationship features according to the degree of feature similarity, in a plurality of video viewing dimensions, between any two video vertices in the video multi-modal semantic graph represented by the target video adjacency matrix; and fusing the features of each video in the video set over a plurality of modalities into fusion feature information to obtain the fusion feature information set, where the fusion feature information is used for representing the multi-modal fusion features of the corresponding video.
Optionally, in this embodiment, a detailed process of converting the video multimodal semantic graph into the target video adjacency matrix is described as follows:
To construct the video multi-modal semantic graph (VMG) from the multi-modal heterogeneous information network (MHIN), the application designs four meta-paths π1, π2, π3 and π4 for representing the four semantic relationships among videos, namely the same-type relationship, the same-label relationship, the same-viewing relationship and the friend-viewing relationship. A meta-path-based random walk construction algorithm for the VMG adjacency matrix is designed: the meta-path-based random walk algorithm is executed on the MHIN to extract a series of paths (contexts) of a specific length, the co-occurrence frequency of any two videos is calculated by randomly sampling these paths, and this frequency is taken as the meta-path-based similarity of the video pair, finally yielding the video adjacency matrix A. When an element a_jk > 0, videos v_j and v_k are connected by an edge whose weight is a_jk; when a_jk = 0, v_j and v_k are not connected by an edge. The edge set and the edge weight set can therefore be determined from the video adjacency matrix A. Here the video adjacency matrix A is the target video adjacency matrix: its edge set is used for representing the semantic relationships between the video vertices with respect to video type, video label and viewing user, and its edge weight set is used for representing the degree of association between the videos connected by each semantic relationship, so that the video multi-modal semantic graph is further obtained. Given the MHIN and the meta-path set Π = {π1, π2, π3, π4}, the meta-path-based VMG adjacency matrix random walk construction algorithm (BVMG) specifically includes the following steps:

Step 1: initialize the vertex-context co-occurrence matrix O by setting all of its elements to zero;

Step 2: take an unused meta-path π from the meta-path set Π, and calculate the single-step transition probability matrix P of the restart random walk based on the meta-path π. Suppose a random walker is located at vertex v_j of the MHIN at time step τ and the meta-path specifies that the vertex type at time step τ+1 is k_{τ+1}; then at time step τ+1 it moves to a vertex v_k ∈ N(v_j, k_{τ+1}) with probability 1/|N(v_j, k_{τ+1})|, and moves to any vertex outside N(v_j, k_{τ+1}) with probability 0, where N(v_j, k_{τ+1}) denotes the set of all neighbors of vertex v_j in the MHIN whose type is k_{τ+1}. The transition probabilities from each vertex to all of its adjacent vertices are calculated repeatedly in this way;

Step 3: for any vertex v_j in the video vertex set, set v_j as the root vertex and perform on the MHIN a random walk with restart probability γ ∈ (0, 1), transition probability matrix P and path length L_π; repeat this m times to obtain m paths of length L_π, denoted π_{j,1}, π_{j,2}, ..., π_{j,m}; each path is a context ctx of vertex v_j; record the path set of v_j as Π_j;

Step 4: for the path set Π_j of any vertex v_j in the video vertex set, perform sampling with a window size w satisfying 2 < w ≤ L_W: randomly sample a pair of vertices at a time, and collect all sampled pairs into the list Ω_j = [(v_j, v_k) | v_j, v_k ∈ π]; for each vertex pair (v_j, v_k) ∈ Ω_j, update the elements o_jk and o_kj of the vertex-context co-occurrence matrix (which can be understood as the initial context co-occurrence matrix): o_jk ← o_jk + 1, o_kj ← o_kj + 1;

Step 5: repeat Steps 2-4 until all meta-paths in the meta-path set Π have been taken out;

Step 6: according to the vertex-context co-occurrence matrix O (which may be understood as the target context co-occurrence matrix), calculate the probability p(v_j, ctx_k) that vertex v_j appears in context ctx_k, together with its marginal probabilities p(v_j) and p(ctx_k), by normalizing the co-occurrence counts o_jk.

The value of each element a_jk of the VMG video adjacency matrix A is then calculated from p(v_j, ctx_k), p(v_j) and p(ctx_k) obtained above.
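To make the above construction concrete, the following Python sketch walks through the BVMG steps on a toy scale. It is a minimal illustration rather than the patented algorithm: the helper neighbors(v, t) for typed neighbor lookup, the uniform choice among type-matching neighbors, and the PMI-style edge weighting in the last step are assumptions introduced for the example, since the exact formulas are not reproduced above.

import math
import random
from collections import defaultdict

def bvmg(video_vertices, neighbors, meta_paths, m=10, walk_len=20,
         restart_gamma=0.2, window=5):
    """video_vertices: set of video vertex ids; neighbors(v, t): neighbors of v with type t;
    meta_paths: list of type sequences, e.g. ["video", "user", "user", "video"]."""
    co = defaultdict(float)                          # vertex-context co-occurrence O
    for path_types in meta_paths:                    # Steps 2/5: one meta-path at a time
        for root in video_vertices:                  # Step 3: restart random walks per root
            for _ in range(m):
                walk, cur = [root], root
                for step in range(1, walk_len):
                    if random.random() < restart_gamma:
                        cur = root                   # restart with probability gamma
                    nxt_type = path_types[step % len(path_types)]
                    cand = neighbors(cur, nxt_type)
                    if not cand:
                        break
                    cur = random.choice(cand)        # uniform among type-matching neighbors
                    walk.append(cur)
                for i, vj in enumerate(walk):        # Step 4: windowed vertex-pair sampling
                    for vk in walk[i + 1: i + 1 + window]:
                        if vj in video_vertices and vk in video_vertices and vj != vk:
                            co[(vj, vk)] += 1
                            co[(vk, vj)] += 1
    total = sum(co.values()) or 1.0                  # Step 6: normalize counts to probabilities
    row = defaultdict(float)
    for (vj, _vk), c in co.items():
        row[vj] += c
    adj = {}
    for (vj, vk), c in co.items():                   # PMI-style edge weight (assumed form)
        pmi = math.log((c / total) / ((row[vj] / total) * (row[vk] / total)))
        adj[(vj, vk)] = max(pmi, 0.0)
    return adj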
in one exemplary embodiment, the video multimodal semantic graph may be converted to a target video adjacency matrix by, but is not limited to, the following: acquiring a transition probability matrix corresponding to the video multi-mode semantic graph, wherein the transition probability matrix is used for indicating the transition probability of a random walker from one video vertex to each adjacent video vertex in the process of using a random walk construction algorithm to walk the video multi-mode semantic graph; taking an ith video vertex in M video vertices in the video multi-mode semantic graph as a root vertex, taking the root vertex as a starting point of a random walk construction algorithm, expanding random walks with preset path length for P times according to a meta-path set, preset restart probability and the transition probability matrix to obtain Q long walk paths corresponding to the ith video vertex, wherein the Q long walk paths form a context of the ith video vertex, the meta-path set comprises meta-paths used for representing relation characteristics between each video and other videos in the video set, the restart probability is used for indicating the probability of each step of the random walk to jump back to the starting point in the process of each random walk, i is a positive integer greater than or equal to 1 and less than or equal to M, and Q is a positive integer less than or equal to P; sampling the size of a preset window for the Q long-distance walking paths in sequence to obtain an ith vertex pair list of the ith video vertex, wherein when each sampling is recorded in the ith vertex pair list, a pair of video vertices at two ends of the sampling are sampled, and the length of the size of the preset window is more than 2 and less than the length of the preset path; and generating the target video adjacency matrix according to M vertex pair lists corresponding to the M video vertices.
Optionally, in this embodiment, the transition probability matrix corresponding to the video multi-modal semantic graph is obtained, where the transition probability matrix may be understood as the single-step transition probability matrix calculated in Step 2 above.
Optionally, in this embodiment, taking the i-th video vertex of the M video vertices in the video multi-modal semantic graph as the root vertex, taking the root vertex as the starting point of the random walk construction algorithm, and expanding P random walks of the preset path length according to the meta-path set, the preset restart probability and the transition probability matrix to obtain the Q long walk paths corresponding to the i-th video vertex, may be understood as Step 3 above: the i-th video vertex serves as the root vertex, i.e., an arbitrary vertex v_j is set as the root vertex, the meta-path set is Π = {π1, π2, π3, π4}, the preset restart probability is γ ∈ (0, 1), and the transition probability matrix is P; random walks of the preset path length (the preset path length takes the value L_π) are expanded P times (P takes the value m). The m paths of length L_π, namely π_{j,1}, π_{j,2}, ..., π_{j,m}, correspond to the Q long walk paths (Q takes the value m); each path is a context ctx of vertex v_j; the set of the m paths of v_j is recorded as Π_j.
Optionally, in this embodiment, sampling the Q long walk paths in sequence with the preset window size to obtain the i-th vertex pair list of the i-th video vertex, where the i-th vertex pair list records, for each sampling, the pair of video vertices at the two ends of the sampling, and the length of the preset window size is greater than 2 and less than the preset path length, may be understood as Step 4 above: sampling the Q long walk paths in sequence corresponds to performing, for the path set Π_j of any vertex v_j in the video vertex set, sampling with window size 2 < w ≤ L_W (i.e., the preset window size, whose length is greater than 2 and less than the preset path length), and the vertex pair list corresponds to the list Ω_j.
In one exemplary embodiment, the root vertex may be used as the starting point of the random walk construction algorithm, and random walks of the preset path length may be expanded P times according to the meta-path set, the preset restart probability and the transition probability matrix, by, but not limited to, the following manner: randomly taking a meta-path which has not yet participated in the random walk from the meta-path set as the walk meta-path, taking the root vertex as the starting point of the random walk construction algorithm, and expanding random walks of the preset path length P times according to the walk meta-path taken from the meta-path set, the preset restart probability and the transition probability matrix, until all meta-paths in the meta-path set have participated in the random walk, where the meta-paths in the meta-path set include: a first meta-path, a second meta-path, a third meta-path and a fourth meta-path, which respectively represent, in sequence, the same-type relationship, the same-label relationship, the same-viewing relationship and the friend-viewing relationship in the video multi-modal semantic graph, where the same-type relationship represents that the types of 2 videos are the same, the same-label relationship represents that the labels of 2 videos are the same, the same-viewing relationship represents that 2 videos are watched by the same user in the user set, and the friend-viewing relationship represents that 2 videos are watched by 1 pair of friends in the user set.
Optionally, in this embodiment, the first meta-path, the second meta-path, the third meta-path and the fourth meta-path respectively correspond to the four meta-paths π1, π2, π3 and π4 in the meta-path set Π.
In one exemplary embodiment, the target context co-occurrence matrix may be obtained by, but is not limited to, updating the initial context co-occurrence matrix according to the M vertex pair lists, where an element o_mq in the target context co-occurrence matrix represents the number of times the m-th video and the q-th video co-occur in the same context, the m-th video is the video corresponding to the m-th video vertex, the q-th video is the video corresponding to the q-th video vertex, the target context co-occurrence matrix is a symmetric M×M square matrix, and m and q are positive integers greater than or equal to 1 and less than or equal to M; and the target video adjacency matrix is then generated according to the target context co-occurrence matrix.
Optionally, in this embodiment, updating the initial context co-occurrence matrix according to the M vertex pair lists to obtain the target context co-occurrence matrix may be understood as the operations described in Step 4 above, namely "randomly sample a pair of vertices at a time, obtain the list Ω_j = [(v_j, v_k) | v_j, v_k ∈ π] of all vertex pairs; for each vertex pair (v_j, v_k) ∈ Ω_j, update the elements o_jk and o_kj of the vertex-context co-occurrence matrix: o_jk ← o_jk + 1, o_kj ← o_kj + 1", repeated until all meta-paths in the meta-path set Π have been taken out, thereby obtaining the target context co-occurrence matrix.
Alternatively, in this embodiment, generating the target video adjacency matrix according to the target context co-occurrence matrix may be understood as Step 6 above, namely determining every element a_jk of the target video adjacency matrix (the video adjacency matrix A) from the target context co-occurrence matrix (the vertex-context co-occurrence matrix O).
In one exemplary embodiment, the target context co-occurrence matrix may be obtained by, but is not limited to, updating the initial context co-occurrence matrix according to the M vertex pair lists in the following manner: obtaining, from the M vertex pair lists, the number N_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex, where r and t are positive integers greater than or equal to 1 and less than or equal to M, and r is not equal to t; and increasing the values of the elements o_rt and o_tr in the initial context co-occurrence matrix by N_rt respectively to obtain the target context co-occurrence matrix, where all elements of the initial context co-occurrence matrix are 0.
Optionally, in this embodiment, the number N_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex is obtained from the M vertex pair lists, and then the values of the elements o_rt and o_tr in the initial context co-occurrence matrix are increased by N_rt respectively to obtain the target context co-occurrence matrix. For example, if the number N_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex is 1, the values of the elements o_rt and o_tr in the initial context co-occurrence matrix are each increased by 1, i.e., o_rt ← o_rt + 1, o_tr ← o_tr + 1.
In one exemplary embodiment, the features of each video in the video set over multiple modalities may be fused into fusion feature information by, but not limited to, the following manner: the c-th fusion feature information of the c-th video among the M videos included in the video set is obtained through the following steps, where c is a positive integer greater than or equal to 1 and less than or equal to M: extracting a visual feature vector of the c-th video, where the visual feature vector is used for representing the features of the c-th video under its own visual modality; extracting an audio feature vector of the c-th video, where the audio feature vector is used for representing the features of the c-th video under its own audio modality; extracting a text feature vector of the c-th video, where the text feature vector is used for representing the features of the c-th video under its own text modality; and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information.
Optionally, in this embodiment, the fusion feature information may be obtained by fusing feature vectors of multiple modalities with a multi-modal capsule network. First, the concept of the capsule network is introduced: a conventional CNN (Convolutional Neural Network) is formed by stacking multiple convolutional layers, each of which is composed of a number of mutually independent neurons. Each neuron uses a single scalar output to summarize the activity of the repeated feature detectors within a local region; each convolutional layer extracts the features of the local region through convolution kernels, and view invariance is achieved by means of max pooling. In such an architectural design, high-level features are a weighted sum of combinations of low-level features. Although the max pooling operation extracts the most important features of a local region, the relative spatial relationships of different features are ignored, so that the positional relationship between high-level and low-level features becomes ambiguous. To overcome this deficiency, Hinton et al. proposed the capsule network, in which each capsule is responsible for identifying a visual entity implicitly defined within a limited range of viewing conditions and deformations, and outputs both the probability that the entity exists within that range and a set of "instantiation parameters", which may include pose, lighting conditions and deformation information relative to this visual entity. When a visual entity moves within the limited range, the probability that the entity exists in that area is unchanged, but the instantiation parameters change accordingly. That is, the capsule network (CapsNet) can encode spatial information while also calculating the probability of the presence of an object. The output of a capsule may be represented by a vector whose modulus represents the probability that the feature exists and whose direction represents the pose information of the feature.
With reference to the above idea, in this embodiment a multi-modal capsule network is provided to fuse the multi-modal features of a single video. Specifically, Fig. 5 is a schematic diagram of fusing feature vectors into fusion feature information according to an embodiment of the present application. As shown in Fig. 5, for video v_j, the feature vector x_m of a certain modality m ∈ M is multiplied by a parameter matrix W_m to be learned to obtain a new feature vector x̂_m; each new feature vector x̂_m is multiplied by its weight c_m, and the weighted input vectors are summed to obtain a vector s_j; the vector s_j is then converted into the multi-modal feature x_j of video v_j by the nonlinear activation function non_linear_act.
The above multi-modal feature x_j corresponds to the c-th fusion feature information, and the above describes the use of the multi-modal capsule network after training. The multi-modal capsule network may be trained with a multi-modal capsule network dynamic routing algorithm, whose related formulas (7)-(10) respectively define the weighted summation of the transformed modality feature vectors, the normalization of the coupling coefficients, the nonlinear activation, and the update of the coupling coefficients, where:
i ∈ [1, I] denotes the iteration number; b_m and c_m respectively denote the initial (temporary) coupling coefficient and the normalized coupling coefficient between the capsule of a certain modality m ∈ M of the video and the multi-modal capsule; exp denotes the natural exponential function; ‖·‖ denotes the modulus of a vector; and non_linear_act denotes a nonlinear activation function.
Specifically, the multi-modal capsule network dynamic routing algorithm includes the following steps:

Step 1: for the three modality feature vectors x_m of video v_j, perform the linear transformation x̂_m = W_m x_m to obtain new modality feature vectors x̂_m;

Step 2: initialize the temporary coupling coefficients b_m between the three modality feature vectors of video v_j and the capsule network neurons;

Step 3: iteratively perform the following steps; for the i-th (i ∈ [1, I]) iteration, calculate the normalized coupling coefficients c_m between the three modality feature vectors of video v_j and the capsule network neurons;

Step 4: according to the normalized coupling coefficients c_m calculated in the previous step and the three modality feature vectors x̂_m of video v_j, calculate the input vector s_j of the capsule network neurons;

Step 5: apply the non_linear_act nonlinear operation to the input vector s_j obtained in the previous step according to formula (9), so as to calculate the multi-modal fusion feature x_j of video v_j after the i-th (i ∈ [1, I]) iteration;

Step 6: update the temporary coupling coefficients b_m according to formula (10);

Step 7: when i < I, update the iteration number i;

Step 8: repeat Steps 3-7 until i = I, and output the multi-modal fusion feature x_j of video v_j at this time.
In summary, training the multi-modal capsule network through the multi-modal capsule network dynamic routing algorithm, namely updating the coupling coefficients of the multi-modal capsule network for I rounds to obtain the final weights, yields the trained multi-modal capsule network; the trained multi-modal capsule network can then be used to output the target fusion feature vector x_j of video v_j as the fusion feature information.
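As an illustration of the dynamic routing procedure described above, the following NumPy sketch fuses three modality vectors into one capsule output. It is a minimal sketch under stated assumptions: the squash-style nonlinearity used for non_linear_act, the zero initialization of the coupling logits, and the dot-product routing update are standard capsule-network choices assumed here, not quoted from the patent's formulas (7)-(10).

import numpy as np

def squash(s):
    """Assumed form of non_linear_act: scales s so its norm lies in (0, 1)."""
    norm2 = np.dot(s, s)
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)

def fuse_modalities(x_modalities, W, num_iters=3):
    """x_modalities: dict m -> feature vector; W: dict m -> projection matrix."""
    x_hat = {m: W[m] @ x for m, x in x_modalities.items()}   # Step 1: linear transform
    b = {m: 0.0 for m in x_hat}                              # Step 2: temporary coupling logits
    x_j = None
    for _ in range(num_iters):                               # Steps 3-8: I routing iterations
        exp_b = {m: np.exp(v) for m, v in b.items()}
        z = sum(exp_b.values())
        c = {m: exp_b[m] / z for m in exp_b}                 # Step 3: normalized coefficients
        s_j = sum(c[m] * x_hat[m] for m in x_hat)            # Step 4: weighted sum
        x_j = squash(s_j)                                    # Step 5: nonlinear activation
        for m in b:                                          # Step 6: routing update
            b[m] += float(np.dot(x_hat[m], x_j))
    return x_j                                               # multi-modal fusion feature

# Example usage with random 2048/1024/128-dim modality vectors projected to 256 dims.
rng = np.random.default_rng(0)
dims = {"visual": 2048, "audio": 1024, "text": 128}
x = {m: rng.normal(size=d) for m, d in dims.items()}
W = {m: rng.normal(scale=0.01, size=(256, d)) for m, d in dims.items()}
print(fuse_modalities(x, W).shape)   # (256,)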
Optionally, in this embodiment, the multi-modal capsule network shown in Fig. 5 includes: a visual capsule, an audio capsule, a text capsule and a multi-modal capsule, where the feature vector in the visual capsule is the visual feature vector, the feature vector in the audio capsule is the audio feature vector, the feature vector in the text capsule is the text feature vector, and the output of the multi-modal capsule is the target fusion feature vector.
In one exemplary embodiment, the visual feature vector of the c-th video may be extracted by, but not limited to, the following manner: sampling the c-th video at a first preset time interval to obtain k_e frame pictures corresponding to the c-th video; inputting each of the k_e frame pictures into an image feature extraction model to obtain k_e picture feature vectors output by the image feature extraction model; and generating the visual feature vector of the c-th video according to the k_e picture feature vectors.
Alternatively, in this embodiment, the visual feature vector of the c-th video may be extracted by, but not limited to, the following manner: for video v_j, extract k_e frame pictures in an equal-time-interval sampling manner by means of the FFmpeg tool software to form a key frame sequence, and then use a ResNet-152 (a deep convolutional neural network model) pre-trained on the ImageNet dataset to extract their content features. Specifically, each frame f_{j,k} is first randomly cropped to 224×224 and input into the ResNet-152 for feature extraction to obtain a d_e-dimensional (2048 may be taken) visual feature vector; finally, the k_e visual feature vectors of each video are averaged to obtain the final visual feature vector.
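The following sketch mirrors the visual branch described above using an ffmpeg command line for equal-interval frame sampling and a torchvision ResNet-152 pre-trained on ImageNet. The 5-second sampling interval, file paths and preprocessing constants are illustrative assumptions, not values taken from the patent.

import glob, os, subprocess
import torch, torch.nn as nn
from torchvision import models, transforms
from PIL import Image

def extract_visual_feature(video_path, frame_dir="frames", interval_s=5):
    os.makedirs(frame_dir, exist_ok=True)
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-vf", f"fps=1/{interval_s}", f"{frame_dir}/f_%04d.jpg"],
                   check=True)                                  # equal-interval frame sampling
    backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    backbone = nn.Sequential(*list(backbone.children())[:-1]).eval()  # drop fc layer -> 2048-d
    prep = transforms.Compose([transforms.RandomCrop(224),
                               transforms.ToTensor(),
                               transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                    std=[0.229, 0.224, 0.225])])
    feats = []
    with torch.no_grad():
        for fp in sorted(glob.glob(f"{frame_dir}/f_*.jpg")):
            img = prep(Image.open(fp).convert("RGB")).unsqueeze(0)
            feats.append(backbone(img).flatten())               # 2048-d frame feature
    return torch.stack(feats).mean(dim=0)                       # average over the k_e frames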
In one exemplary embodiment, the audio feature vector of the c-th video may be extracted by, but not limited to, the following manner: extracting the audio modality data of the c-th video; dividing the audio modality data into k_a segments of sub-audio modality data in the time dimension according to a second preset time interval; inputting each of the k_a segments of sub-audio modality data into an audio feature extraction model to obtain k_a audio segment feature vectors output by the audio feature extraction model; and generating the audio feature vector of the c-th video according to the k_a audio segment feature vectors.
Alternatively, in this embodiment, the audio feature vector of the c-th video may be extracted by, but not limited to, the following manner: for video v_j, separate the complete audio modality data from it by means of the FFmpeg tool software, divide it equally into k_a segments in the time dimension to form an audio segment sequence, and use a SoundNet (a deep learning model for audio classification and audio understanding) neural network, pre-trained on the ImageNet dataset, to extract a d_a-dimensional (1024 may be taken) audio feature vector for each segment; finally, the k_a audio feature vectors of each video are averaged to obtain the final audio feature vector.
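A corresponding sketch for the audio branch is given below; soundnet_embed is a hypothetical helper standing in for a forward pass of a pre-trained SoundNet model, and the ffmpeg options and 22050 Hz sampling rate are assumptions for the example.

import subprocess
import numpy as np

def extract_audio_feature(video_path, soundnet_embed, k_a=8, sr=22050):
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
                    "-ar", str(sr), "-f", "f32le", "audio.raw"], check=True)
    wave = np.fromfile("audio.raw", dtype=np.float32)
    segments = np.array_split(wave, k_a)                 # equal split in the time dimension
    feats = [soundnet_embed(seg) for seg in segments]    # each -> 1024-d vector (assumed)
    return np.mean(np.stack(feats), axis=0)              # average over the k_a segments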
In one exemplary embodiment, the text feature vector of the c-th video may be extracted by, but not limited to, the following manner: extracting k_t video texts corresponding to the c-th video from the text associated with the c-th video; inputting each of the k_t video texts into a text feature extraction model to obtain k_t text segment feature vectors output by the text feature extraction model; and generating the text feature vector of the c-th video according to the k_t text segment feature vectors.
Alternatively, in this embodiment, the text feature vector of the c-th video may be extracted by, but not limited to, the following manner: the text description of a video includes the video title, video summary, labels, subtitles, user comments and the like, which may be understood, but are not limited to, as the text associated with the video described above; the present application focuses mainly on the video title, summary and labels. For video v_j, the text data associated with the video is first cleaned to remove characters that do not match the language type and stop words, and the text length is aligned to k_t: for texts whose word number nw is greater than k_t, the text is truncated and only the first k_t words are kept; for texts whose word number nw is less than k_t, (k_t − nw) "Null" tokens are used for padding. For each non-"Null" word in the cleaned text data, a d_t = 128-dimensional word vector is generated by means of a pre-trained GloVe (Global Vectors for word representation) model; finally, the k_t text feature vectors of each video are averaged to obtain the final text feature vector.
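The text branch can be sketched in the same spirit; loading GloVe vectors from a local file, the whitespace tokenizer and the default k_t = 32 are assumptions for the example.

import numpy as np

def load_glove(path="glove.txt", dim=128):
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs, dim

def extract_text_feature(title, summary, labels, glove, dim, k_t=32, stopwords=()):
    words = [w.lower() for w in f"{title} {summary} {' '.join(labels)}".split()
             if w.lower() not in stopwords]
    words = words[:k_t] + ["Null"] * max(0, k_t - len(words))   # truncate or pad to k_t
    vecs = [glove[w] for w in words if w != "Null" and w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)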
In one exemplary embodiment, the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video may be fused into the c-th target fusion feature vector by, but not limited to, the following manner: performing D rounds of adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a target capsule network fusion model, where the weight parameters include a visual weight parameter, an audio weight parameter and a text weight parameter, the visual weight parameter is used for indicating the weight of the features of the video under its own visual modality in the process of feature fusion by the capsule network fusion model, the audio weight parameter is used for indicating the weight of the features of the video under its own audio modality in the process of feature fusion by the capsule network fusion model, the text weight parameter is used for indicating the weight of the features of the video under its own text modality in the process of feature fusion by the capsule network fusion model, the capsule network fusion model used for the d-th round is the capsule network fusion model obtained after the (d-1)-th round of weight parameter adjustment, D is a preset positive integer greater than or equal to 1, d is a positive integer less than or equal to D, and when d takes the value 1, the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not been adjusted; and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector.
Optionally, in this embodiment, the capsule network fusion model may, but is not limited to, refer to the multi-modal capsule network illustrated in Fig. 5, and performing D rounds of adjustment on the weight parameters of the capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain the target capsule network fusion model corresponds to the process of updating the coupling coefficients of the multi-modal capsule network for I rounds using the multi-modal capsule network dynamic routing algorithm.
Optionally, in this embodiment, the weight parameters include a visual weight parameter, an audio weight parameter and a text weight parameter, which respectively correspond to the coupling coefficients of the visual capsule, the audio capsule and the text capsule in Fig. 5.
Optionally, in this embodiment, fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector corresponds to the process, in Fig. 5, in which the fully trained multi-modal capsule network (the target capsule network fusion model), obtained after the I rounds of updating, outputs the multi-modal feature x_j (the target fusion feature vector).
In one exemplary embodiment, the D-round adjustment of the weight parameters of the capsule network fusion model may be performed using the visual feature vector, the audio feature vector, and the text feature vector corresponding to the c-th video by, but not limited to: the method comprises the following steps of carrying out d-th round adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video: determining a capsule network fusion model obtained after the d-1 th round of adjustment of the weight parameters is completed as a capsule network fusion model used for the d round of fusion; inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the d-th wheel fusion used capsule network fusion model to obtain a reference fusion feature vector output by the d-th wheel fusion used capsule network fusion model; and adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the d+1-th round of fusion.
Alternatively, in this embodiment, performing D rounds of adjustment on the weight parameters of the capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video corresponds to the process, illustrated in Fig. 5, of updating the coupling coefficients for I rounds with the multi-modal capsule network dynamic routing algorithm, where D may be, but is not limited to, less than or equal to I.
In an exemplary embodiment, the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video may be input to the capsule network fusion model for d-th round fusion, to obtain a reference fusion feature vector output by the capsule network fusion model for d-th round fusion, in the following manner: the capsule network fusion model for the d-th round fusion uses the following steps to fuse the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a reference fusion feature vector output by the capsule network fusion model for the d-th round fusion: respectively carrying out linear transformation on the visual feature vector, the audio feature vector and the text feature vector to obtain corresponding linear visual feature vector, linear audio feature vector and linear text feature vector; performing accumulation and calculation on a weighted visual feature vector, a weighted audio feature vector and a weighted text feature vector to obtain a weighted fusion feature vector, wherein the weighted visual feature vector is obtained by performing outer product operation on a visual weight parameter and a linear visual feature vector of a capsule network fusion model used by the d-th round fusion, the weighted audio feature vector is obtained by performing outer product operation on an audio weight parameter and a linear audio feature vector of a capsule network fusion model used by the d-th round fusion, and the weighted text feature vector is obtained by performing outer product operation on a text weight parameter and a linear text feature vector of a capsule network fusion model used by the d-th round fusion; and converting the weighted fusion feature vector into the reference fusion feature vector by adopting a nonlinear activation function.
Optionally, in this embodiment, performing linear transformation on the visual feature vector, the audio feature vector and the text feature vector respectively to obtain the corresponding linear visual feature vector, linear audio feature vector and linear text feature vector corresponds to Step 1 in the multi-modal capsule network dynamic routing algorithm: for the three modality feature vectors x_m of video v_j, perform the linear transformation x̂_m = W_m x_m to obtain new modality feature vectors x̂_m, where x_m is the visual feature vector, the audio feature vector or the text feature vector.
Optionally, in this embodiment, the weighted visual feature vector is obtained by performing an outer product operation on the visual weight parameter of the capsule network fusion model used in the d-th round of fusion and the linear visual feature vector, the weighted audio feature vector is obtained by performing an outer product operation on the audio weight parameter of the capsule network fusion model used in the d-th round of fusion and the linear audio feature vector, and the weighted text feature vector is obtained by performing an outer product operation on the text weight parameter of the capsule network fusion model used in the d-th round of fusion and the linear text feature vector, corresponding to Step 2, Step 3 and Step 4 in the above multi-modal capsule network dynamic routing algorithm. Taking the calculation of the weighted visual feature vector as an example: given the visual weight parameter b_m of the capsule network fusion model used in the d-th round of fusion, the normalized coupling coefficient c_m is calculated, and the outer product operation of c_m and the linear visual feature vector x̂_m then yields the weighted visual feature vector. Similarly, the weighted audio feature vector and the weighted text feature vector can be obtained.
Alternatively, in this embodiment, performing the summation over the weighted visual feature vector, the weighted audio feature vector and the weighted text feature vector to obtain the weighted fusion feature vector corresponds to formula (7), where s_j denotes the weighted fusion feature vector.
Optionally, in this embodiment, the weighted fusion feature vector is converted into the reference fusion feature vector using the nonlinear activation function according to formula (9), where x_j denotes the reference fusion feature vector and non_linear_act(s_j) denotes the nonlinear operation performed on the weighted fusion feature vector s_j.
In one exemplary embodiment, the weight parameters of the capsule network fusion model used for the d-th round of fusion may be adjusted by, but not limited to, using the reference fusion feature vector to obtain the capsule network fusion model to be used for the d+1-th round of fusion: obtaining adjustment parameters, wherein the adjustment parameters comprise visual adjustment parameters, audio adjustment parameters and text adjustment parameters, the visual adjustment parameters are obtained by performing outer product operation on the linear visual feature vector and the weighted fusion feature vector, the audio adjustment parameters are obtained by performing outer product operation on the linear audio feature vector and the weighted fusion feature vector, and the text adjustment parameters are obtained by performing outer product operation on the linear text feature vector and the weighted fusion feature vector; and performing addition operation on the visual weight parameters of the capsule network fusion model used in the d-th round of fusion and the visual adjustment parameters to obtain the visual weight parameters of the capsule network fusion model used in the d+1-th round of fusion, performing addition operation on the audio weight parameters of the capsule network fusion model used in the d-th round of fusion and the audio adjustment parameters to obtain the audio weight parameters of the capsule network fusion model used in the d+1-th round of fusion, and performing addition operation on the text weight parameters of the capsule network fusion model used in the d-th round of fusion and the text adjustment parameters to obtain the text weight parameters of the capsule network fusion model used in the d+1-th round of fusion.
Optionally, in this embodiment, adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the (d+1)-th round of fusion corresponds to Step 6 in the above multi-modal capsule network dynamic routing algorithm, where one quantity denotes the adjustment parameter (the visual, audio or text adjustment parameter) and the other denotes the corresponding visual, audio or text weight parameter of the capsule network fusion model used in the d-th round of fusion.
In one exemplary embodiment, the determining the target video to be recommended to the target user in the video set according to the user characteristic information of the target user and the video characteristic information set may include, but is not limited to, the following steps: determining the similarity between each piece of video characteristic information in the video characteristic information set and the user characteristic information; and determining the video corresponding to the video characteristic information with the similarity larger than the target similarity threshold as the target video.
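As a concrete illustration of this similarity-threshold variant, the sketch below scores candidate videos by cosine similarity against the user embedding; cosine similarity and the 0.8 threshold are assumptions for the example, since the exact similarity measure is not specified above.

import numpy as np

def recommend_by_similarity(user_vec, video_vecs, threshold=0.8):
    """video_vecs: dict video_id -> embedding; returns ids whose similarity exceeds the threshold."""
    u = user_vec / (np.linalg.norm(user_vec) + 1e-9)
    picked = []
    for vid, v in video_vecs.items():
        sim = float(np.dot(u, v / (np.linalg.norm(v) + 1e-9)))
        if sim > threshold:
            picked.append((vid, sim))
    return sorted(picked, key=lambda p: -p[1])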
In an exemplary embodiment, before determining, in the video set, the target video to be recommended to the target user according to the user feature information of the target user and the video feature information set, the method may include, but is not limited to, the following: acquiring the n-th user feature information in the user feature information set, where the n-th user feature information is used for indicating the features of the videos preferred by the n-th user in the user set; acquiring the n-th video viewing sequence corresponding to the n-th user, where the video viewing sequence records the playing order of the videos already played by the corresponding user, the user set includes N users, and n is a positive integer greater than or equal to 1 and less than or equal to N; acquiring, from the video feature information set, the video feature information corresponding to each video in the n-th video viewing sequence to obtain an n-th reference video feature information set; and merging all the reference video feature information in the n-th reference video feature information set into one feature vector to obtain the n-th user feature information.
Optionally, in this embodiment, each piece of user feature information in the user feature information set may be generated in, but not limited to, the following ways: in the first way, the one-hot encoded user identifier is simply used as the input feature of the user; in the second way, the video viewing sequence of each user and the previously learned video multi-modal semantic enhanced embeddings (which may be understood as the video feature information set described above) are used to learn multi-modal semantic enhanced user embeddings (which may be understood as the user feature information described above). The embeddings of the videos in a user's viewing sequence are all obtained directly, by indexing with the video numbers, from the learned video multi-modal semantic enhanced embeddings (the video feature information set).
Optionally, in this embodiment, merging all the reference video feature information in the nth reference video feature information set into one feature vector to obtain the nth user feature information may be, but is not limited to, implemented by a user vertex capsule:
Firstly, a user vertex capsule function GUcap_u is designed. The video viewing sequence of user u is truncated or padded so that the video viewing sequence length of every user is the same, namely L_U. Specifically, when the sequence length is greater than L_U, the most recently viewed L_U videos are taken from the sequence as the user's video viewing sequence; when the sequence length is less than L_U, the earliest-viewed video in the sequence is repeated for the missing number of times and inserted at the left side of the sequence. For the processed user video viewing sequence, a weighted combination is performed by means of the user vertex capsule GUcap_u and the result is transformed, finally yielding the user embedding, i.e., a piece of user feature information; the embeddings of all users form the user feature information set.
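The following sketch illustrates the truncate-or-pad step and a simple combination of the per-video embeddings; the mean-style weighted combination stands in for the user vertex capsule GUcap_u, whose weights are learned in the patent and are not reproduced here, and L_U = 50 is an assumed value.

import numpy as np

def user_embedding(view_sequence, video_embeddings, L_U=50):
    """view_sequence: list of video ids ordered by viewing time (oldest first)."""
    if len(view_sequence) >= L_U:
        seq = view_sequence[-L_U:]                      # keep the most recent L_U videos
    else:
        pad = [view_sequence[0]] * (L_U - len(view_sequence))
        seq = pad + view_sequence                       # repeat the earliest video on the left
    vecs = np.stack([video_embeddings[v] for v in seq])
    weights = np.full(L_U, 1.0 / L_U)                   # stand-in for the learned capsule weights
    return weights @ vecs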
In one exemplary embodiment, the video feature information set may be obtained by, but is not limited to, adding the relationship feature to the fused feature information set by: inputting a target video adjacency matrix and the fusion characteristic information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relationship characteristics among video vertices on video types, video labels and watching users and the association degree among videos connected by each relationship characteristic, and the target fusion network is used for updating each fusion characteristic information in the input fusion characteristic information set into corresponding video characteristic information according to the relationship characteristics represented by the target video adjacency matrix to obtain the video characteristic information set.
Optionally, in this embodiment, the semantic relationships may be, but are not limited to, added to the fusion feature information set through a target fusion network to obtain the video feature information set. Fig. 6 is a schematic diagram of a target fusion network according to an embodiment of the present application. As shown in Fig. 6, the target video adjacency matrix (A) and the fusion feature information set (H) are input into the target fusion network to obtain the video feature information set (H^{L+1}) output by the target fusion network.
In one exemplary embodiment, the target fusion network may include, but is not limited to: an input layer and L graph capsule convolution layers, where the first graph capsule convolution layer of the L graph capsule convolution layers includes a basic video vertex capsule, the remaining graph capsule convolution layers include advanced video vertex capsules, and the L-th graph capsule convolution layer further includes a final video vertex capsule. The basic video vertex capsule is used for performing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix to obtain the convolution feature vector set output by the first graph capsule convolution layer; the advanced video vertex capsule in the l-th graph capsule convolution layer is used for performing a second convolution operation on the received convolution feature vectors according to the target video adjacency matrix to obtain the convolution feature vectors input to the advanced video vertex capsule in the (l+1)-th graph capsule convolution layer; and the final video vertex capsule is used for performing a third convolution operation on the convolution feature vectors output by the advanced video vertex capsule in the L-th graph capsule convolution layer, where the third convolution operation is used for aggregating the convolution feature vectors to output the video feature information set.
Alternatively, in this embodiment, as shown in Fig. 6, the target fusion network is the graph capsule neural network MSGCN, composed of an input layer and L graph capsule convolution layers. The first graph capsule convolution layer includes a basic video vertex capsule, and applies a GNN to extract local vertex features with different receptive fields; every graph capsule convolution layer except the first includes advanced video vertex capsules, so that the output feature dimension of the l-th (l ∈ [2, L]) graph capsule convolution layer always remains fixed; in addition, the last graph capsule convolution layer further includes a final video vertex capsule so that its output H^{L+1} has the required feature dimension. The specific designs of the three capsules are as follows:
(1) Basic video vertex capsule. In the video multi-modal semantic graph, consider the graph convolution operation on a specific dimension of the multi-modal fusion features of vertex v_j and its neighbor vertex set N(v_j); for simplicity of discussion, v_j is specified to be contained in its own neighbor vertex set, i.e., v_j ∈ N(v_j). For a traditional graph neural network with L stacked graph convolution layers, the graph convolution operation of its l-th (l ∈ [1, L]) layer with respect to the d-th (d ∈ [1, d_l]) dimension of v_j takes the d-th dimensional feature values of v_j and all of its neighbors N(v_j) as input, and outputs a new scalar after the graph convolution calculation of formula (11),
where a_jk represents the weight of the connecting edge between videos v_j and v_k, i.e., the similarity between v_j and v_k, obtained by the meta-path-based VMG adjacency matrix random walk construction algorithm. After the graph convolution operation defined by formula (11) is performed on all d_l dimensional input feature values of video v_j, the new multi-modal fusion feature of v_j is obtained; after linear transformation and nonlinear activation, the layer output is obtained,
where the weight parameter is to be learned, and δ(x) is a nonlinear activation function such as the ReLU function.
In order to capture more local information between a video and its neighbors, on the basis of the traditional graph convolution operation, this application designs a basic video vertex capsule based on higher-order statistical moments of the video multi-modal feature random variables, packaging this local information into so-called instantiation parameters to form an informative basic video vertex capsule, as defined by formula (13),
where the highest order of the statistical moments of the video multi-modal feature random variables is specified, and the mean and variance terms respectively represent the mean and variance of the d-th (d ∈ [1, d_l]) dimensional feature values of video v_j and all of its neighbors N(v_j). Similarly, after the capsule convolution operation defined by formula (13) is performed on all d_l dimensional input feature values of v_j, the new multi-modal fusion feature of v_j is obtained. Thus, for the vertex feature matrix H^1 consisting of the multi-modal fusion features of all videos, the first layer of the designed graph capsule network produces the output H^2, namely the matrix form of the graph capsule convolution operation defined by formula (13), taking the video multi-modal fusion features H^1 and the VMG adjacency matrix A as input. It is easy to see that, as l (l ∈ [1, L]) increases, the output feature dimension of the subsequent graph capsule convolution layers grows rapidly and may even become too large to handle.
(2) Advanced video vertex capsule. For this reason, it is proposed to keep the output feature dimension of the l-th (l ∈ [1, L]) graph capsule convolution layer fixed. This can be achieved as follows: for the input received by the graph capsule network at the l-th (l ∈ [2, L]) layer, for each dimension d (d ∈ [1, d_l]) of the features of every video v_j, vectorize it through the basic video vertex capsule function; then design an advanced video vertex capsule function that fixes the first two dimensions of the resulting tensor (corresponding to video v_j and the d-th dimensional feature, respectively) and performs a weighted combination of the q capsules in the last two dimensions to produce the output (see formulas (14)-(15)).
In formulas (14)-(15), the coupling coefficient between the capsules is calculated by the dynamic routing algorithm, and the weight parameters are to be learned. The transformations defined by formulas (14)-(15) are performed on each dimensional feature of every video v_j, finally yielding the output of the layer. In short, the graph capsule network at the l-th (l ∈ [2, L]) layer receives two inputs, namely the output of the previous layer and the VMG adjacency matrix A, and generates an output; when l = L, the video multi-modal features H^{L+1} are output.
(3) Final video vertex capsule. Similarly, an advanced video vertex capsule function is designed in the manner specified by formulas (14)-(15): the first dimension of H^{L+1} is fixed, a weighted combination is performed over the d_{L+1} capsules in the last two dimensions of H^{L+1}, and the result is output. This transformation is performed for every video v_j, finally obtaining the multi-modal semantic enhanced embeddings of all videos.
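A small NumPy sketch of the basic video vertex capsule idea described in (1): for each vertex, the mean and higher central moments of its edge-weight-weighted neighborhood are computed per feature dimension. The self-loop handling and the default highest order p = 2 are assumptions for the example; the learned linear transformation and nonlinear activation of the full layer are omitted.

import numpy as np

def basic_video_vertex_capsule(H, A, p=2):
    """H: (M, d) multi-modal fusion features; A: (M, M) VMG adjacency weights.
    Returns an (M, d, p) tensor of instantiation parameters built from the first
    p statistical moments of each vertex's weighted neighborhood."""
    M, d = H.shape
    A_hat = A + np.eye(M)                             # v_j is included in its own neighborhood
    caps = np.zeros((M, d, p))
    for j in range(M):
        w = A_hat[j] / (A_hat[j].sum() + 1e-9)        # normalized edge weights of N(v_j)
        mean = w @ H                                  # 1st moment of the neighborhood of v_j
        caps[j, :, 0] = mean
        for k in range(2, p + 1):                     # k-th central moment (k = 2 is the variance)
            caps[j, :, k - 1] = w @ (H - mean) ** k
    return caps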
In an exemplary embodiment, before the target video adjacency matrix and the fusion feature information set are input into a target fusion network to obtain the video feature information set output by the target fusion network, the method may, but is not limited to, further include the following manners: acquiring an initial fusion network; performing X-round video classification training on the initial fusion network to obtain a target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, and the accuracy of the target pre-training fusion network on video classification is greater than the target accuracy; and performing Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network, wherein Y is a positive integer greater than or equal to 1.
Optionally, in this embodiment, the training method of the target fusion network includes a pre-training stage and a fine-tuning stage, and considering that a part of videos have category label information, a video classification task is designed as an auxiliary task to implement pre-training on the target fusion network.
In one exemplary embodiment, the initial fusion network may be, but is not limited to, subjected to X rounds of video classification training to obtain the target pre-training fusion network in the following manner: the x-th round of video classification training among the X rounds of video classification training is executed on the initial fusion network through the following steps: in the x-th round of video classification training, classifying the video samples marked with video type labels by using the pre-training fusion network obtained from the (x-1)-th round of video classification training to obtain a classification result; generating a first target loss value according to the classification result and the video type labels; and, in the case that the first target loss value does not meet a first preset convergence condition, adjusting the network parameters of the pre-training fusion network used by the x-th round and determining the adjusted pre-training fusion network as the pre-training fusion network to be used by the (x+1)-th round, and, in the case that the first target loss value meets the first preset convergence condition, determining the pre-training fusion network used by the x-th round as the target pre-training fusion network, where X is a positive integer greater than or equal to 1, x is a positive integer greater than or equal to 1 and less than or equal to X, and when x takes the value 1, the pre-training fusion network used by the x-th round is the initial fusion network.
Optionally, in this embodiment, the pre-training phase, specifically, includes the following procedure:
The multi-modal semantic enhanced embedding of video v_j (a video sample marked with a video type label) is input into a classifier to predict the probability distribution pr_j over category labels.
By constraining the learned category-label probability distribution of v_j to be similar to its true label y_j, a pre-training loss function loss_p based on supervised learning is designed.
The proposed MSGCN network is pre-trained according to a specific strategy, such as stochastic gradient descent (Stochastic Gradient Descent, SGD), momentum gradient descent (Momentum Gradient Descent, MGD), Nesterov Momentum, AdaGrad, RMSprop, Adam (Adaptive Moment Estimation), or batch gradient descent (Batch Gradient Descent, BGD), to optimize the loss function value until the loss function reaches a minimum or the number of training iterations reaches the specified maximum, at which point the pre-training ends and the network parameters are frozen. Video recommendation is then taken as the main task to fine-tune the pre-trained MSGCN network parameters.
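A sketch of the pre-training stage follows. msgcn and classifier are placeholder torch modules standing in for the network described above; using cross-entropy as the supervised loss loss_p and Adam as the optimization strategy are assumptions for the example.

import torch
import torch.nn.functional as F

def pretrain(msgcn, classifier, A, H, labels, labeled_idx, epochs=100, lr=1e-3):
    """labels: LongTensor of category ids; labeled_idx: indices of labeled video samples."""
    params = list(msgcn.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        video_emb = msgcn(H, A)                       # multi-modal semantic enhanced embeddings
        logits = classifier(video_emb[labeled_idx])   # predict category-label distribution pr_j
        loss_p = F.cross_entropy(logits, labels[labeled_idx])
        loss_p.backward()
        opt.step()
    return msgcn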
In one exemplary embodiment, the target fusion network may be obtained by, but is not limited to, performing Y rounds of video recommendation training on the target pre-training fusion network in the following manner: the y-th round of video recommendation training among the Y rounds of video recommendation training is executed on the target pre-training fusion network through the following steps: in the y-th round of video recommendation training, using the reference fusion network obtained from the (y-1)-th round of training to generate the (S+1)-th predicted video of a video viewing sequence sample based on the first S videos of the video viewing sequence sample, where the video viewing sequence sample is a known video viewing sequence used for recording the playing order of the videos in the video set already played by the corresponding user, the video viewing sequence sample includes W videos, S is a positive integer greater than or equal to 1 and less than or equal to W, W is a positive integer greater than or equal to 1, and when y takes the value 1, the reference fusion network used by the y-th round is the target pre-training fusion network; generating a second target loss value according to the (S+1)-th predicted video and the (S+1)-th real video of the video viewing sequence sample; and, in the case that the second target loss value does not meet a second preset convergence condition, adjusting the network parameters of the reference fusion network used by the y-th round of video recommendation training and determining the adjusted reference fusion network as the reference fusion network to be used by the (y+1)-th round of video recommendation training, and, in the case that the second target loss value meets the second preset convergence condition, determining the reference fusion network obtained by the y-th round of training as the target fusion network.
Optionally, in this embodiment, the fine-tuning stage specifically includes the following process:
A loss function loss_s for the video recommendation task is defined based on the negative log-likelihood function: taking the video recommendation task as the main task, loss_s of the proposed MSGCN network is defined over the videos in the video set that have not been played by the target user and over the video viewing sequence of the target user.
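loss_s itself also appears only as an image in the source; a hedged reconstruction, assuming the usual negative log-likelihood over next-video predictions along the viewing sequence S_u of each user u (S_u and v_t^u are assumed notation), is:

loss_s = -\sum_{u} \sum_{t=1}^{|S_u|-1} \log Pr(v_{t+1}^{u} | v_1^{u}, ..., v_t^{u})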
The pre-trained MSGCN network parameters are then modified and updated according to a specific strategy such as stochastic gradient descent (Stochastic Gradient Descent, SGD), momentum gradient descent (Momentum Gradient Descent, MGD), Nesterov Momentum, AdaGrad, RMSprop, Adam (Adaptive Moment Estimation) or batch gradient descent (Batch Gradient Descent, BGD) to optimize the loss function value, until the loss function reaches a minimum or the number of training iterations reaches a specified maximum. The trained target fusion network makes the (S+1)-th predicted video, recommended from the first S videos, consistent with the (S+1)-th real video of the video viewing sequence sample.
In the technical solution provided in step S204, in the case of recommending a video for a target user in the user set, for example user A, the target user feature information A corresponding to the target user is obtained from the user feature information set, and the video feature information to be played corresponding to each video to be played is obtained from the video feature information set, so as to obtain a video feature information set to be played, where a video to be played is a video in the video set that has not been played by user A.
In the technical solution provided in step S206, according to the video viewing sequence of user u, that is, the user watches one video at time step t=1, watches another video at time step t=2, and so on up to the current time step, a video v with the highest viewing probability is selected from the set of videos not yet viewed by u, as the video that user u is most likely to access next. For such a video v, the probability that user u views it in the next time step is given by formula (16).
In formula (16), the two quantities involved are the target user feature information and any one piece of video feature information in the video feature information set to be played, respectively.
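Formula (16) itself appears in the source only as an image; a hedged reconstruction, assuming the viewing probability is a softmax over inner products between the target user feature vector e_u and the candidate video feature vectors e_v (assumed notation), is:

Pr(v | u) = exp(e_u · e_v) / \sum_{v' not yet viewed by u} exp(e_u · e_{v'})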
In order to better understand the process of recommending the video, the following description is given in connection with the alternative embodiments, but the description is not limited to the technical solution of the embodiments of the present application.
In this embodiment, a video recommendation method is provided, and fig. 7 is a schematic diagram of a video recommendation flow according to an embodiment of the present application, as shown in fig. 7, mainly including the following steps:
step S701: Video datasets are collected and collated. The video recommendation dataset Ω is collected and pre-processed; it contains 40049 micro videos and 1935 different tags. It is divided into a training set Train, a validation set Valid and a test set Vtest in the proportion 60% (i.e., 24029 videos), 20% (i.e., 8010 videos) and 20% (i.e., 8010 videos);
Step S702: Multimodal information preprocessing. Extract the visual feature vector, audio feature vector and text feature vector of each video Vj according to the visual, audio and text feature extraction methods introduced in the multi-modal information preprocessing module;
step S703: video recommendation system graph modeling. Extracting entities and relations thereof from the video recommendation system, and constructing a heterogeneous information network of the video recommendation system;
step S704: and constructing a video multi-mode semantic graph. Designing a meta-path, extracting rich semantic relations among videos from a heterogeneous information network according to a BVMG algorithm, and constructing a video multi-mode semantic graph;
step S705: Design a multi-modal capsule network to fuse the multi-modal features of a single video;
step S706: Design a graph capsule neural network to aggregate the multi-modal features of different videos;
step S707: Learn multi-modal semantic enhanced user embeddings. Design a user vertex capsule network to extract the multi-modal semantic enhanced user embeddings;
step S708: Construct the network model and design the network loss function. Construct the multi-modal capsule network, the graph capsule neural network and the user vertex capsule network respectively, and combine them into the multi-modal semantic enhanced graph capsule neural network (MSGCN). Design the network loss function according to formulas (17) to (19);
Step S709: The network model is initialized and trained. The parameters of each layer of the MSGCN network are initialized according to a specific strategy such as normal-distribution random initialization, Xavier initialization or He initialization, and the network is then pre-trained and fine-tuned;
step S710: video recommendation. For each user u, from the set of videos that the user u has not watched, one video v with the highest viewing probability is calculated and selected according to the formula (16), and recommended to the user.
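As an illustration of step S710, the following sketch picks the top-1 unwatched video for one user, assuming the trained network has produced a user embedding user_emb and a matrix of video embeddings video_emb as NumPy arrays, and that formula (16) is a softmax over inner products (all of which are assumptions used only for this sketch):

import numpy as np

def recommend_top1(user_emb, video_emb, watched_ids):
    """Pick the unwatched video with the highest viewing probability for one user."""
    scores = video_emb @ user_emb                   # inner-product score for every video
    scores = scores - scores.max()                  # numerical stability for the softmax
    probs = np.exp(scores) / np.exp(scores).sum()   # viewing probabilities over the video set
    probs[list(watched_ids)] = -1.0                 # exclude videos the user has already watched
    return int(np.argmax(probs))                    # index of the recommended video v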
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the embodiments of the present application.
FIG. 8 is a block diagram of a video recommender in accordance with an embodiment of the present application; as shown in fig. 8, includes:
a first obtaining module 802, configured to obtain a set of video feature information, where the set of video feature information includes video feature information corresponding to each video in the set of videos, where the video feature information is used to characterize a multimodal fusion feature of the corresponding video and a relationship feature between the corresponding video and other videos in the set of videos, where the relationship feature includes features between videos in multiple video viewing dimensions, and the multimodal fusion feature includes features of the video itself in multiple modalities;
a determining module 804, configured to determine, in a case where a video is recommended to a target user in a user set, a target video to be recommended to the target user in the video set according to user feature information of the target user and the video feature information set;
and a recommending module 806, configured to recommend the target video to the target user.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Through the embodiment, when the video is required to be recommended to the target user in the user set, the video feature information set is obtained, the video feature information set includes video feature information corresponding to each video in the video set, wherein each video feature information can represent a multi-mode fusion feature of the corresponding video and a relationship feature between the corresponding video and other videos in the video set, the relationship feature includes features of the videos in multiple video watching dimensions, the multi-mode fusion feature includes features of the video itself in multiple modes, and then the target video to be recommended to the target user is determined from the video set according to the user feature information and the video feature information set of the target user, and is recommended to the target user. The target video recommended by the method refers to the multimodal fusion characteristics of the target video and the relation characteristics between the target video and other videos in the video set, so that the matching degree of the recommended target video and a target user is higher. By adopting the technical scheme, the problems of low matching degree between the recommended video and the user and the like in the related technology are solved, and the technical effect of improving the matching degree between the recommended video and the user is realized.
In an exemplary embodiment, the first acquisition module includes:
the extraction unit is used for extracting features of semantic edges from a video multi-modal semantic graph of the video set to serve as the relation features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for displaying the relation features between the videos in the video set in a form of video vertexes and the semantic edges, each video vertex represents one video, and each semantic edge represents one relation feature;
and the adding unit is used for adding the relation features to the fusion feature information set to obtain the video feature information set.
In an exemplary embodiment, the extraction unit is further configured to:
converting the video multi-mode semantic graph into a target video adjacency matrix, and obtaining the relation features according to the similarity degree of features between any two video vertexes in the video multi-mode semantic graph represented by the target video adjacency matrix in a plurality of video watching dimensions;
and fusing the characteristics of each video in the video set on a plurality of modes into fused characteristic information to obtain the fused characteristic information set, wherein the fused characteristic information is used for representing the multi-mode fused characteristics of the corresponding video.
In an exemplary embodiment, the extraction unit is further configured to:
acquiring a transition probability matrix corresponding to the video multi-mode semantic graph, wherein the transition probability matrix is used for indicating the transition probability of a random walker from one video vertex to each adjacent video vertex in the process of using a random walk construction algorithm to walk the video multi-mode semantic graph;
taking an ith video vertex in M video vertices in the video multi-mode semantic graph as a root vertex, taking the root vertex as a starting point of a random walk construction algorithm, expanding random walks with preset path length for P times according to a meta-path set, preset restart probability and the transition probability matrix to obtain Q long walk paths corresponding to the ith video vertex, wherein the Q long walk paths form a context of the ith video vertex, the meta-path set comprises meta-paths used for representing relation characteristics between each video and other videos in the video set, the restart probability is used for indicating the probability of each step of the random walk to jump back to the starting point in the process of each random walk, i is a positive integer greater than or equal to 1 and less than or equal to M, and Q is a positive integer less than or equal to P;
sampling the Q long walk paths in sequence with a preset window size to obtain an i-th vertex pair list of the i-th video vertex, wherein the i-th vertex pair list records, for each sampling, the pair of video vertices located at the two ends of the sampling window, and the preset window size is greater than 2 and less than the preset path length;
and generating the target video adjacency matrix according to M vertex pair lists corresponding to the M video vertices.
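A brief sketch of this meta-path-guided random walk with restart and the window sampling is given below; the data structures (a transition-probability dictionary trans_prob keyed by (vertex, edge type) and meta-paths given as sequences of edge types) are assumptions for illustration, not the patent's exact BVMG procedure:

import random

def random_walk(root, trans_prob, meta_path, path_len, restart_p):
    """One walk of length path_len from root, restricted to the edge types of meta_path."""
    walk = [root]
    while len(walk) < path_len:
        if random.random() < restart_p:                    # each step may jump back to the start
            walk.append(root)
            continue
        cur = walk[-1]
        edge_type = meta_path[(len(walk) - 1) % len(meta_path)]
        neighbors = trans_prob.get((cur, edge_type), {})   # {adjacent vertex: transition probability}
        if not neighbors:
            walk.append(root)                              # dead end: restart from the root vertex
            continue
        nodes, probs = zip(*neighbors.items())
        walk.append(random.choices(nodes, weights=probs, k=1)[0])
    return walk

def window_pairs(walk, window):
    """Vertex pairs at the two ends of a sliding window (2 < window < path length)."""
    return [(walk[s], walk[s + window - 1]) for s in range(len(walk) - window + 1)]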
In an exemplary embodiment, the extraction unit is further configured to:
randomly taking, from the meta-path set, a meta-path that has not yet participated in the random walk as a walk meta-path, taking the root vertex as the starting point of the random walk construction algorithm, and expanding random walks of the preset path length P times according to the walk meta-path taken from the meta-path set, the preset restart probability and the transition probability matrix, until all meta-paths in the meta-path set have participated in the random walk, wherein the meta-paths in the meta-path set comprise: a first meta-path, a second meta-path, a third meta-path and a fourth meta-path, which respectively represent, in sequence, the same-type relation, the same-label relation, the same-viewing relation and the friend-viewing relation in the video multi-modal semantic graph, wherein the same-type relation represents that the types of 2 videos are the same, the same-label relation represents that the labels of 2 videos are the same, the same-viewing relation represents that 2 videos are watched by the same user in the user set, and the friend-viewing relation represents that 2 videos are watched together by 1 pair of friends in the user set.
In an exemplary embodiment, the extraction unit is further configured to:
updating an initial context co-occurrence matrix according to the M vertex pair lists to obtain a target context co-occurrence matrix, wherein the element O_mq in the m-th row and q-th column of the target context co-occurrence matrix represents the number of times that the m-th video and the q-th video co-occur in the same context, the m-th video is the video corresponding to the m-th video vertex, the q-th video is the video corresponding to the q-th video vertex, the target context co-occurrence matrix is a symmetric square matrix of size M*M, and m and q are positive integers greater than or equal to 1 and less than or equal to M;
and generating the target video adjacency matrix according to the target context co-occurrence matrix.
In an exemplary embodiment, the extraction unit is further configured to:
obtaining, from the M vertex pair lists, the number N_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex, wherein r and t are positive integers greater than or equal to 1 and less than or equal to M, and r is not equal to t;

increasing the values of the elements O_rt and O_tr in the initial context co-occurrence matrix by N_rt respectively to obtain the target context co-occurrence matrix, wherein all elements of the initial context co-occurrence matrix are 0.
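As a short sketch (assuming a NumPy representation and integer vertex indices, which the patent does not specify), building the symmetric target context co-occurrence matrix from the M vertex pair lists could look like this:

import numpy as np
from collections import Counter

def cooccurrence_matrix(vertex_pair_lists, num_videos):
    """Symmetric M x M matrix counting how often two video vertices share a context."""
    O = np.zeros((num_videos, num_videos), dtype=np.int64)   # initial matrix: all elements 0
    for pairs in vertex_pair_lists:                          # one vertex pair list per root vertex
        for (r, t), n_rt in Counter(pairs).items():          # N_rt for each (r, t) vertex pair
            if r != t:
                O[r, t] += n_rt                              # increase O_rt ...
                O[t, r] += n_rt                              # ... and O_tr by N_rt
    return O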
In an exemplary embodiment, the extraction unit is further configured to:
the c-th fusion feature information of the c-th video among the M videos included in the video set is obtained through the following steps, wherein c is a positive integer greater than or equal to 1 and less than or equal to M:
extracting a visual feature vector of the c-th video, wherein the visual feature vector is used for representing the features of the c-th video under the self visual mode;
extracting an audio feature vector of the c-th video, wherein the audio feature vector is used for representing the feature of the c-th video under the audio mode of the c-th video;
extracting a text feature vector of the c-th video, wherein the text feature vector is used for representing the feature of the c-th video under the own text mode;
and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information.
In an exemplary embodiment, the extraction unit is further configured to:
sampling the c-th video at a first preset time interval to obtain k_e frame pictures corresponding to the c-th video;

inputting each of the k_e frame pictures into an image feature extraction model to obtain k_e picture feature vectors output by the image feature extraction model;

generating the visual feature vector of the c-th video according to the k_e picture feature vectors.
In an exemplary embodiment, the extraction unit is further configured to:
extracting audio mode data of the c-th video;
dividing the audio mode data into k_a segments of sub-audio mode data at a second preset time interval along the time dimension;

inputting each of the k_a segments of sub-audio mode data into an audio feature extraction model to obtain k_a audio segment feature vectors output by the audio feature extraction model;

generating the audio feature vector of the c-th video according to the k_a audio segment feature vectors.
In an exemplary embodiment, the extraction unit is further configured to:
extracting k_t video texts corresponding to the c-th video from the text associated with the c-th video;

inputting each of the k_t video texts into a text feature extraction model to obtain k_t text segment feature vectors output by the text feature extraction model;

generating the text feature vector of the c-th video according to the k_t text segment feature vectors.
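The three extraction branches can be illustrated with the following sketch, which assumes generic frame, audio and text encoders and mean pooling of the per-segment features into one vector per modality (the pooling choice and the encoder names are assumptions, not the patent's specific extraction models):

import numpy as np

def video_modal_features(frames, audio_segments, texts, img_model, audio_model, text_model):
    """Per-video visual, audio and text feature vectors from k_e frames, k_a segments, k_t texts."""
    visual = np.mean([img_model(f) for f in frames], axis=0)           # from k_e picture feature vectors
    audio = np.mean([audio_model(a) for a in audio_segments], axis=0)  # from k_a audio segment feature vectors
    text = np.mean([text_model(t) for t in texts], axis=0)             # from k_t text segment feature vectors
    return visual, audio, text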
In an exemplary embodiment, the extraction unit is further configured to:
the weight parameters of a capsule network fusion model are subjected to D rounds of adjustment by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a target capsule network fusion model, wherein the weight parameters comprise a visual weight parameter, an audio weight parameter and a text weight parameter, the visual weight parameter is used for indicating the weight of the features of the video in its own visual mode in the process of feature fusion by the capsule network fusion model, the audio weight parameter is used for indicating the weight of the features of the video in its own audio mode in the process of feature fusion by the capsule network fusion model, the text weight parameter is used for indicating the weight of the features of the video in its own text mode in the process of feature fusion by the capsule network fusion model, the capsule network fusion model used for the d-th round is the capsule network fusion model obtained after the (d-1)-th round of adjustment of the weight parameters, D is a preset positive integer greater than or equal to 1, d is a positive integer less than or equal to D, and when d takes the value 1, the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not been adjusted;
And fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector.
In an exemplary embodiment, the extraction unit is further configured to:
the method comprises the following steps of carrying out d-th round adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video:
determining a capsule network fusion model obtained after the d-1 th round of adjustment of the weight parameters is completed as a capsule network fusion model used for the d round of fusion;
inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the capsule network fusion model used for the d-th round of fusion to obtain a reference fusion feature vector output by the capsule network fusion model used for the d-th round of fusion;
and adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the d+1-th round of fusion.
In an exemplary embodiment, the extraction unit is further configured to:
the capsule network fusion model for the d-th round fusion uses the following steps to fuse the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a reference fusion feature vector output by the capsule network fusion model for the d-th round fusion:
respectively carrying out linear transformation on the visual feature vector, the audio feature vector and the text feature vector to obtain corresponding linear visual feature vector, linear audio feature vector and linear text feature vector;
performing accumulation and calculation on a weighted visual feature vector, a weighted audio feature vector and a weighted text feature vector to obtain a weighted fusion feature vector, wherein the weighted visual feature vector is obtained by performing outer product operation on a visual weight parameter and a linear visual feature vector of a capsule network fusion model used by the d-th round fusion, the weighted audio feature vector is obtained by performing outer product operation on an audio weight parameter and a linear audio feature vector of a capsule network fusion model used by the d-th round fusion, and the weighted text feature vector is obtained by performing outer product operation on a text weight parameter and a linear text feature vector of a capsule network fusion model used by the d-th round fusion;
And converting the weighted fusion feature vector into the reference fusion feature vector by adopting a nonlinear activation function.
In an exemplary embodiment, the extraction unit is further configured to:
obtaining adjustment parameters, wherein the adjustment parameters comprise visual adjustment parameters, audio adjustment parameters and text adjustment parameters, the visual adjustment parameters are obtained by performing outer product operation on the linear visual feature vector and the weighted fusion feature vector, the audio adjustment parameters are obtained by performing outer product operation on the linear audio feature vector and the weighted fusion feature vector, and the text adjustment parameters are obtained by performing outer product operation on the linear text feature vector and the weighted fusion feature vector;
and performing addition operation on the visual weight parameters of the capsule network fusion model used in the d-th round of fusion and the visual adjustment parameters to obtain the visual weight parameters of the capsule network fusion model used in the d+1-th round of fusion, performing addition operation on the audio weight parameters of the capsule network fusion model used in the d-th round of fusion and the audio adjustment parameters to obtain the audio weight parameters of the capsule network fusion model used in the d+1-th round of fusion, and performing addition operation on the text weight parameters of the capsule network fusion model used in the d-th round of fusion and the text adjustment parameters to obtain the text weight parameters of the capsule network fusion model used in the d+1-th round of fusion.
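One round of this routing-style fusion can be sketched as follows, under the interpretation that the outer product between a scalar weight parameter and a feature vector reduces to a scaling, and that the nonlinear activation is the capsule squash function; this is an illustration of the described loop, not the patent's exact equations:

import numpy as np

def squash(s):
    """Nonlinear activation that keeps the vector direction and bounds its length below 1."""
    norm2 = float(np.dot(s, s))
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)

def fuse_round(vectors, W, weights):
    """One fusion round over the visual/audio/text feature vectors of one video.

    vectors: dict of per-modality feature vectors; W: per-modality linear transforms;
    weights: per-modality weight parameters adjusted from round to round.
    """
    linear = {m: W[m] @ v for m, v in vectors.items()}            # linear visual/audio/text vectors
    weighted_sum = sum(weights[m] * linear[m] for m in linear)    # accumulation of the weighted vectors
    fused = squash(weighted_sum)                                  # reference fusion feature vector
    new_weights = {m: weights[m] + float(np.dot(linear[m], fused)) for m in linear}  # adjustment parameters
    return fused, new_weights

After D such rounds, the fused vector produced by the final round would serve as the c-th target fusion feature vector.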
In one exemplary embodiment, the determining module includes:
a first determining unit, configured to determine a similarity between each piece of video feature information in the video feature information set and the user feature information;
and the second determining unit is used for determining the video corresponding to the video characteristic information with the similarity larger than the target similarity threshold as the target video.
In an exemplary embodiment, the apparatus further comprises:
the second obtaining module is used for obtaining nth user characteristic information in the user characteristic information set before the target video to be recommended to the target user is determined in the video set according to the user characteristic information of the target user and the video characteristic information set, wherein the nth user characteristic information is used for indicating the characteristics of videos preferred by the nth user in the user set;
acquiring an n-th video viewing sequence corresponding to the n-th user, wherein the video viewing sequence records the playing order of the videos already played by the corresponding user, the user set comprises N users, and n is a positive integer greater than or equal to 1 and less than or equal to N;
Acquiring the video characteristic information corresponding to each video in the nth video watching sequence from the video characteristic information set to obtain an nth reference video characteristic information set;
and merging all the reference video feature information in the nth reference video feature information set into a feature vector to obtain the nth user feature information.
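A hedged sketch of this merge, assuming the merging into a feature vector is realized as mean pooling over the viewed videos' feature vectors (the patent does not fix the merge operator here):

import numpy as np

def user_feature(viewing_sequence_ids, video_feature_set):
    """Merge the feature vectors of the videos a user has viewed into one user feature vector."""
    refs = [video_feature_set[vid] for vid in viewing_sequence_ids]  # n-th reference video feature set
    return np.mean(refs, axis=0)                                     # assumed merge: mean pooling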
In an exemplary embodiment, the adding unit is further configured to:
inputting a target video adjacency matrix and the fusion characteristic information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relationship characteristics among video vertices on video types, video labels and watching users and the association degree among videos connected by each relationship characteristic, and the target fusion network is used for updating each fusion characteristic information in the input fusion characteristic information set into corresponding video characteristic information according to the relationship characteristics represented by the target video adjacency matrix to obtain the video characteristic information set.
In one exemplary embodiment, the target fusion network includes:
The input layer and L graph capsule convolution layers, wherein the first of the L graph capsule convolution layers comprises basic video vertex capsules, the remaining graph capsule convolution layers of the L graph capsule convolution layers comprise advanced video vertex capsules, and the L-th of the L graph capsule convolution layers further comprises a final video vertex capsule; the basic video vertex capsules are used for executing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix to obtain the set of convolution feature vectors output by the first graph capsule convolution layer, the advanced video vertex capsules in the l-th graph capsule convolution layer are used for executing a second convolution operation on the received convolution feature vectors according to the target video adjacency matrix to obtain the convolution feature vectors input to the advanced video vertex capsules in the (l+1)-th graph capsule convolution layer, and the final video vertex capsule is used for executing a third convolution operation on the convolution feature vectors output by the advanced video vertex capsules in the L-th graph capsule convolution layer, wherein the third convolution operation is used for aggregating the convolution feature vectors and outputting the video feature information set.
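An illustrative sketch of one adjacency-driven aggregation step of the kind performed by these capsule convolution layers, simplified to a normalized graph-convolution with the squash nonlinearity (offered as an interpretation, not the patent's exact graph capsule operator):

import numpy as np

def graph_capsule_layer(A, H, W):
    """One layer: aggregate neighbouring video capsules weighted by the video adjacency matrix A.

    A: M x M target video adjacency matrix; H: M x d input capsule features; W: d x d' transform.
    """
    deg = A.sum(axis=1, keepdims=True) + 1e-9             # degree normalisation of the adjacency
    out = (A / deg) @ H @ W                               # adjacency-weighted aggregation + transform
    norms = np.linalg.norm(out, axis=1, keepdims=True) + 1e-9
    return (norms ** 2 / (1 + norms ** 2)) * out / norms  # squash applied per video capsule (row-wise)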
In an exemplary embodiment, the apparatus further comprises:
the third acquisition module is used for acquiring an initial fusion network before the target video adjacency matrix and the fusion characteristic information set are input into a target fusion network to obtain the video characteristic information set output by the target fusion network;
the first training module is used for performing X-round video classification training on the initial fusion network to obtain a target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, and the accuracy rate of the target pre-training fusion network on video classification is greater than the target accuracy rate;
and the second training module is used for executing Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network, wherein Y is a positive integer greater than or equal to 1.
In one exemplary embodiment, the first training module includes:
the first training unit is used for executing the x-th round of the X rounds of video classification training on the initial fusion network through the following steps:

in the x-th round of video classification training, classifying the video samples marked with the video type labels by using the pre-training fusion network obtained by the (x-1)-th round of video classification training to obtain classification results;
Generating a first target loss value according to the classification result and the video type label;
and under the condition that the first target loss value does not meet a first preset convergence condition, adjusting the network parameters of the pre-training fusion network used by the x-th round and determining the adjusted pre-training fusion network as the pre-training fusion network used by the (x+1)-th round, and under the condition that the first target loss value meets the first preset convergence condition, determining the pre-training fusion network used by the x-th round as the target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, x is a positive integer greater than or equal to 1 and less than or equal to X, and when x takes the value 1, the pre-training fusion network used by the x-th round is the initial fusion network.
In one exemplary embodiment, the second training module includes:
the second training unit is used for executing the y-th round of the Y rounds of video recommendation training on the target pre-training fusion network through the following steps:

in the y-th round of video recommendation training, using the reference fusion network obtained by the (y-1)-th round of video recommendation training to generate the (S+1)-th predicted video of a video viewing sequence sample based on the first S videos of the video viewing sequence sample, wherein the video viewing sequence sample is a known video viewing sequence used for recording the playing order of videos that the corresponding user has played in the video set, the video viewing sequence sample comprises W videos, S is a positive integer greater than or equal to 1 and less than or equal to W, W is a positive integer greater than or equal to 1, and when y takes the value 1, the reference fusion network used by the y-th round is the target pre-training fusion network;
Generating a second target loss value according to the S+1st predicted video and the S+1st real video of the video watching sequence sample;
and under the condition that the second target loss value does not meet a second preset convergence condition, adjusting the network parameters of the reference fusion network used by the y-th round of video recommendation training and determining the adjusted reference fusion network as the reference fusion network used by the (y+1)-th round of video recommendation training, and under the condition that the second target loss value meets the second preset convergence condition, determining the reference fusion network used by the y-th round as the target fusion network.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or various other media capable of storing a computer program.
Embodiments of the present application also provide an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.
Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and the exemplary implementation, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principles of the present application should be included in the protection scope of the present application.

Claims (19)

1. A method for recommending video, comprising:
acquiring a video feature information set, wherein the video feature information set comprises video feature information corresponding to each video in the video set, the video feature information is used for representing multi-modal fusion features of the corresponding video and relationship features between the corresponding video and other videos in the video set, the relationship features comprise features of the videos in multiple video watching dimensions, and the multi-modal fusion features comprise features of the video itself in multiple modalities;
under the condition that video is recommended to a target user in a user set, determining a target video to be recommended to the target user in the video set according to user characteristic information of the target user and the video characteristic information set;
Recommending the target video to the target user;
wherein, the obtaining the video characteristic information set includes:
extracting features of semantic edges from a video multi-modal semantic graph of the video set as the relationship features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for displaying the relationship features between videos in the video set in the form of video vertices and the semantic edges, each video vertex represents one video, and each semantic edge represents one relationship feature;
adding the relation features to the fusion feature information set to obtain the video feature information set;
the extracting features of semantic edges from the multi-modal semantic graphs of videos in the video set as the relationship features, and obtaining the multi-modal fusion features of each video in the video set to obtain a fusion feature information set includes:
converting the video multi-mode semantic graph into a target video adjacency matrix, and obtaining the relation features according to the similarity degree of features between any two video vertexes in the video multi-mode semantic graph represented by the target video adjacency matrix in a plurality of video watching dimensions;
Fusing the characteristics of each video in the video set on a plurality of modes into fused characteristic information to obtain a fused characteristic information set, wherein the fused characteristic information is used for representing the multi-mode fused characteristics of the corresponding video;
the fusing the features of each video in the video set on a plurality of modes into fused feature information comprises the following steps:
the c fusion characteristic information of a c-th video in M videos included in the video set is obtained through the following steps, wherein c is a positive integer greater than or equal to 1 and less than or equal to M:
extracting a visual feature vector of the c-th video, wherein the visual feature vector is used for representing the features of the c-th video under the self visual mode;
extracting an audio feature vector of the c-th video, wherein the audio feature vector is used for representing the feature of the c-th video under the audio mode of the c-th video;
extracting a text feature vector of the c-th video, wherein the text feature vector is used for representing the feature of the c-th video under the own text mode;
fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information;
The fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector includes:
subjecting the weight parameters of a capsule network fusion model to D rounds of adjustment by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a target capsule network fusion model, wherein the weight parameters comprise a visual weight parameter, an audio weight parameter and a text weight parameter, the visual weight parameter is used for indicating the weight of the features of the video in its own visual mode in the process of feature fusion by the capsule network fusion model, the audio weight parameter is used for indicating the weight of the features of the video in its own audio mode in the process of feature fusion by the capsule network fusion model, the text weight parameter is used for indicating the weight of the features of the video in its own text mode in the process of feature fusion by the capsule network fusion model, the capsule network fusion model used for the d-th round is the capsule network fusion model obtained after the (d-1)-th round of adjustment of the weight parameters, D is a preset positive integer greater than or equal to 1, d is a positive integer less than or equal to D, and when d takes the value 1, the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not been adjusted;
Fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector;
the adding the relationship feature to the fusion feature information set to obtain the video feature information set includes:
inputting a target video adjacency matrix and the fusion characteristic information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relation characteristics among video vertices on video types, video labels and watching users and the association degree among videos connected by each relation characteristic, and the target fusion network is used for updating each fusion characteristic information in the input fusion characteristic information set into corresponding video characteristic information according to the relation characteristics represented by the target video adjacency matrix to obtain the video characteristic information set;
wherein the target fusion network comprises:

an input layer and L graph capsule convolution layers, wherein the first of the L graph capsule convolution layers comprises basic video vertex capsules, the remaining graph capsule convolution layers of the L graph capsule convolution layers comprise advanced video vertex capsules, and the L-th of the L graph capsule convolution layers further comprises a final video vertex capsule; the basic video vertex capsules are used for executing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix to obtain the set of convolution fusion feature vectors output by the first graph capsule convolution layer, the advanced video vertex capsules in the l-th graph capsule convolution layer are used for executing a second convolution operation on the received convolution fusion feature vectors according to the target video adjacency matrix to obtain the convolution fusion feature vectors input to the advanced video vertex capsules in the (l+1)-th graph capsule convolution layer, and the final video vertex capsule is used for executing a third convolution operation on the convolution fusion feature vectors output by the advanced video vertex capsules in the L-th graph capsule convolution layer, wherein the third convolution operation is used for aggregating the convolution fusion feature vectors and outputting the video feature information set.
2. The method of claim 1, wherein said converting the video multimodal semantic graph into a target video adjacency matrix comprises:
Acquiring a transition probability matrix corresponding to the video multi-mode semantic graph, wherein the transition probability matrix is used for indicating the transition probability of a random walker from one video vertex to each adjacent video vertex in the process of using a random walk construction algorithm to walk the video multi-mode semantic graph;
taking the i-th of the M video vertices in the video multi-modal semantic graph as a root vertex, taking the root vertex as the starting point of the random walk construction algorithm, and expanding random walks of a preset path length P times according to a meta-path set, a preset restart probability and the transition probability matrix, to obtain the Q long walk paths corresponding to the i-th video vertex, wherein the Q long walk paths form the context of the i-th video vertex, the meta-path set comprises meta-paths representing the relationship features between each video in the video set and the other videos, the restart probability indicates the probability that each step of the random walk jumps back to the starting point during each random walk, i is a positive integer greater than or equal to 1 and less than or equal to M, and Q is a positive integer less than or equal to P;

sampling the Q long walk paths in sequence with a preset window size to obtain the i-th vertex pair list of the i-th video vertex, wherein the i-th vertex pair list records, for each sampling, the pair of video vertices located at the two ends of the sampling window, and the preset window size is greater than 2 and less than the preset path length;

generating the target video adjacency matrix according to the M vertex pair lists corresponding to the M video vertices.
3. The method of claim 2, wherein the taking the root vertex as the starting point of the random walk construction algorithm and expanding random walks of a preset path length P times according to a meta-path set, a preset restart probability and the transition probability matrix comprises:

randomly taking, from the meta-path set, one meta-path that has not yet participated in the random walk as a walk meta-path, taking the root vertex as the starting point of the random walk construction algorithm, and expanding random walks of the preset path length P times according to the walk meta-path taken from the meta-path set, the preset restart probability and the transition probability matrix, until all meta-paths in the meta-path set have participated in the random walk, wherein the meta-paths in the meta-path set comprise: a first meta-path, a second meta-path, a third meta-path and a fourth meta-path, which respectively represent, in sequence, the same-type relation, the same-label relation, the same-viewing relation and the friend-viewing relation in the video multi-modal semantic graph, wherein the same-type relation represents that the types of 2 videos are the same, the same-label relation represents that the labels of 2 videos are the same, the same-viewing relation represents that 2 videos are watched by the same user in the user set, and the friend-viewing relation represents that 2 videos are watched together by 1 pair of friends in the user set.
4. The method of claim 2, wherein the generating the target video adjacency matrix according to the M vertex pair lists corresponding to the M video vertices comprises:

updating an initial context co-occurrence matrix according to the M vertex pair lists to obtain a target context co-occurrence matrix, wherein the element O_mq in the m-th row and q-th column of the target context co-occurrence matrix represents the number of times that the m-th video and the q-th video co-occur in the same context, the m-th video is the video corresponding to the m-th video vertex, the q-th video is the video corresponding to the q-th video vertex, the target context co-occurrence matrix is a symmetric square matrix of size M*M, and m and q are positive integers greater than or equal to 1 and less than or equal to M;
and generating the target video adjacency matrix according to the target context co-occurrence matrix.
5. The method of claim 4, wherein the updating an initial context co-occurrence matrix according to the M vertex pair lists to obtain a target context co-occurrence matrix comprises:

obtaining, from the M vertex pair lists, the number N_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex, wherein r and t are positive integers greater than or equal to 1 and less than or equal to M, and r is not equal to t;

increasing the values of the elements O_rt and O_tr in the initial context co-occurrence matrix by N_rt respectively to obtain the target context co-occurrence matrix, wherein all elements of the initial context co-occurrence matrix are 0.
6. The method of claim 1, wherein the extracting a visual feature vector of the c-th video comprises:

sampling the c-th video at a first preset time interval to obtain the k_e frame pictures corresponding to the c-th video;

inputting each of the k_e frame pictures into an image feature extraction model to obtain the k_e picture feature vectors output by the image feature extraction model;

generating the visual feature vector of the c-th video according to the k_e picture feature vectors.
7. The method of claim 1, wherein the extracting an audio feature vector of the c-th video comprises:

extracting the audio mode data of the c-th video;

dividing the audio mode data into k_a segments of sub-audio mode data at a second preset time interval along the time dimension;

inputting each of the k_a segments of sub-audio mode data into an audio feature extraction model to obtain the k_a audio segment feature vectors output by the audio feature extraction model;

generating the audio feature vector of the c-th video according to the k_a audio segment feature vectors.
8. The method of claim 1, wherein the extracting a text feature vector of the c-th video comprises:

extracting the k_t video texts corresponding to the c-th video from the text associated with the c-th video;

inputting each of the k_t video texts into a text feature extraction model to obtain the k_t text segment feature vectors output by the text feature extraction model;

generating the text feature vector of the c-th video according to the k_t text segment feature vectors.
9. The method of claim 1, wherein the subjecting the weight parameters of the capsule network fusion model to D rounds of adjustment by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video comprises:

performing the d-th round of adjustment on the weight parameters of the capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video through the following steps:

determining the capsule network fusion model obtained after the (d-1)-th round of adjustment of the weight parameters is completed as the capsule network fusion model used for the d-th round of fusion;

inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the capsule network fusion model used for the d-th round of fusion to obtain the reference fusion feature vector output by the capsule network fusion model used for the d-th round of fusion;

adjusting the weight parameters of the capsule network fusion model used for the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used for the (d+1)-th round of fusion.
10. The method of claim 9, wherein the inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the capsule network fusion model used for the d-th round of fusion to obtain the reference fusion feature vector output by the capsule network fusion model used for the d-th round of fusion comprises:

the capsule network fusion model used for the d-th round of fusion fuses the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video through the following steps to obtain the reference fusion feature vector output by the capsule network fusion model used for the d-th round of fusion:
respectively carrying out linear transformation on the visual feature vector, the audio feature vector and the text feature vector to obtain corresponding linear visual feature vector, linear audio feature vector and linear text feature vector;
performing accumulation and summation on a weighted visual feature vector, a weighted audio feature vector and a weighted text feature vector to obtain a weighted fusion feature vector, wherein the weighted visual feature vector is obtained by performing an outer product operation on the visual weight parameter of the capsule network fusion model used for the d-th round of fusion and the linear visual feature vector, the weighted audio feature vector is obtained by performing an outer product operation on the audio weight parameter of the capsule network fusion model used for the d-th round of fusion and the linear audio feature vector, and the weighted text feature vector is obtained by performing an outer product operation on the text weight parameter of the capsule network fusion model used for the d-th round of fusion and the linear text feature vector;
and converting the weighted fusion feature vector into the reference fusion feature vector by adopting a nonlinear activation function.
11. The method of claim 10, wherein the adjusting the weight parameters of the capsule network fusion model used for the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used for the (d+1)-th round of fusion comprises:
obtaining adjustment parameters, wherein the adjustment parameters comprise visual adjustment parameters, audio adjustment parameters and text adjustment parameters, the visual adjustment parameters are obtained by performing outer product operation on the linear visual feature vector and the weighted fusion feature vector, the audio adjustment parameters are obtained by performing outer product operation on the linear audio feature vector and the weighted fusion feature vector, and the text adjustment parameters are obtained by performing outer product operation on the linear text feature vector and the weighted fusion feature vector;
performing an addition operation on the visual weight parameter of the capsule network fusion model used for the d-th round of fusion and the visual adjustment parameter to obtain the visual weight parameter of the capsule network fusion model used for the (d+1)-th round of fusion, performing an addition operation on the audio weight parameter of the capsule network fusion model used for the d-th round of fusion and the audio adjustment parameter to obtain the audio weight parameter of the capsule network fusion model used for the (d+1)-th round of fusion, and performing an addition operation on the text weight parameter of the capsule network fusion model used for the d-th round of fusion and the text adjustment parameter to obtain the text weight parameter of the capsule network fusion model used for the (d+1)-th round of fusion.
12. The method of claim 1, wherein the determining a target video in the video set to be recommended to the target user based on the user characteristic information of the target user and the video characteristic information set comprises:
determining the similarity between each piece of video characteristic information in the video characteristic information set and the user characteristic information;
and determining the video corresponding to the video characteristic information with the similarity larger than the target similarity threshold as the target video.
13. The method of claim 1, wherein prior to the determining a target video in the video set to be recommended to the target user based on the user characteristic information of the target user and the video characteristic information set, the method further comprises:
obtaining the n-th user feature information in the user feature information set, wherein the n-th user feature information is used for indicating the features of the videos preferred by the n-th user in the user set;

acquiring the n-th video viewing sequence corresponding to the n-th user, wherein the video viewing sequence records the playing order of the videos already played by the corresponding user, the user set comprises N users, and n is a positive integer greater than or equal to 1 and less than or equal to N;

acquiring the video feature information corresponding to each video in the n-th video viewing sequence from the video feature information set to obtain the n-th reference video feature information set;

merging all the reference video feature information in the n-th reference video feature information set into one feature vector to obtain the n-th user feature information.
14. The method of claim 1, wherein prior to said inputting the target video adjacency matrix and the set of fusion feature information into a target fusion network to obtain the set of video feature information output by the target fusion network, the method further comprises:
acquiring an initial fusion network;
performing X rounds of video classification training on the initial fusion network to obtain a target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, and the accuracy of the target pre-training fusion network on video classification is greater than a target accuracy;
performing Y rounds of video recommendation training on the target pre-training fusion network to obtain the target fusion network, wherein Y is a positive integer greater than or equal to 1.
15. The method of claim 14, wherein the performing X rounds of video classification training on the initial fusion network to obtain the target pre-training fusion network comprises:
performing the x-th round of the X rounds of video classification training on the initial fusion network through the following steps:
in the x-th round of video classification training, classifying video samples marked with video type labels by using the pre-training fusion network obtained through the (x-1)-th round of video classification training to obtain classification results;
generating a first target loss value according to the classification results and the video type labels;
adjusting network parameters of the pre-training fusion network used in the x-th round in the case that the first target loss value does not meet a first preset convergence condition, and determining the adjusted pre-training fusion network as the pre-training fusion network to be used in the (x+1)-th round; and determining the pre-training fusion network used in the x-th round as the target pre-training fusion network in the case that the first target loss value meets the first preset convergence condition, wherein X is a positive integer greater than or equal to 1, x is a positive integer greater than or equal to 1 and less than or equal to X, and when x is 1, the pre-training fusion network used in the x-th round is the initial fusion network.
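For illustration only (not part of the claims), a compact PyTorch sketch of one such classification pre-training loop; the nn.Sequential stand-in for the fusion network, the random features and labels, the loss threshold and the learning rate are all assumptions.

import torch
from torch import nn

# Stand-in "fusion network": any module mapping fused features to video-type logits.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
target_loss = 0.1                                   # assumed form of the first convergence condition

for x_round in range(1, 101):                       # at most X = 100 rounds (assumed)
    feats = torch.randn(16, 64)                     # placeholder fused feature vectors
    labels = torch.randint(0, 10, (16,))            # placeholder video type labels
    logits = model(feats)                           # classification results
    loss = criterion(logits, labels)                # first target loss value
    if loss.item() <= target_loss:                  # convergence met: keep current network
        break
    optimiser.zero_grad()
    loss.backward()                                 # otherwise adjust the network parameters
    optimiser.step()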
16. The method of claim 14, wherein the performing Y rounds of video recommendation training on the target pre-training fusion network to obtain the target fusion network comprises:
performing the y-th round of the Y rounds of video recommendation training on the target pre-training fusion network through the following steps:
in the y-th round of video recommendation training, generating the (S+1)-th predicted video of a video viewing sequence sample based on the first S videos of the video viewing sequence sample by using the reference fusion network obtained through the (y-1)-th round of training, wherein the video viewing sequence sample is a known video viewing sequence which records the playing order of the videos in the video set played by the corresponding user and comprises W videos, S is a positive integer greater than or equal to 1 and less than W, W is a positive integer greater than or equal to 1, and when y is 1, the reference fusion network used in the y-th round is the target pre-training fusion network;
generating a second target loss value according to the (S+1)-th predicted video and the (S+1)-th real video of the video viewing sequence sample;
adjusting network parameters of the reference fusion network used for the y-th round of video recommendation training in the case that the second target loss value does not meet a second preset convergence condition, and determining the adjusted reference fusion network as the reference fusion network to be used for the (y+1)-th round of video recommendation training; and determining the reference fusion network obtained through the y-th round of training as the target fusion network in the case that the second target loss value meets the second preset convergence condition.
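For illustration only, a PyTorch sketch of the next-video training step: encode the first S videos of a viewing sequence and score candidates for the (S+1)-th video. The embedding table, GRU encoder, linear head, batch of random indices, round limit and convergence threshold are all assumptions standing in for the unspecified reference fusion network.

import torch
from torch import nn

num_videos, emb_dim = 100, 32
video_emb = nn.Embedding(num_videos, emb_dim)            # placeholder video representations
encoder = nn.GRU(emb_dim, emb_dim, batch_first=True)     # assumed sequence encoder
head = nn.Linear(emb_dim, num_videos)                    # scores over candidate videos
params = list(video_emb.parameters()) + list(encoder.parameters()) + list(head.parameters())
optimiser = torch.optim.Adam(params, lr=1e-3)

sequence = torch.randint(0, num_videos, (1, 6))          # one viewing-sequence sample, W = 6
S = 4                                                    # use the first S videos as input
inputs, target = sequence[:, :S], sequence[:, S]         # target: the (S+1)-th real video

for y_round in range(50):                                # at most Y = 50 rounds (assumed)
    _, hidden = encoder(video_emb(inputs))               # encode the first S watched videos
    logits = head(hidden[-1])                            # predicted (S+1)-th video scores
    loss = nn.functional.cross_entropy(logits, target)   # second target loss value
    if loss.item() < 0.05:                               # assumed convergence condition
        break
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()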
17. A video recommendation device, comprising:
a first acquisition module, configured to acquire a video characteristic information set, wherein the video characteristic information set comprises video characteristic information corresponding to each video in the video set, the video characteristic information is used for representing multi-modal fusion features of the corresponding video and relationship features between the corresponding video and other videos in the video set, the relationship features comprise features of the videos in multiple video viewing dimensions, and the multi-modal fusion features comprise features of the videos in multiple modalities;
a determining module, configured to determine, in the case that videos are to be recommended to a target user in the user set, a target video in the video set to be recommended to the target user according to the user characteristic information of the target user and the video characteristic information set;
a recommending module, configured to recommend the target video to the target user;
wherein the first acquisition module comprises:
an extraction unit, configured to extract features of semantic edges from a video multi-modal semantic graph of the video set as the relationship features, and to acquire the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for presenting the relationship features between the videos in the video set in the form of video vertices and semantic edges, each video vertex represents one video, and each semantic edge represents one relationship feature;
an adding unit, configured to add the relationship features to the fusion feature information set to obtain the video characteristic information set;
wherein the extraction unit is further configured to:
convert the video multi-modal semantic graph into a target video adjacency matrix, and obtain the relationship features according to the degree of similarity, in multiple video viewing dimensions, between the features of any two video vertices in the video multi-modal semantic graph represented by the target video adjacency matrix;
fuse the features of each video in the video set in multiple modalities into fusion feature information to obtain the fusion feature information set, wherein the fusion feature information is used for representing the multi-modal fusion features of the corresponding video;
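For illustration only, a small Python sketch of turning semantic edges into an adjacency matrix; the (i, j, weight) edge format, the symmetric treatment and the weighting scheme are assumptions, since the claim only states that the graph is converted into a matrix encoding the degree of association.

import numpy as np

def build_adjacency(num_videos, semantic_edges):
    # semantic_edges: (i, j, weight) triples, one per semantic edge between video
    # vertices; the weight stands for the degree of association (shared type,
    # shared tag, common viewers, ...) -- the exact weighting scheme is assumed.
    adj = np.zeros((num_videos, num_videos))
    for i, j, w in semantic_edges:
        adj[i, j] += w
        adj[j, i] += w        # treat relationship features as symmetric (assumption)
    return adj

edges = [(0, 1, 1.0), (1, 2, 0.5), (0, 2, 0.2)]          # hypothetical semantic edges
A = build_adjacency(3, edges)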
wherein the extraction unit is further configured to:
obtain the c-th fusion feature information of the c-th video among the M videos included in the video set through the following steps, wherein c is a positive integer greater than or equal to 1 and less than or equal to M:
extracting a visual feature vector of the c-th video, wherein the visual feature vector is used for representing the features of the c-th video in its own visual modality;
extracting an audio feature vector of the c-th video, wherein the audio feature vector is used for representing the features of the c-th video in its own audio modality;
extracting a text feature vector of the c-th video, wherein the text feature vector is used for representing the features of the c-th video in its own text modality;
fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information;
wherein the extraction unit is further configured to:
perform D rounds of adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a target capsule network fusion model, wherein the weight parameters comprise a visual weight parameter, an audio weight parameter and a text weight parameter; the visual weight parameter is used for indicating the weight given to the features of the video in its own visual modality when the capsule network fusion model fuses features, the audio weight parameter is used for indicating the weight given to the features of the video in its own audio modality when the capsule network fusion model fuses features, and the text weight parameter is used for indicating the weight given to the features of the video in its own text modality when the capsule network fusion model fuses features; the capsule network fusion model used for the d-th round is the capsule network fusion model obtained after the (d-1)-th round of weight parameter adjustment, D is a preset positive integer greater than or equal to 1, d is a positive integer less than or equal to D, and when d is 1, the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not been adjusted;
fuse the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector;
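For illustration only, a Python sketch of the step that precedes and follows this fusion: raw visual, audio and text features of different sizes are mapped by linear projections into a shared space, and the weights obtained after the D rounds of adjustment then combine them. The raw dimensionalities, the random projection matrices standing in for learned layers, and the final weight values are all hypothetical.

import numpy as np

rng = np.random.default_rng(1)
# Raw modality features usually have different sizes; linear projections (random
# matrices here, standing in for learned layers) map them into one shared space.
visual_raw, audio_raw, text_raw = rng.normal(size=2048), rng.normal(size=128), rng.normal(size=768)
k = 64
W_v, W_a, W_t = (rng.normal(size=(k, d)) / np.sqrt(d) for d in (2048, 128, 768))

linear_visual = W_v @ visual_raw       # linear visual feature vector
linear_audio = W_a @ audio_raw         # linear audio feature vector
linear_text = W_t @ text_raw           # linear text feature vector

# With weight parameters fixed after the D rounds of adjustment (illustrative
# values here), the target model combines the three vectors into one.
final_weights = np.array([0.5, 0.3, 0.2])
target_fusion_vector = final_weights @ np.stack([linear_visual, linear_audio, linear_text])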
wherein the adding unit is further configured to:
input a target video adjacency matrix and the fusion feature information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relationship features between video vertices in terms of video type, video tag and viewing user, and the degree of association between the videos connected by each relationship feature, and the target fusion network is used for updating each piece of fusion feature information in the input fusion feature information set into the corresponding video characteristic information according to the relationship features represented by the target video adjacency matrix to obtain the video characteristic information set;
wherein the target fusion network comprises:
an input layer and L graph capsule convolution layers, wherein the first of the L graph capsule convolution layers comprises basic video vertex capsules, the remaining ones of the L graph capsule convolution layers comprise advanced video vertex capsules, and the L-th graph capsule convolution layer further comprises a final video vertex capsule; the basic video vertex capsules are used for performing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix to obtain a convolution fusion feature vector set output by the first graph capsule convolution layer; the advanced video vertex capsules in the l-th graph capsule convolution layer are used for performing a second convolution operation on the received convolution fusion feature vectors according to the target video adjacency matrix to obtain the convolution fusion feature vectors input to the advanced video vertex capsules in the (l+1)-th graph capsule convolution layer; and the final video vertex capsule is used for performing a third convolution operation on the convolution fusion feature vectors output by the advanced video vertex capsules in the L-th graph capsule convolution layer, wherein the third convolution operation is used for aggregating the convolution fusion feature vectors and outputting the video characteristic information set.
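For illustration only, a minimal Python sketch of a stack of graph-convolution layers with a capsule-style squash; it is a loose stand-in for the L-layer graph capsule network above, and the layer count, random layer weights, row-normalised adjacency and squash form are assumptions rather than the patented architecture.

import numpy as np

def graph_capsule_network(adj, fusion_feats, num_layers=3, seed=0):
    # Propagate per-video features over the (row-normalised) adjacency matrix for
    # num_layers rounds and squash each vertex capsule; the last layer's output
    # stands in for the aggregated video feature information set.
    rng = np.random.default_rng(seed)
    norm_adj = adj / (adj.sum(axis=1, keepdims=True) + 1e-8)
    h = fusion_feats                                      # input layer: fusion feature set
    for _ in range(num_layers):                           # basic + advanced capsule layers
        w = rng.normal(size=(h.shape[1], h.shape[1])) / np.sqrt(h.shape[1])
        h = norm_adj @ h @ w                              # convolution over the video graph
        norms = np.linalg.norm(h, axis=1, keepdims=True) + 1e-8
        h = (norms ** 2 / (1 + norms ** 2)) * h / norms   # capsule-style squash per vertex
    return h

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)   # toy 3-video adjacency
video_feature_set = graph_capsule_network(A, np.eye(3))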
18. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run performs the method of any one of claims 1 to 16.
19. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to perform the method of any of claims 1 to 16 by means of the computer program.