CN117112834A - Video recommendation method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN117112834A
Authority
CN
China
Prior art keywords
video
fusion
feature vector
target
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311384218.7A
Other languages
Chinese (zh)
Other versions
CN117112834B (en)
Inventor
胡克坤
董刚
曹其春
杨宏斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202311384218.7A priority Critical patent/CN117112834B/en
Publication of CN117112834A publication Critical patent/CN117112834A/en
Application granted granted Critical
Publication of CN117112834B publication Critical patent/CN117112834B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video recommendation method and device, a storage medium and an electronic device. The video recommendation method comprises the following steps: acquiring a video feature information set, wherein the video feature information set comprises video feature information corresponding to each video in a video set, the video feature information represents the multi-modal fusion features of the corresponding video and the relationship features between the corresponding video and the other videos in the video set, the relationship features comprise features of the videos in multiple video viewing dimensions, and the multi-modal fusion features comprise features of the video itself in multiple modalities; when a video is to be recommended to a target user in a user set, determining, in the video set, a target video to be recommended to the target user according to the user feature information of the target user and the video feature information set; and recommending the target video to the target user. By adopting this technical solution, problems in the related art, such as the low degree of matching between recommended videos and users, are solved.

Description

Video recommendation method and device, storage medium and electronic device
Technical Field
The embodiment of the application relates to the field of computers, in particular to a video recommendation method and device, a storage medium and an electronic device.
Background
With the rapid popularization of Internet technology, the fast development of multimedia technology and the rapid evolution of social networks, "video social networking" is spreading quickly as a new form of social interaction. Unlike traditional social networks, interaction in video social networks is no longer limited to text and pictures, but can also take place through posting videos. Users can watch, comment on and share videos on a video platform and communicate with video creators, which greatly enriches their cultural life. However, the increasingly diverse video types and the growing number of videos, while giving users more choices, also create a serious information overload problem. Helping users find the content they like among the videos, so as to meet their personalized needs, is a major challenge for the recommendation system of a video social platform.
Conventional video recommendation methods mainly use the interaction data between users and videos to implement recommendation; typical methods include collaborative-filtering-based methods, content-based methods and hybrid methods. They usually extract embedded representations of users and/or videos from auxiliary data through manual feature engineering, and then feed these representations into models such as factorization machines and gradient boosting machines to predict users' preferences for videos. Deep-learning-based video recommendation methods use the strong representation learning capability of neural networks to learn representations of users and/or items from item auxiliary information and then make predictions based on the similarity between users and videos. However, most of these methods only consider a specific type of video auxiliary information, and do not fully exploit the complete multi-modal auxiliary information of videos or the semantic relationships among videos, so the recommendation effect is unsatisfactory.
For the problems in the related art, such as the low degree of matching between recommended videos and users, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the application provides a video recommending method and device, a storage medium and an electronic device, which are used for at least solving the problems of low matching degree between a recommended video and a user and the like in the related technology.
According to an embodiment of the present application, there is provided a video recommendation method including:
acquiring a video feature information set, wherein the video feature information set comprises video feature information corresponding to each video in the video set, the video feature information is used for representing multi-modal fusion features of the corresponding video and relationship features between the corresponding video and other videos in the video set, the relationship features comprise features of the videos in multiple video watching dimensions, and the multi-modal fusion features comprise features of the video itself in multiple modalities;
under the condition that video is recommended to a target user in a user set, determining a target video to be recommended to the target user in the video set according to user characteristic information of the target user and the video characteristic information set;
Recommending the target video to the target user.
Optionally, the acquiring the video feature information set includes:
extracting features of semantic edges from a video multi-modal semantic graph of the video set as the relationship features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for displaying the relationship features between videos in the video set in the form of video vertices and the semantic edges, each video vertex represents one video, and each semantic edge represents one relationship feature;
and adding the relation features to the fusion feature information set to obtain the video feature information set.
Optionally, the extracting features of semantic edges from the multi-modal semantic graphs of videos in the video set as the relationship features, and obtaining the multi-modal fusion features of each video in the video set to obtain a fusion feature information set includes:
converting the video multi-mode semantic graph into a target video adjacency matrix, and obtaining the relation features according to the similarity degree of features between any two video vertexes in the video multi-mode semantic graph represented by the target video adjacency matrix in a plurality of video watching dimensions;
And fusing the characteristics of each video in the video set on a plurality of modes into fused characteristic information to obtain the fused characteristic information set, wherein the fused characteristic information is used for representing the multi-mode fused characteristics of the corresponding video.
Optionally, the converting the video multimodal semantic graph into the target video adjacency matrix includes:
acquiring a transition probability matrix corresponding to the video multi-mode semantic graph, wherein the transition probability matrix is used for indicating the transition probability of a random walker from one video vertex to each adjacent video vertex in the process of using a random walk construction algorithm to walk the video multi-mode semantic graph;
taking an ith video vertex in M video vertices in the video multi-mode semantic graph as a root vertex, taking the root vertex as a starting point of a random walk construction algorithm, expanding random walks with preset path length for P times according to a meta-path set, preset restart probability and the transition probability matrix to obtain Q long walk paths corresponding to the ith video vertex, wherein the Q long walk paths form a context of the ith video vertex, the meta-path set comprises meta-paths used for representing relation characteristics between each video and other videos in the video set, the restart probability is used for indicating the probability of each step of the random walk to jump back to the starting point in the process of each random walk, i is a positive integer greater than or equal to 1 and less than or equal to M, and Q is a positive integer less than or equal to P;
sampling the Q walk paths in sequence with a preset window size to obtain an i-th vertex-pair list of the i-th video vertex, wherein each sampling recorded in the i-th vertex-pair list is the pair of video vertices at the two ends of the sampling window, and the preset window size is greater than 2 and less than the preset path length;
and generating the target video adjacency matrix according to M vertex pair lists corresponding to the M video vertices.
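A minimal Python sketch of the walk-and-sample procedure described above is given below. For illustration it assumes a uniform transition probability in place of the explicit transition probability matrix, and all names (graph, meta_path, restart_p, window, etc.) are illustrative rather than taken from the application.

```python
import random

def random_walks_with_restart(graph, root, meta_path, num_walks, walk_len, restart_p):
    """Run `num_walks` walks of length `walk_len` from `root`, guided by a meta-path
    (a cyclic list of vertex types) and restarting with probability `restart_p` at
    every step. `graph[v]` maps a vertex to its typed neighbours: {type: [vertex, ...]}."""
    walks = []
    for _ in range(num_walks):
        walk, cur = [root], root
        for step in range(1, walk_len):
            if random.random() < restart_p:        # jump back to the starting point
                cur = root
            next_type = meta_path[step % len(meta_path)]
            candidates = graph[cur].get(next_type, [])
            if not candidates:                     # dead end: restart from the root
                cur = root
                continue
            cur = random.choice(candidates)        # assumed uniform transition probability
            walk.append(cur)
        walks.append(walk)
    return walks

def window_vertex_pairs(walks, window):
    """Sample the pair of vertices at the two ends of a sliding window from each walk."""
    pairs = []
    for walk in walks:
        for i in range(len(walk) - window + 1):
            pairs.append((walk[i], walk[i + window - 1]))
    return pairs
```

Each returned pair corresponds to one sampling window, matching the vertex-pair lists that are accumulated into the context co-occurrence matrix below.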
Optionally, the expanding the random walk with the preset path length for P times by taking the root vertex as a starting point of a random walk construction algorithm according to a meta-path set, a preset restart probability and the transition probability matrix includes:
randomly taking, from the meta-path set, a meta-path that has not yet participated in the random walks as the walk meta-path, taking the root vertex as the starting point of the random walk construction algorithm, and expanding P random walks of the preset path length according to the walk meta-path taken from the meta-path set, the preset restart probability and the transition probability matrix, until all meta-paths in the meta-path set have participated in the random walks, wherein the meta-paths in the meta-path set comprise: a first meta-path, a second meta-path, a third meta-path and a fourth meta-path, which respectively represent the same-type relationship, the same-tag relationship, the same-viewing relationship and the friend-viewing relationship in the video multi-modal semantic graph; the same-type relationship indicates that the types of two videos are the same, the same-tag relationship indicates that the tags of two videos are the same, the same-viewing relationship indicates that two videos are watched by the same user in the user set, and the friend-viewing relationship indicates that two videos are watched by a pair of friends in the user set.
Optionally, the generating the target video adjacency matrix according to M vertex pair lists corresponding to M video vertices includes:
updating an initial context co-occurrence matrix according to the M vertex-pair lists to obtain a target context co-occurrence matrix, wherein the element o_{mq} in the m-th row and q-th column of the target context co-occurrence matrix represents the number of times that the m-th video and the q-th video co-occur in the same context, the m-th video is the video corresponding to the m-th video vertex, the q-th video is the video corresponding to the q-th video vertex, the target context co-occurrence matrix is an M-by-M symmetric square matrix, and m and q are positive integers greater than or equal to 1 and less than or equal to M;
and generating the target video adjacency matrix according to the target context co-occurrence matrix.
Optionally, the updating the initial context co-occurrence matrix according to the M vertex pair lists to obtain the target context co-occurrence matrix includes:
obtaining, from the M vertex-pair lists, the number n_{rt} of vertex pairs consisting of the r-th video vertex and the t-th video vertex, wherein r and t are positive integers greater than or equal to 1 and less than or equal to M, and r is not equal to t;
and increasing the values of the elements o_{rt} and o_{tr} in the initial context co-occurrence matrix by n_{rt} respectively to obtain the target context co-occurrence matrix, wherein all elements of the initial context co-occurrence matrix are initially 0.
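The following sketch shows, under stated assumptions, how the M vertex-pair lists can be accumulated into the symmetric context co-occurrence matrix and turned into the target video adjacency matrix; the row normalisation used to derive edge weights is an assumption, since the application only states that the adjacency matrix is generated according to the co-occurrence matrix.

```python
import numpy as np

def build_cooccurrence_matrix(vertex_pair_lists, num_videos):
    """Count, for every pair of video vertices, how often they co-occur in the same
    context; the result is a symmetric M x M matrix initialised with zeros."""
    o = np.zeros((num_videos, num_videos), dtype=np.float64)
    for pairs in vertex_pair_lists:          # one vertex-pair list per root vertex
        for r, t in pairs:
            if r != t:
                o[r, t] += 1.0               # o_rt and o_tr are increased together,
                o[t, r] += 1.0               # keeping the matrix symmetric
    return o

def cooccurrence_to_adjacency(o, normalize=True):
    """Derive an adjacency matrix from the co-occurrence counts; row-normalising the
    counts into edge weights is an assumed, not claimed, choice."""
    if not normalize:
        return o
    row_sums = o.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0
    return o / row_sums
```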
Optionally, the fusing the features of each video in the video set on multiple modalities into fused feature information includes:
the c fusion characteristic information of a c-th video in M videos included in the video set is obtained through the following steps, wherein c is a positive integer greater than or equal to 1 and less than or equal to M:
extracting a visual feature vector of the c-th video, wherein the visual feature vector is used for representing the features of the c-th video in its own visual modality;
extracting an audio feature vector of the c-th video, wherein the audio feature vector is used for representing the features of the c-th video in its own audio modality;
extracting a text feature vector of the c-th video, wherein the text feature vector is used for representing the features of the c-th video in its own text modality;
and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information.
Optionally, the extracting the visual feature vector of the c-th video includes:
sampling the c-th video at a first preset time interval to obtain the frame pictures corresponding to the c-th video;
inputting each of the frame pictures into an image feature extraction model to obtain the corresponding picture feature vectors output by the image feature extraction model;
and generating the visual feature vector of the c-th video according to the picture feature vectors.
Optionally, the extracting the audio feature vector of the c-th video includes:
extracting audio mode data of the c-th video;
dividing the audio modality data into segments of sub-audio modality data according to a second preset time interval in the time dimension;
inputting each segment of the sub-audio modality data into an audio feature extraction model to obtain the corresponding audio segment feature vectors;
and generating the audio feature vector of the c-th video according to the audio segment feature vectors.
Optionally, the extracting the text feature vector of the c-th video includes:
extracting the video texts associated with the c-th video from the text associated with the c-th video;
inputting each of the video texts into a text feature extraction model to obtain the corresponding text segment feature vectors;
and generating the text feature vector of the c-th video according to the text segment feature vectors.
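A compact sketch of the three extraction steps above is shown below. The mean pooling used to turn per-frame, per-segment and per-text vectors into one vector per modality is an assumption (the application only says each modality feature vector is generated according to those vectors), and video.sample_frames, image_model, audio_model and text_model are hypothetical interfaces standing in for the concrete extraction models.

```python
import numpy as np

def extract_video_features(video, image_model, audio_model, text_model,
                           frame_interval_s=1.0, audio_segment_s=2.0):
    """Extract per-modality feature vectors for one video by sampling frames at a
    fixed time interval, slicing the audio track into segments, and collecting the
    associated texts, then averaging the per-item features."""
    # Visual modality: sample frames, encode each frame, pool into one vector.
    frames = video.sample_frames(every_seconds=frame_interval_s)
    visual_vec = np.mean([image_model(f) for f in frames], axis=0)

    # Audio modality: split the audio track along the time dimension and encode.
    segments = video.split_audio(every_seconds=audio_segment_s)
    audio_vec = np.mean([audio_model(s) for s in segments], axis=0)

    # Text modality: title, description, comments, subtitles, etc.
    texts = video.associated_texts()
    text_vec = np.mean([text_model(t) for t in texts], axis=0)

    return visual_vec, audio_vec, text_vec
```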
Optionally, the fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector includes:
performing D rounds of adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video, to obtain a target capsule network fusion model, wherein the weight parameters comprise a visual weight parameter, an audio weight parameter and a text weight parameter; the visual weight parameter indicates the weight of the features of the video in its own visual modality during feature fusion by the capsule network fusion model, the audio weight parameter indicates the weight of the features of the video in its own audio modality during that fusion, and the text weight parameter indicates the weight of the features of the video in its own text modality during that fusion; the capsule network fusion model used in the d-th round is the capsule network fusion model obtained after the (d-1)-th round of adjustment of the weight parameters is completed, D is a preset positive integer greater than or equal to 1, d is a positive integer less than or equal to D, and when d is 1, the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not been adjusted;
And fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector.
Optionally, the adjusting the weight parameter of the capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video includes:
the method comprises the following steps of carrying out d-th round adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video:
determining a capsule network fusion model obtained after the d-1 th round of adjustment of the weight parameters is completed as a capsule network fusion model used for the d round of fusion;
inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the capsule network fusion model used for the d-th round of fusion, to obtain a reference fusion feature vector output by the capsule network fusion model used for the d-th round of fusion;
and adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the d+1-th round of fusion.
Optionally, the inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to the d-th-round fusion used capsule network fusion model to obtain a reference fusion feature vector output by the d-th-round fusion used capsule network fusion model includes:
the capsule network fusion model for the d-th round fusion uses the following steps to fuse the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a reference fusion feature vector output by the capsule network fusion model for the d-th round fusion:
respectively carrying out linear transformation on the visual feature vector, the audio feature vector and the text feature vector to obtain corresponding linear visual feature vector, linear audio feature vector and linear text feature vector;
performing accumulation and calculation on a weighted visual feature vector, a weighted audio feature vector and a weighted text feature vector to obtain a weighted fusion feature vector, wherein the weighted visual feature vector is obtained by performing outer product operation on a visual weight parameter and a linear visual feature vector of a capsule network fusion model used by the d-th round fusion, the weighted audio feature vector is obtained by performing outer product operation on an audio weight parameter and a linear audio feature vector of a capsule network fusion model used by the d-th round fusion, and the weighted text feature vector is obtained by performing outer product operation on a text weight parameter and a linear text feature vector of a capsule network fusion model used by the d-th round fusion;
And converting the weighted fusion feature vector into the reference fusion feature vector by adopting a nonlinear activation function.
Optionally, the adjusting the weight parameter of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain a capsule network fusion model to be used in the d+1-th round of fusion includes:
obtaining adjustment parameters, wherein the adjustment parameters comprise visual adjustment parameters, audio adjustment parameters and text adjustment parameters, the visual adjustment parameters are obtained by performing outer product operation on the linear visual feature vector and the weighted fusion feature vector, the audio adjustment parameters are obtained by performing outer product operation on the linear audio feature vector and the weighted fusion feature vector, and the text adjustment parameters are obtained by performing outer product operation on the linear text feature vector and the weighted fusion feature vector;
and performing addition operation on the visual weight parameters of the capsule network fusion model used in the d-th round of fusion and the visual adjustment parameters to obtain the visual weight parameters of the capsule network fusion model used in the d+1-th round of fusion, performing addition operation on the audio weight parameters of the capsule network fusion model used in the d-th round of fusion and the audio adjustment parameters to obtain the audio weight parameters of the capsule network fusion model used in the d+1-th round of fusion, and performing addition operation on the text weight parameters of the capsule network fusion model used in the d-th round of fusion and the text adjustment parameters to obtain the text weight parameters of the capsule network fusion model used in the d+1-th round of fusion.
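Taken together, the D rounds of fusion and weight adjustment described above resemble routing-by-agreement in capsule networks. The sketch below illustrates that reading under several assumptions: the modality weights are treated as scalars, the claimed "outer product" operations are rendered as scalar-vector products and dot products, and the squash function stands in for the unspecified non-linear activation; W_linear and the round count are illustrative.

```python
import numpy as np

def squash(x, eps=1e-9):
    """Capsule-style non-linearity: keeps the direction, shrinks the norm into [0, 1)."""
    norm_sq = float(np.dot(x, x))
    return (norm_sq / (1.0 + norm_sq)) * x / (np.sqrt(norm_sq) + eps)

def capsule_fuse(visual, audio, text, W_linear, rounds=3):
    """Fuse three modality vectors into one target fusion vector over `rounds`
    adjustment rounds. W_linear maps each modality into a common dimension."""
    u = [W_linear[m] @ x for m, x in zip(("visual", "audio", "text"),
                                         (visual, audio, text))]   # linear transforms
    c = np.ones(3) / 3.0                 # initial (un-adjusted) modality weight parameters
    v = None
    for _ in range(rounds):
        s = sum(c_m * u_m for c_m, u_m in zip(c, u))   # weighted fusion feature vector
        v = squash(s)                                   # reference fusion feature vector
        # adjustment parameters: agreement of each linear vector with the weighted vector
        c = c + np.array([float(np.dot(u_m, s)) for u_m in u])
    return v
```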
Optionally, the determining, in the video set, the target video to be recommended to the target user according to the user feature information of the target user and the video feature information set includes:
determining the similarity between each piece of video characteristic information in the video characteristic information set and the user characteristic information;
and determining the video corresponding to the video characteristic information with the similarity larger than the target similarity threshold as the target video.
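A minimal sketch of this matching step, assuming cosine similarity as the similarity measure (the claim only requires a similarity above the target threshold) and excluding videos the user has already watched, in line with the problem statement later in the description:

```python
import numpy as np

def recommend(user_vec, video_vecs, watched_ids, sim_threshold=0.5):
    """Return (video_id, similarity) for unwatched videos whose feature vectors are
    more similar to the user feature vector than the target similarity threshold."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    scored = [(vid, cosine(user_vec, vec)) for vid, vec in video_vecs.items()
              if vid not in watched_ids]
    hits = [(vid, s) for vid, s in scored if s > sim_threshold]
    return sorted(hits, key=lambda x: x[1], reverse=True)
```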
Optionally, before determining the target video to be recommended to the target user in the video set according to the user feature information of the target user and the video feature information set, the method further includes:
acquiring nth user characteristic information in the user characteristic information set, wherein the nth user characteristic information is used for indicating the characteristics of videos preferred by the nth user in the user set;
acquiring an n-th video viewing sequence corresponding to the n-th user, wherein the video viewing sequence records the playing order of the videos that the corresponding user has played, the user set comprises N users, and n is a positive integer greater than or equal to 1 and less than or equal to N;
Acquiring the video characteristic information corresponding to each video in the nth video watching sequence from the video characteristic information set to obtain an nth reference video characteristic information set;
and merging all the reference video feature information in the nth reference video feature information set into a feature vector to obtain the nth user feature information.
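A short sketch of this merge step; mean pooling is an assumption, since the application only states that the reference video feature information is merged into a feature vector.

```python
import numpy as np

def user_feature_from_sequence(viewing_sequence, video_feature_set):
    """Merge the feature vectors of the videos a user has played (in play order)
    into a single user feature vector; here by averaging them."""
    ref_vectors = [video_feature_set[vid] for vid in viewing_sequence]
    return np.mean(ref_vectors, axis=0)
```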
Optionally, the adding the relationship feature to the fused feature information set to obtain the video feature information set includes:
inputting a target video adjacency matrix and the fusion characteristic information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relationship characteristics among video vertices on video types, video labels and watching users and the association degree among videos connected by each relationship characteristic, and the target fusion network is used for updating each fusion characteristic information in the input fusion characteristic information set into corresponding video characteristic information according to the relationship characteristics represented by the target video adjacency matrix to obtain the video characteristic information set.
Optionally, the target fusion network includes:
an input layer and L graph capsule convolution layers, wherein the first graph capsule convolution layer of the L graph capsule convolution layers comprises basic video vertex capsules, the remaining graph capsule convolution layers of the L graph capsule convolution layers comprise advanced video vertex capsules, and the L-th graph capsule convolution layer further comprises a final video vertex capsule; the basic video vertex capsules are used for performing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix, to obtain the convolution fusion feature vector set output by the first graph capsule convolution layer; the advanced video vertex capsules in each subsequent graph capsule convolution layer are used for performing a second convolution operation, according to the target video adjacency matrix, on the convolution fusion feature vectors received from the preceding graph capsule convolution layer; and the final video vertex capsule is used for performing a third convolution operation on the convolution fusion feature vectors output by the advanced video vertex capsules in the L-th graph capsule convolution layer, wherein the third convolution operation aggregates the convolution fusion feature vectors and outputs the video feature information set.
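As a rough structural analogue, the layer-by-layer propagation can be pictured as a graph-convolution pass over the target video adjacency matrix; the sketch below abstracts the capsule-specific routing inside each layer into a single learned transform and uses symmetric normalisation as an assumption, so it is an approximation of, not a substitute for, the graph capsule convolution layers described above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def graph_capsule_forward(adj, features, layer_weights):
    """Propagate the fusion feature vectors through L layers: each layer mixes every
    vertex's vector with its neighbours' vectors according to the adjacency matrix,
    then applies a learned transform; the last layer's output plays the role of the
    video feature information set."""
    # Symmetric normalisation of the adjacency matrix with self-loops added.
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt

    h = features                          # fused multi-modal vectors, shape (M, d)
    for W in layer_weights:               # one weight matrix per layer
        h = relu(a_norm @ h @ W)          # neighbourhood aggregation + transform
    return h
```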
Optionally, before the target video adjacency matrix and the fusion feature information set are input to a target fusion network to obtain the video feature information set output by the target fusion network, the method further includes:
acquiring an initial fusion network;
performing X-round video classification training on the initial fusion network to obtain a target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, and the accuracy of the target pre-training fusion network on video classification is greater than the target accuracy;
and performing Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network, wherein Y is a positive integer greater than or equal to 1.
Optionally, the performing the X-round video classification training on the initial fusion network to obtain a target pre-training fusion network includes:
performing the x-th round of video classification training among the X rounds of video classification training on the initial fusion network through the following steps:
in the x-th round of video classification training, classifying video samples marked with video type labels by using the pre-training fusion network obtained from the (x-1)-th round of video classification training, to obtain classification results;
Generating a first target loss value according to the classification result and the video type label;
and, when the first target loss value does not satisfy a first preset convergence condition, adjusting the network parameters of the pre-training fusion network used in the x-th round and determining the adjusted pre-training fusion network as the pre-training fusion network to be used in the (x+1)-th round; when the first target loss value satisfies the first preset convergence condition, determining the pre-training fusion network used in the x-th round as the target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, x is a positive integer greater than or equal to 1 and less than or equal to X, and when x is 1, the pre-training fusion network used in the x-th round is the initial fusion network.
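A schematic training loop for this pre-training phase is sketched below. network.classify, network.update_parameters and labelled_videos are hypothetical interfaces, and using cross-entropy with a loss threshold as the first preset convergence condition is an assumption.

```python
import numpy as np

def cross_entropy(logits, onehot_labels, eps=1e-9):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-(onehot_labels * np.log(probs + eps)).sum(axis=1).mean())

def pretrain_by_classification(network, labelled_videos, max_rounds, loss_threshold):
    """Round-by-round video-classification pre-training: stop once the first target
    loss value satisfies the convergence condition (here: falls below a threshold),
    otherwise adjust the network parameters and continue with the next round."""
    for _ in range(max_rounds):
        logits = network.classify(labelled_videos.samples)         # classification results
        loss = cross_entropy(logits, labelled_videos.type_labels)  # first target loss value
        if loss <= loss_threshold:       # first preset convergence condition (assumed form)
            return network               # target pre-training fusion network
        network.update_parameters(loss)  # the x-th round network becomes the (x+1)-th round network
    return network
```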
Optionally, the performing the Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network includes:
executing a Y-th round of video recommendation training in the Y-round of video recommendation training on the target pre-training fusion network through the following steps:
in the y-th round of video recommendation training, using the reference fusion network obtained from the (y-1)-th round of training to generate the (S+1)-th predicted video of a video viewing sequence sample based on the first S videos of the video viewing sequence sample, wherein the video viewing sequence sample is a known video viewing sequence that records the playing order of the videos in the video set that have been played by the corresponding user, the video viewing sequence sample comprises W videos, S is a positive integer greater than or equal to 1 and less than or equal to W, W is a positive integer greater than or equal to 1, and when y is 1, the reference fusion network used in the y-th round is the target pre-training fusion network;
Generating a second target loss value according to the S+1st predicted video and the S+1st real video of the video watching sequence sample;
and, when the second target loss value does not satisfy a second preset convergence condition, adjusting the network parameters of the reference fusion network used in the y-th round of video recommendation training and determining the adjusted reference fusion network as the reference fusion network to be used in the (y+1)-th round of video recommendation training; when the second target loss value satisfies the second preset convergence condition, determining the reference fusion network obtained by the y-th round of video recommendation training as the target fusion network.
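The fine-tuning phase can be sketched analogously; network.predict_next and network.loss are hypothetical interfaces, and summing the per-position prediction losses and thresholding them as the second preset convergence condition is an assumption.

```python
def finetune_by_next_video(network, viewing_sequences, max_rounds, loss_threshold):
    """Round-by-round video-recommendation fine-tuning: in every round, predict the
    (S+1)-th video of each viewing-sequence sample from its first S videos, compare
    with the real (S+1)-th video, and adjust the network until the second target
    loss value satisfies the convergence condition."""
    for _ in range(max_rounds):
        total_loss = 0.0
        for seq in viewing_sequences:                      # each sample is a known viewing sequence
            for s in range(1, len(seq)):
                predicted = network.predict_next(seq[:s])      # (S+1)-th predicted video
                total_loss += network.loss(predicted, seq[s])  # vs. (S+1)-th real video
        if total_loss <= loss_threshold:   # second preset convergence condition (assumed form)
            return network                 # target fusion network
        network.update_parameters(total_loss)
    return network
```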
According to another embodiment of the present application, there is also provided a video recommendation apparatus including:
a first acquisition module, configured to acquire a video feature information set, wherein the video feature information set comprises video feature information corresponding to each video in a video set, the video feature information represents the multi-modal fusion features of the corresponding video and the relationship features between the corresponding video and the other videos in the video set, the relationship features comprise features of the videos in multiple video viewing dimensions, and the multi-modal fusion features comprise features of the video itself in multiple modalities;
The determining module is used for determining target videos to be recommended to the target users in the video set according to the user characteristic information of the target users and the video characteristic information set under the condition that the videos are recommended to the target users in the user set;
and the recommending module is used for recommending the target video to the target user.
According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described video recommendation method when run.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the video recommendation method described above through the computer program.
In the embodiment of the application, when a video is required to be recommended to a target user in a user set, a video feature information set is acquired, wherein the video feature information set comprises video feature information corresponding to each video in the video set, each video feature information can represent a multi-mode fusion feature of the corresponding video and a relationship feature between the corresponding video and other videos in the video set, the relationship feature comprises features of the videos in multiple video watching dimensions, the multi-mode fusion feature comprises features of the videos themselves in multiple modes, and then a target video to be recommended to the target user is determined from the video set according to the user feature information and the video feature information set of the target user, and the target video is recommended to the target user. The target video recommended by the method refers to the multimodal fusion characteristics of the target video and the relation characteristics between the target video and other videos in the video set, so that the matching degree of the recommended target video and a target user is higher. By adopting the technical scheme, the problems of low matching degree between the recommended video and the user and the like in the related technology are solved, and the technical effect of improving the matching degree between the recommended video and the user is realized.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic view of a hardware environment of a video recommendation method according to an embodiment of the present application;
FIG. 2 is a flow chart of a video recommendation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a heterogeneous information network according to the related art;
FIG. 4 is a schematic diagram of a video recommendation system according to an embodiment of the present application;
FIG. 5 is a schematic diagram of feature vector fusion into fused feature information according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a target fusion network according to an embodiment of the application;
FIG. 7 is a schematic diagram of a video recommendation process according to an embodiment of the present application;
Fig. 8 is a block diagram of a video recommending apparatus according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method embodiments provided by the embodiments of the present application may be performed in a computer terminal, a device terminal, or a similar computing apparatus. Taking a computer terminal as an example, fig. 1 is a schematic diagram of a hardware environment of a video recommendation method according to an embodiment of the present application. As shown in fig. 1, the computer terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, and in one exemplary embodiment, may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the computer terminal described above. For example, the computer terminal may include more or fewer components than those shown in FIG. 1, or have a different configuration with functions equivalent to or more than those shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a video recommendation method in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a video recommendation method is provided and applied to the computer terminal, and fig. 2 is a flowchart of a video recommendation method according to an embodiment of the present application, as shown in fig. 2, where the flowchart includes the following steps:
step S202, a video feature information set is obtained, wherein the video feature information set comprises video feature information corresponding to each video in the video set, the video feature information is used for representing multi-mode fusion features of the corresponding video and relationship features between the corresponding video and other videos in the video set, the relationship features comprise features of the videos in multiple video watching dimensions, and the multi-mode fusion features comprise features of the video itself in multiple modes;
Step S204, under the condition that video is recommended to a target user in a user set, determining a target video to be recommended to the target user in the video set according to user characteristic information of the target user and the video characteristic information set;
step S206, recommending the target video to the target user.
Through the steps, when the video is required to be recommended to the target user in the user set, a video feature information set is acquired, wherein the video feature information set comprises video feature information corresponding to each video in the video set, each video feature information can represent a multi-mode fusion feature of the corresponding video and a relationship feature between the corresponding video and other videos in the video set, the relationship feature comprises features of the videos in multiple video watching dimensions, the multi-mode fusion feature comprises features of the videos on multiple modes, and further, the target video to be recommended to the target user is determined from the video set according to the user feature information and the video feature information set of the target user, and the target video is recommended to the target user. The target video recommended by the method refers to the multimodal fusion characteristics of the target video and the relation characteristics between the target video and other videos in the video set, so that the matching degree of the recommended target video and a target user is higher. By adopting the technical scheme, the problems of low matching degree between the recommended video and the user and the like in the related technology are solved, and the technical effect of improving the matching degree between the recommended video and the user is realized.
Before describing the video recommendation method of the present application in detail, the basic notation used in the present application and the problem to be solved are first described. Specifically, vectors are written as bold lowercase letters (e.g., x) and scalars as lowercase letters (e.g., x); matrices are denoted by uppercase letters (e.g., W) and sets by calligraphic uppercase letters (e.g., V).
Let U and V denote the user set and the video set, respectively. Each user u in U is associated with a video viewing sequence S_u drawn from V, where |S_u| denotes the number of videos user u has viewed. Each video v in V contains three modalities: a visual modality, an audio modality and a text modality (in the present application, a "modality" may be understood as a "media form"; e.g., the visual modality is the visual media form, the audio modality is the audio media form, and the text modality is the text media form). From these, a visual feature vector x_v, an audio feature vector x_a and a text feature vector x_t are obtained by different feature extraction methods (see below), each lying in R^{d_m}, where d_m denotes the feature dimension under the specific modality m. In addition, part of the videos have predefined category labels, represented by one-hot label vectors y in {0, 1}^C, where C denotes the total number of categories.
Formally, the technical problem solved by the present application comprises two closely related sub-problems: (1) the video multi-modal semantic fusion problem; and (2) the video recommendation problem. The former concerns how to fuse the visual feature vector x_v, the audio feature vector x_a and the text feature vector x_t of a video, together with the rich semantic relationships between different videos, into a single feature vector. The latter is to predict the video the user will watch at the next time step, i.e., the next-video recommendation problem. The inputs of the problem include: the user set U, the video set V, and the video viewing sequence S_u of each user u. The output is: the predicted video that user u is most likely to access at the next time step, where this video is one that user u has not accessed before, i.e., it does not belong to S_u.
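As a compact restatement of the above output condition, the next-video recommendation objective can be written as follows, where P(v | S_u) denotes the model's estimated probability that user u watches video v at the next time step (this probability notation is introduced here for illustration and is not taken from the application):

```latex
\hat{v}_u \;=\; \operatorname*{arg\,max}_{v \,\in\, V \setminus S_u} \; P\!\left(v \mid S_u\right),
\qquad \hat{v}_u \notin S_u .
```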
In the solution provided in step S202, take as an example a video set containing 100 videos and a user set containing 3 users. The video feature information set then includes 100 pieces of video feature information, each of which represents the multi-modal fusion features of the corresponding video and the semantic relationships between that video and the other videos in the video set. Similarly, the user feature information set includes 3 pieces of user feature information, each of which uses the video feature information of the videos played by the corresponding user to represent the features of the videos that user prefers. For example, suppose user A among the 3 users has played 20 of the 100 videos in the video set; these 20 videos (which can be understood as user A's video viewing sequence) correspond to 20 pieces of video feature information in the video feature information set, and these 20 pieces of video feature information can then be used to generate the user feature information of user A, representing the features of the videos user A prefers.
Optionally, in this embodiment, the multi-modal fusion feature is a fusion of a video's features in multiple modalities. For example, video A has a visual feature in the visual modality, an audio feature in the audio modality and a text feature in the text modality; the visual, audio and text features of video A are fused to obtain the multi-modal fusion feature of video A. Video A may also fuse features of modalities other than the above three to obtain the multi-modal fusion feature.
In one exemplary embodiment, the set of video feature information may be acquired, but is not limited to, by: extracting features of semantic edges from a video multi-modal semantic graph of the video set as the relationship features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for displaying the relationship features between videos in the video set in the form of video vertices and the semantic edges, each video vertex represents one video, and each semantic edge represents one relationship feature; and adding the relation features to the fusion feature information set to obtain the video feature information set.
Optionally, in this embodiment, before introducing the concept of the video multimodal semantic graph, the description of the related concepts and definitions related to the present application is needed:
definition 1: heterogeneous Information Networks (HIN) are a network structure that represents and handles different types of vertices and edges. Unlike conventional homogeneous information networks, where vertices and edges are homogeneous, vertices and edges in HIN may be of different types and attributes, often used to describe multiple entities and complex associations between entities in the real world. Generally, HIN is modeled as a quad. Wherein (1)>And->Representing a set of vertices and edges, respectively; />Andrepresenting vertex type mapping functions and edge type mapping functions, respectively. Here, the->And->Respectively representing a vertex type set and a side type set, and satisfying +.>
Definition 2: meta-Path (Meta-Path), for a givenOne length isMeta-path->Has the following form: />(can be abbreviated as->) Wherein->And->Respectively indicate->A particular vertex type and a particular edge type.
For a given setOne element path->May correspond to several specific paths, referred to as path instances.
Definition 3: the multimodal heterogeneous information network (Multimodal Heterogeneous Information Networks, MHIN) is a heterogeneous information network representing different types of objects and their semantic relationships in a video recommendation system, typically modeled as Wherein->Including all users->Video frequencyCategory->And tag->Thus there is,/>Consists of different semantic connection edges between four types of objects.
The core aim of the present application is to learn multi-modal, semantically enhanced video embeddings by exploiting the multi-modal features of videos and the semantic relationships between videos, so as to improve video recommendation accuracy. To this end, a group of meta-paths on the MHIN is designed to mine the rich semantic relationships between videos, and the MHIN is converted into a homogeneous video multi-modal semantic graph by removing the non-video vertices in the MHIN and taking the mined semantic relationships between videos as edges.
Definition 4: video multimodal semantic graph (VMG) is a homogenous network of information representing Video objects and their semantic relationships, typically modeled asWherein->Representing a video collection; />Representing similarity relation between videos based on meta paths; />Is a set of edge weights.
Based on the above definitions, the heterogeneous information network (HIN) in the related art is introduced as follows. For a video set and a user set, the relationships between videos, between videos and users, and between users are complex. When recommending videos for users, interaction data between users and videos is usually adopted as the recommendation basis; however, a video recommendation system also contains abundant auxiliary data, such as the social relationships on the user side, the multi-modal information (visual, audio and text) on the video side, and the categories and tags of videos. This auxiliary information is heterogeneous and complex, and can generally be characterized by a heterogeneous information network that models different types of entities and the associations between them.
Fig. 3 is a schematic diagram of a heterogeneous information network according to the related art. As shown in fig. 3, it includes four entity types, namely user (u), video (v), category (t) and tag (c), and four relationships among these entities, namely the social relationship, the viewing relationship, category attribution and tag labelling. More hidden relationships between videos can be mined from the existing relationships. For example, if two users both watch the same video, the second-order connectivity "user-video-user" explicitly captures the behavioral similarity between the two users, and the third-order connectivity "user-video-user-video" indicates that a user may access a video because a similar user has watched it before. Thus, the high-order connectivity contained in the user-item bipartite graph encodes rich semantic information as collaborative signals. According to the meta-path "video-tag-video", the same-tag relationship between two videos can be inferred; according to the meta-path "video-user-video", a friend viewing relationship between two videos can be inferred. Therefore, by designing meta-paths representing different semantics, more hidden relationships between different videos can be mined from the HIN, and videos can be recommended to users based on the similarity between videos, improving recommendation accuracy and user satisfaction. Such an approach is called HIN-based recommendation. However, this kind of method has two shortcomings: (1) it is based only on meta-paths and ignores the multi-modal information, such as visual, audio and text information, contained in the videos; (2) different modal information of a video contributes differently to the interest preferences of different users. These two shortcomings mean that HIN-based recommendation methods cannot accurately measure the similarity between videos or the preference of users for videos, so the recommendation effect is not ideal. For example, in fig. 3, under the meta-path "video-user-video", two videos commonly viewed by a pair of friends may be regarded as similar; however, if one is a romance film and the other a science-fiction action film, they differ obviously in visual, audio, text and other features, so their actual similarity is low. Likewise, a video watched by two users may attract one of them through the natural, sincere and profound emotional communication (text) between the protagonists, while the other prefers the visual impact (visual) brought by the natural scenery of the Greek Peloponnese region.
Unlike HIN-based recommendation methods, which focus only on the semantic information between videos, the present application holds that the multi-modal information of videos and the rich semantic relationships between videos are complementary when solving video recommendation, and that organically combining the two kinds of information can effectively improve video recommendation accuracy.
Therefore, the application further provides a video recommendation system based on a multi-modal semantic-enhanced graph capsule neural network (A Video Recommendation System based on Multi-modal Semantic-enhanced Graph Capsule Neural Network, MSGCN, hereinafter referred to as the video recommendation system). The video recommendation system can use the video recommendation method provided by the application to recommend videos to users. Fig. 4 is a schematic diagram of a video recommendation system according to an embodiment of the application. As shown in Fig. 4, the video recommendation system is composed of a multi-modal information preprocessing module, a multi-modal heterogeneous information network construction module, a meta-path module, a video multi-modal semantic graph construction module, a graph capsule neural network module, a user embedding extraction module and a recommendation module. The multi-modal information preprocessing module is responsible for extracting visual and audio data from the video and cleaning text information, and then extracting the features of the three modalities of the video, namely vision, audio and text, by means of popular deep learning networks. The multi-modal heterogeneous information network construction module is responsible for extracting the various entities and the semantic relationships among them in the video recommendation system to construct a multi-modal heterogeneous information network. The meta-path module designs four meta-paths to represent four semantic relationships among videos: the same type (i.e., the same-type relationship), the same tag (i.e., the same-tag relationship), the same viewer (i.e., the same-viewing relationship), and viewing by friends (i.e., the friend-viewing relationship). A meta-path-based random walk algorithm is executed on the multi-modal heterogeneous information network to extract rich semantic relationships among videos and construct the video multi-modal semantic graph. The graph capsule neural network module is responsible for extracting multi-modal semantic-enhanced video embeddings from the video multi-modal semantic graph; these embeddings fuse not only the features of the three modalities of the video (visual, audio and text) but also the rich semantic relationships among different videos. The user embedding extraction module extracts multi-modal semantic-enhanced user embeddings (i.e., user feature information) from the video viewing sequence of each user. The recommendation module calculates, from the learned multi-modal semantic-enhanced video embeddings (i.e., video feature information) and user embeddings, the probability that the user will watch each video not yet accessed, and returns the video with the highest probability to the user as the recommendation result. By comprehensively considering the visual, audio and text features of the videos and the rich semantic relationships among them, the recommendation method can greatly improve the accuracy of the multi-modal video embeddings and further improve the accuracy of video recommendation. In addition, the proposed network adopts a pre-training plus fine-tuning training scheme, which can greatly reduce the dependence on the number of labeled samples and improve network training efficiency.
In an exemplary embodiment, the characteristics of the semantic edges can be extracted from the video multi-mode semantic graphs of the video set as the relationship characteristics, and the multi-mode fusion characteristics of each video in the video set are obtained to obtain a fusion characteristic information set by the following manners: converting the video multi-mode semantic graph into a target video adjacency matrix, and obtaining the relation features according to the similarity degree of features between any two video vertexes in the video multi-mode semantic graph represented by the target video adjacency matrix in a plurality of video watching dimensions; and fusing the characteristics of each video in the video set on a plurality of modes into fused characteristic information to obtain the fused characteristic information set, wherein the fused characteristic information is used for representing the multi-mode fused characteristics of the corresponding video.
Optionally, in this embodiment, a detailed process of converting the video multimodal semantic graph into the target video adjacency matrix is described as follows:
To construct the edge set E and the edge-weight set W of the VMG, the application designs four meta-paths ρ1, ρ2, ρ3 and ρ4 to represent the four semantic relationships among videos, namely the same-type relationship, the same-tag relationship, the same-viewing relationship and the friend-viewing relationship. A meta-path-based random-walk construction algorithm for the VMG adjacency matrix is designed: the meta-path-based random walk algorithm is executed on the MHIN to extract a series of paths (contexts) of a specific length, the co-occurrence frequency of any pair of videos is calculated by randomly sampling these paths, the frequency is taken as the meta-path-based similarity of the video pair, and the video adjacency matrix A is finally obtained. When the element a_ij > 0, videos v_i and v_j are connected by an edge e_ij whose weight is w_ij = a_ij; when a_ij = 0, there is no edge between v_i and v_j. Thus, the video adjacency matrix A yields the edge set E and the edge-weight set W. Here the video adjacency matrix A is the target video adjacency matrix; the edge set E of A is used for representing the semantic relationships between the video vertices in terms of video type, video tag and viewing user, and the edge-weight set W is used for representing the degree of association between the videos connected by each semantic relationship, so that the video multi-modal semantic graph G = (V, E, W) is obtained. Given the MHIN and the meta-path set P, the meta-path-based VMG adjacency matrix random-walk construction algorithm (BVMG) comprises the following specific steps:
Step 1: initialize the vertex-context co-occurrence matrix O, setting all of its elements to zero;

Step 2: take an unused meta-path ρ from the meta-path set P, and compute the single-step transition probability matrix M_ρ of the restart random walk based on ρ. Suppose a random walker is located at time t on a vertex v of the MHIN whose type satisfies the meta-path ρ. Then at the next time step t+1 it moves to a neighboring vertex x with probability:

$p(x \mid v, \rho) = \dfrac{1}{|N_{\rho}(v)|}$ if $x \in N_{\rho}(v)$, and $p(x \mid v, \rho) = 0$ otherwise    (1)

where N_ρ(v) denotes the set of all neighbors of vertex v whose type is the one required by the meta-path ρ. Repeatedly calculating the transition probability from each vertex to all of its neighboring vertices yields M_ρ.

Step 3: for any vertex v_i in the video vertex set V, with v_i as the root vertex, launch on the MHIN a restart random walk with restart probability α, transition probability matrix M_ρ and path length l; repeat this m times to obtain m walk paths of length l; each path is a context ctx of vertex v_i; record the set of the m paths of v_i as C_i.

Step 4: for each path ctx in the path set C_i of any vertex v_i in the video vertex set V, implement sampling with window size w: randomly sample a pair of vertices at a time, and after ns samples obtain the list L_i of all sampled vertex pairs; for each vertex pair (v_r, v_t) in L_i, update the elements O_rt and O_tr of the vertex-context co-occurrence matrix (which can be understood as the initial context co-occurrence matrix): O_rt = O_rt + 1, O_tr = O_tr + 1.

Step 5: repeat Step 2 to Step 4 until every meta-path in the meta-path set P has been taken out.

Step 6: from the vertex-context co-occurrence matrix O (which can be understood as the target context co-occurrence matrix), calculate the probability p(v_i, v_j) that vertex v_i occurs together with context vertex v_j, and the marginal probabilities p(v_i) and p(v_j):

$p(v_i, v_j) = \dfrac{O_{ij}}{\sum_{r,t} O_{rt}}, \qquad p(v_i) = \dfrac{\sum_{j} O_{ij}}{\sum_{r,t} O_{rt}}, \qquad p(v_j) = \dfrac{\sum_{i} O_{ij}}{\sum_{r,t} O_{rt}}$    (2)

The element a_ij of the VMG video adjacency matrix A can then be calculated by the following formula:

$a_{ij} = \max\!\left( \log \dfrac{p(v_i, v_j)}{p(v_i)\, p(v_j)},\ 0 \right)$    (3)
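For illustration only, the following Python sketch outlines one possible implementation of the BVMG procedure above. The graph-access dictionary `neighbors`, all parameter names, and the PMI-style conversion used for formulas (2)-(3) are assumptions made for this sketch rather than the exact formulation of the embodiment.

```python
import random

import numpy as np


def build_vmg_adjacency(neighbors, videos, meta_paths, walks_per_vertex=10,
                        walk_length=20, restart_prob=0.15, window_size=5,
                        samples_per_path=20):
    """Sketch of the meta-path-based random-walk construction (BVMG).

    `neighbors[(vertex, meta_path)]` is assumed to give the meta-path-constrained
    neighbor list of a vertex in the MHIN; all names here are placeholders.
    """
    idx = {v: k for k, v in enumerate(videos)}
    O = np.zeros((len(videos), len(videos)))          # vertex-context co-occurrence matrix

    for rho in meta_paths:                            # Step 2 / Step 5: one meta-path at a time
        for v in videos:                              # Step 3: restart random walks rooted at v
            for _ in range(walks_per_vertex):
                path, cur = [v], v
                for _ in range(walk_length - 1):
                    if random.random() < restart_prob:
                        cur = v                       # jump back to the root vertex
                    else:
                        nbrs = neighbors.get((cur, rho), [])
                        if not nbrs:
                            break
                        cur = random.choice(nbrs)     # uniform single-step transition, formula (1)
                    path.append(cur)
                # Step 4: window sampling of co-occurring video pairs along the walk
                for _ in range(samples_per_path):
                    start = random.randrange(max(1, len(path) - window_size + 1))
                    win = [p for p in path[start:start + window_size] if p in idx]
                    if len(win) >= 2:
                        a, b = random.sample(win, 2)
                        O[idx[a], idx[b]] += 1
                        O[idx[b], idx[a]] += 1

    # Step 6 / formulas (2)-(3): turn co-occurrence counts into a PMI-style adjacency matrix
    total = O.sum() or 1.0
    p_joint = O / total
    p_marg = O.sum(axis=1) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / np.outer(p_marg, p_marg))
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
```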
in one exemplary embodiment, the video multimodal semantic graph may be converted to a target video adjacency matrix by, but is not limited to, the following: acquiring a transition probability matrix corresponding to the video multi-mode semantic graph, wherein the transition probability matrix is used for indicating the transition probability of a random walker from one video vertex to each adjacent video vertex in the process of using a random walk construction algorithm to walk the video multi-mode semantic graph; taking an ith video vertex in M video vertices in the video multi-mode semantic graph as a root vertex, taking the root vertex as a starting point of a random walk construction algorithm, expanding random walks with preset path length for P times according to a meta-path set, preset restart probability and the transition probability matrix to obtain Q long walk paths corresponding to the ith video vertex, wherein the Q long walk paths form a context of the ith video vertex, the meta-path set comprises meta-paths used for representing relation characteristics between each video and other videos in the video set, the restart probability is used for indicating the probability of each step of the random walk to jump back to the starting point in the process of each random walk, i is a positive integer greater than or equal to 1 and less than or equal to M, and Q is a positive integer less than or equal to P; sampling the size of a preset window for the Q long-distance walking paths in sequence to obtain an ith vertex pair list of the ith video vertex, wherein when each sampling is recorded in the ith vertex pair list, a pair of video vertices at two ends of the sampling are sampled, and the length of the size of the preset window is more than 2 and less than the length of the preset path; and generating the target video adjacency matrix according to M vertex pair lists corresponding to the M video vertices.
Optionally, in this embodiment, a transition probability matrix corresponding to the video multi-modal semantic graph is acquired, where the transition probability matrix may be understood as the single-step transition probability matrix M_ρ computed in Step 2 above.
Optionally, in this embodiment, taking the i-th video vertex of the M video vertices in the video multi-modal semantic graph as the root vertex, taking the root vertex as the starting point of the random walk construction algorithm, and expanding P random walks of the preset path length according to the meta-path set, the preset restart probability and the transition probability matrix to obtain Q long walk paths corresponding to the i-th video vertex, may be understood as Step 3 above, where the i-th video vertex is the root vertex, i.e., an arbitrary vertex v_i is taken as the root vertex, the meta-path set is P, the preset restart probability is α, the transition probability matrix is M_ρ, the preset path length takes the value l, and the random walk is expanded P times (P takes the value m), obtaining m walk paths of length l; the path set C_i corresponds to the Q long walk paths (Q takes the value m); each path is a context ctx of vertex v_i; the set of the m paths of v_i is recorded as C_i.

Optionally, in this embodiment, sequentially sampling the Q long walk paths with the preset window size to obtain the i-th vertex pair list of the i-th video vertex, where the i-th vertex pair list records, for each sampling, the pair of video vertices at the two ends of the sample, and the length of the preset window size is greater than 2 and less than the preset path length, may be understood as Step 4 above: for each path ctx (i.e., long walk path) in the path set C_i of any vertex v_i in the video vertex set V, sampling with window size w (i.e., the preset window size, whose length is greater than 2 and less than the preset path length) is implemented. The vertex pair list is the list L_i.
In one exemplary embodiment, the root vertex may be used as the starting point of the random walk construction algorithm and P random walks of the preset path length may be expanded according to the meta-path set, the preset restart probability and the transition probability matrix by, but not limited to, the following manner: randomly taking from the meta-path set a meta-path that has not yet participated in the random walk as the walk meta-path, taking the root vertex as the starting point of the random walk construction algorithm, and expanding P random walks of the preset path length according to the walk meta-path taken from the meta-path set, the preset restart probability and the transition probability matrix, until all meta-paths in the meta-path set have participated in the random walk, wherein the meta-paths in the meta-path set include: a first meta-path, a second meta-path, a third meta-path and a fourth meta-path, which respectively represent, in order, the same-type relationship, the same-tag relationship, the same-viewing relationship and the friend-viewing relationship in the video multi-modal semantic graph, where the same-type relationship represents that the types of 2 videos are the same, the same-tag relationship represents that the tags of 2 videos are the same, the same-viewing relationship represents that 2 videos are watched by the same user in the user set, and the friend-viewing relationship represents that 2 videos are respectively watched by a pair of friends in the user set.
Optionally, in this embodiment, the first meta-path, the second meta-path, the third meta-path and the fourth meta-path respectively correspond to the four meta-paths ρ1, ρ2, ρ3 and ρ4 in the meta-path set P.
In one exemplary embodiment, the target video adjacency matrix may be generated from the M vertex pair lists corresponding to the M video vertices by, but not limited to, the following manner: updating the initial context co-occurrence matrix according to the M vertex pair lists to obtain a target context co-occurrence matrix, wherein the element O_mq of the target context co-occurrence matrix represents the number of times that the m-th video and the q-th video co-occur in the same context, the m-th video is the video corresponding to the m-th video vertex, the q-th video is the video corresponding to the q-th video vertex, the target context co-occurrence matrix is a symmetric square matrix of order M, and m and q are positive integers greater than or equal to 1 and less than or equal to M; and generating the target video adjacency matrix according to the target context co-occurrence matrix.

Optionally, in this embodiment, updating the initial context co-occurrence matrix according to the M vertex pair lists to obtain the target context co-occurrence matrix may be understood as the operation in Step 4 above, "randomly sample a pair of vertices at a time, and after ns samples obtain the list L_i of all sampled vertex pairs; for each vertex pair (v_r, v_t) in L_i, update the elements O_rt and O_tr of the vertex-context co-occurrence matrix: O_rt = O_rt + 1, O_tr = O_tr + 1", repeated until all meta-paths in the meta-path set P have been taken out, which yields the target context co-occurrence matrix.

Alternatively, in this embodiment, generating the target video adjacency matrix according to the target context co-occurrence matrix may be understood as Step 6 above: each element a_ij of the target video adjacency matrix (video adjacency matrix A) is determined from the target context co-occurrence matrix (vertex-context co-occurrence matrix O).

In one exemplary embodiment, the target context co-occurrence matrix may be obtained by, but not limited to, updating the initial context co-occurrence matrix from the M vertex pair lists in the following manner: obtaining, from the M vertex pair lists, the number n_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex, where r and t are positive integers greater than or equal to 1 and less than or equal to M, and r is not equal to t; and increasing the values of the elements O_rt and O_tr of the initial context co-occurrence matrix by n_rt respectively to obtain the target context co-occurrence matrix, where all elements of the initial context co-occurrence matrix are 0.

Optionally, in this embodiment, after the number n_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex is obtained from the M vertex pair lists, the elements O_rt and O_tr of the initial context co-occurrence matrix are increased by n_rt respectively to obtain the target context co-occurrence matrix; for example, if the number n_rt of vertex pairs consisting of the r-th video vertex and the t-th video vertex is 1, then the values of O_rt and O_tr are each increased by 1.
in one exemplary embodiment, features of each video in the video set itself over multiple modalities may be fused into fused feature information by, but not limited to: the c fusion characteristic information of a c-th video in M videos included in the video set is obtained through the following steps, wherein c is a positive integer greater than or equal to 1 and less than or equal to M: extracting a visual feature vector of the c-th video, wherein the visual feature vector is used for representing the features of the c-th video under the self visual mode; extracting an audio feature vector of the c-th video, wherein the audio feature vector is used for representing the feature of the c-th video under the audio mode of the c-th video; extracting a text feature vector of the c-th video, wherein the text feature vector is used for representing the feature of the c-th video under the own text mode; and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information.
Optionally, in this embodiment, the fused feature information may be obtained by fusing the feature vectors of multiple modalities with a multi-modal capsule network. The concept of the capsule network is first introduced: a conventional CNN (Convolutional Neural Network) is formed by stacking multiple convolutional layers, each composed of a number of mutually independent neurons. Each neuron uses a single scalar output to summarize the activity of the repeated feature detectors within a local region; each convolutional layer extracts the features of the local region through convolution kernels, and view invariance is achieved by max pooling. In such an architecture, high-level features are a weighted sum of combinations of low-level features. Although the max pooling operation extracts the most important features of the local region, the relative spatial relationship between different features is ignored, so that the positional relationship between high-level and low-level features becomes ambiguous. To overcome this deficiency, Hinton et al. proposed the capsule network, in which each capsule is responsible for identifying a visual entity implicitly defined within a limited range of viewing conditions and deformations, and outputs both the probability that the entity exists within that range and a set of "instantiation parameters", which may include pose, lighting conditions and deformation information relative to this visual entity. When a visual entity moves within the limited range, the probability that the entity exists in that region is unchanged, but the instantiation parameters change accordingly. That is, the capsule network encodes spatial information while also calculating the probability that an object is present. The output of a capsule can be represented by a vector whose modulus represents the probability that the feature exists and whose direction represents the pose information of the feature.
With reference to the above idea, in this embodiment a multi-modal capsule network is provided to fuse the multi-modal features of a single video. Specifically, Fig. 5 is a schematic diagram of fusing feature vectors into fused feature information according to an embodiment of the present application. As shown in Fig. 5, the feature vector x_i^k of one modality k of video v_i is multiplied by a parameter matrix W^k to be learned (of suitable matrix dimensions) to obtain a new feature vector ~x_i^k; each new feature vector ~x_i^k is multiplied by its weight c_i^k, and the weighted input vectors are summed to obtain the vector s_i; the vector s_i is then converted by a nonlinear activation function (squash) into the multi-modal representation h_i of video v_i.
The above multi-modal feature h_i corresponds to the c-th fusion feature information, and the above describes the use of the multi-modal capsule network after training. The multi-modal capsule network can be trained by a multi-modal capsule network dynamic routing algorithm, whose related formulas are as follows:
$\tilde{x}_i^{k} = W^{k} x_i^{k}$    (4)

$b_i^{k} = 0$    (5)

$c_i^{k} = \exp(b_i^{k}) \big/ \sum_{k'} \exp(b_i^{k'})$    (6)

$s_i = \sum_{k} c_i^{k}\,\tilde{x}_i^{k}$    (7)

$h_i = \operatorname{squash}(s_i) = \dfrac{\lVert s_i \rVert^{2}}{1 + \lVert s_i \rVert^{2}} \cdot \dfrac{s_i}{\lVert s_i \rVert}$    (9)

$b_i^{k} \leftarrow b_i^{k} + \tilde{x}_i^{k} \cdot s_i$    (10)
wherein,representing the iteration number; />And->Representing video +.>One modality->Initial coupling coefficient and normalized coupling coefficient of the capsule and the multimode capsule; />Representing a natural exponential function; />Representing the modular operation of the vector. />Representing a nonlinear activation function.
Specifically, the multi-mode capsule network dynamic routing algorithm comprises the following steps:
Step 1: apply the linear transformation of formula (4) to the three modal feature vectors x_i^k of video v_i to obtain the new modal feature vectors ~x_i^k;

Step 2: initialize the temporary coupling coefficients b_i^k between the three modal feature vectors of video v_i and the capsule network neurons according to formula (5);

Step 3: iteratively perform the following steps; for the r-th iteration, compute the normalized coupling coefficients c_i^k between the three modal feature vectors of video v_i and the capsule network neurons according to formula (6);

Step 4: according to the normalized coupling coefficients c_i^k calculated in the previous step and the three modal feature vectors ~x_i^k of video v_i, calculate the input vector s_i of the capsule network neuron according to formula (7);

Step 5: apply the squash nonlinear operation of formula (9) to the input vector s_i obtained in the previous step, and calculate the multi-modal fusion feature h_i of video v_i after r iterations;

Step 6: update the temporary coupling coefficients according to formula (10);

Step 7: when r < I, update the iteration ordinal r = r + 1;

Step 8: repeat Step 3 to Step 7 until r = I, and output the multi-modal fusion feature h_i of video v_i at this time.
In summary, training the multi-modal capsule network with the multi-modal capsule network dynamic routing algorithm means updating the coupling weights b_i^k for I rounds to obtain the final normalized weights c_i^k, which yields the trained multi-modal capsule network; the trained multi-modal capsule network can then output the target fusion feature vector h_i of video v_i as the fusion feature information.
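A minimal NumPy sketch of this fusion procedure is given below, assuming one projection matrix per modality and a small number of routing iterations; the coupling update in the loop follows the description of formula (10) above, and all shapes and names are illustrative assumptions rather than the exact embodiment.

```python
import numpy as np


def squash(s):
    """Nonlinear activation of formula (9): keeps the direction, bounds the norm below 1."""
    norm2 = float(np.dot(s, s))
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)


def fuse_modalities(x_visual, x_audio, x_text, W, num_iters=3):
    """Sketch of the multi-modal capsule dynamic routing described above.

    W is a dict of per-modality projection matrices (assumed shapes); the routing
    details are illustrative only.
    """
    feats = {"v": x_visual, "a": x_audio, "t": x_text}
    u = {k: W[k] @ x for k, x in feats.items()}      # Step 1: linear transformation, formula (4)
    b = {k: 0.0 for k in u}                          # Step 2: temporary coupling coefficients, formula (5)

    for _ in range(num_iters):                       # Steps 3-7, repeated I times
        exps = {k: np.exp(bk) for k, bk in b.items()}
        z = sum(exps.values())
        c = {k: e / z for k, e in exps.items()}      # normalized coupling coefficients, formula (6)
        s = sum(c[k] * u[k] for k in u)              # weighted sum of modality capsules, formula (7)
        h = squash(s)                                # multi-modal fusion feature, formula (9)
        b = {k: b[k] + float(u[k] @ s) for k in u}   # coupling update in the spirit of formula (10)
    return h                                         # Step 8: fused feature of the video


# Usage sketch with random placeholder features (dimensions are illustrative):
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(128, d)) for k, d in {"v": 2048, "a": 1024, "t": 300}.items()}
h = fuse_modalities(rng.normal(size=2048), rng.normal(size=1024), rng.normal(size=300), W)
```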
Optionally, in this embodiment, the multi-modal capsule network shown in Fig. 5 includes: a visual capsule, an audio capsule, a text capsule and a multi-modal capsule, where the feature vector x_i^v in the visual capsule represents the visual feature vector, the feature vector x_i^a in the audio capsule represents the audio feature vector, the feature vector x_i^t in the text capsule represents the text feature vector, and h_i is the target fusion feature vector.
In one exemplary embodiment, the visual feature vector of the c-th video may be extracted by, but not limited to, the following manner: sampling the c-th video at a first preset time interval to obtain n_v frame pictures corresponding to the c-th video; inputting each of the n_v frame pictures into an image feature extraction model to obtain n_v picture feature vectors output by the image feature extraction model; and generating the visual feature vector of the c-th video according to the n_v picture feature vectors.
Optionally, in this embodiment, the visual feature vector of the c-th video may be extracted by, but not limited to, the following manner: for video v_i, extract n_v frame pictures at equal time intervals by means of the FFmpeg tool software to form a key frame sequence. Content features are then extracted with a ResNet-152 (a deep convolutional neural network model) pre-trained on the ImageNet dataset. Specifically, each frame is first randomly cropped to 224 x 224 and input to ResNet-152 for feature extraction, obtaining a d_v-dimensional (e.g., 2048-dimensional) visual feature vector. Finally, the n_v visual feature vectors of each video are averaged to obtain the final visual feature vector x_i^v.
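A hedged sketch of this visual branch is shown below; the FFmpeg sampling filter, the torchvision model call and the resize-before-crop step are assumptions for illustration and may differ from the exact preprocessing of the embodiment.

```python
import subprocess
from pathlib import Path

import torch
from PIL import Image
from torchvision import models, transforms


def extract_visual_feature(video_path: str, frame_dir: str) -> torch.Tensor:
    """Sample frames at (approximately) equal time intervals, encode them with a
    pretrained ResNet-152, and average the per-frame features."""
    Path(frame_dir).mkdir(parents=True, exist_ok=True)
    # One frame per second as a stand-in for equal-interval key-frame sampling.
    subprocess.run(["ffmpeg", "-i", video_path, "-vf", "fps=1",
                    f"{frame_dir}/frame_%04d.jpg"], check=True)

    resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    resnet.fc = torch.nn.Identity()          # keep the 2048-d penultimate features
    resnet.eval()

    prep = transforms.Compose([
        transforms.Resize(256),              # resize so the random 224x224 crop always fits
        transforms.RandomCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    feats = []
    for frame in sorted(Path(frame_dir).glob("frame_*.jpg")):
        img = Image.open(frame).convert("RGB")
        with torch.no_grad():
            feats.append(resnet(prep(img).unsqueeze(0)).squeeze(0))
    return torch.stack(feats).mean(dim=0)    # final visual feature vector of the video
```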
In one exemplary embodiment, the audio feature vector of the c-th video may be extracted by, but not limited to, the following manner: extracting the audio modality data of the c-th video; dividing the audio modality data, along the time dimension, into n_a segments of sub-audio modality data according to a second preset time interval; inputting each of the n_a segments of sub-audio modality data into an audio feature extraction model to obtain n_a audio segment feature vectors; and generating the audio feature vector of the c-th video according to the n_a audio segment feature vectors.
Optionally, in this embodiment, the audio feature vector of the c-th video may be extracted by, but not limited to, the following manner: for video v_i, separate the complete audio modality data by means of the FFmpeg tool software, and divide it equally along the time dimension into n_a segments to form an audio segment sequence. A pre-trained SoundNet neural network (a deep learning model for audio classification and audio understanding) is used to extract a d_a-dimensional (e.g., 1024-dimensional) audio feature vector for each segment. Finally, the n_a audio feature vectors of each video are averaged to obtain the final audio feature vector x_i^a.
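The audio branch could be sketched as follows; `encode_segment` is a hypothetical stand-in for a pretrained audio encoder such as SoundNet (it is not a real API call), and the FFmpeg arguments and sampling rate are assumptions.

```python
import subprocess

import numpy as np
import soundfile as sf   # assumed dependency for reading the extracted WAV file


def extract_audio_feature(video_path, wav_path, encode_segment, num_segments=8):
    """Separate the audio track, split it into equal-length segments, encode every
    segment, and average the segment features into one audio feature vector."""
    subprocess.run(["ffmpeg", "-i", video_path, "-vn", "-acodec", "pcm_s16le",
                    "-ar", "22050", "-ac", "1", wav_path], check=True)

    audio, sample_rate = sf.read(wav_path)
    seg_len = len(audio) // num_segments
    segments = [audio[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

    # encode_segment(segment, sample_rate) is assumed to return e.g. a 1024-d vector.
    feats = [encode_segment(seg, sample_rate) for seg in segments]
    return np.mean(np.stack(feats), axis=0)   # final audio feature vector of the video
```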
In one exemplary embodiment, the text feature vector of the c-th video may be extracted by, but not limited to, the following manner: extracting n_t video texts from the text associated with the c-th video; inputting each of the n_t video texts into a text feature extraction model to obtain n_t text segment feature vectors; and generating the text feature vector of the c-th video according to the n_t text segment feature vectors.
Optionally, in this embodiment, the text feature vector of the c-th video may be extracted by, but not limited to, the following manner: the text description of a video includes the video title, video summary, tags, subtitles, user comments, etc., which may be understood, but are not limited to, as the text associated with the video described above; the present application focuses mainly on the video title, summary and tags. First, the text data associated with the video is cleaned to remove characters that do not match the language type as well as stop words, and the text length is aligned to n_w. For texts whose number of words nw is greater than n_w, the text is truncated, leaving only the first n_w words; for texts whose number of words nw is smaller than n_w, the remaining (n_w - nw) positions are filled with "Null". The cleaned text data can then be expressed as a word sequence, where for each non-Null word a pre-trained GloVe (Global Vectors for Word Representation) model is used to generate a d_t-dimensional word vector. Finally, the word vectors of each video are averaged to obtain the final text feature vector x_i^t.
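A sketch of the text branch under these conventions might look as follows; the GloVe file path, the cleaning regex and the stop-word handling are assumptions for illustration.

```python
import re

import numpy as np


def load_glove(path):
    """Load pretrained GloVe vectors from a whitespace-separated text file (path assumed)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def extract_text_feature(title, summary, tags, glove, max_len=30, dim=300,
                         stop_words=frozenset()):
    """Clean and length-align the title/summary/tags, look up GloVe vectors for the
    non-Null words, and average them into one text feature vector."""
    text = " ".join([title, summary, " ".join(tags)]).lower()
    words = [w for w in re.findall(r"[a-z']+", text) if w not in stop_words]
    words = words[:max_len] + ["Null"] * max(0, max_len - len(words))   # truncate or pad

    vecs = [glove[w] for w in words if w != "Null" and w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)
```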
In one exemplary embodiment, the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video may be fused into the c-th target fusion feature vector by, but not limited to, the following manner: performing D rounds of adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video, to obtain a target capsule network fusion model, where the weight parameters include a visual weight parameter, an audio weight parameter and a text weight parameter; the visual weight parameter is used for indicating the weight of the features of the video in its visual modality during feature fusion by the capsule network fusion model, the audio weight parameter is used for indicating the weight of the features of the video in its audio modality during feature fusion by the capsule network fusion model, and the text weight parameter is used for indicating the weight of the features of the video in its text modality during feature fusion by the capsule network fusion model; the capsule network fusion model used in the d-th round is the capsule network fusion model obtained after the (d-1)-th round of weight parameter adjustment is completed, D is a preset positive integer greater than or equal to 1, d is a positive integer greater than or equal to 1 and less than or equal to D, and when d takes the value 1 the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not yet been adjusted; and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector.
Optionally, in this embodiment, the capsule network fusion model may refer to, but is not limited to, the multi-modal capsule network shown in Fig. 5; performing D rounds of adjustment on the weight parameters of the capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain the target capsule network fusion model corresponds to the process of updating the coupling weights b_i^k of the multi-modal capsule network for I rounds with the multi-modal capsule network dynamic routing algorithm.
Optionally, in this embodiment, the weight parameters include a visual weight parameter, an audio weight parameter and a text weight parameter, which respectively correspond to the coupling coefficients of the visual capsule, the audio capsule and the text capsule in Fig. 5 (i.e., c_i^v, c_i^a and c_i^t).
Optionally, in this embodiment, fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector corresponds to the process in Fig. 5 of obtaining the trained multi-modal capsule network (the target capsule network fusion model) after the coupling weights b_i^k have been updated for I rounds, and using it to output the multi-modal feature h_i (the target fusion feature vector).
In one exemplary embodiment, the D-round adjustment of the weight parameters of the capsule network fusion model may be performed using the visual feature vector, the audio feature vector, and the text feature vector corresponding to the c-th video by, but not limited to: the method comprises the following steps of carrying out d-th round adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video: determining a capsule network fusion model obtained after the d-1 th round of adjustment of the weight parameters is completed as a capsule network fusion model used for the d round of fusion; inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the d-th wheel fusion used capsule network fusion model to obtain a reference fusion feature vector output by the d-th wheel fusion used capsule network fusion model; and adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the d+1-th round of fusion.
Optionally, in this embodiment, the process of updating the weight parameters for I rounds with the multi-modal capsule network dynamic routing algorithm is illustrated in Fig. 5; when the weight parameters of the capsule network fusion model are adjusted for D rounds by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video, D may be, but is not limited to, less than or equal to I.
In an exemplary embodiment, the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video may be input to the capsule network fusion model for d-th round fusion, to obtain a reference fusion feature vector output by the capsule network fusion model for d-th round fusion, in the following manner: the capsule network fusion model for the d-th round fusion uses the following steps to fuse the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a reference fusion feature vector output by the capsule network fusion model for the d-th round fusion: respectively carrying out linear transformation on the visual feature vector, the audio feature vector and the text feature vector to obtain corresponding linear visual feature vector, linear audio feature vector and linear text feature vector; performing accumulation and calculation on a weighted visual feature vector, a weighted audio feature vector and a weighted text feature vector to obtain a weighted fusion feature vector, wherein the weighted visual feature vector is obtained by performing outer product operation on a visual weight parameter and a linear visual feature vector of a capsule network fusion model used by the d-th round fusion, the weighted audio feature vector is obtained by performing outer product operation on an audio weight parameter and a linear audio feature vector of a capsule network fusion model used by the d-th round fusion, and the weighted text feature vector is obtained by performing outer product operation on a text weight parameter and a linear text feature vector of a capsule network fusion model used by the d-th round fusion; and converting the weighted fusion feature vector into the reference fusion feature vector by adopting a nonlinear activation function.
Optionally, in this embodiment, performing linear transformation on the visual feature vector, the audio feature vector and the text feature vector respectively to obtain the corresponding linear visual feature vector, linear audio feature vector and linear text feature vector corresponds to Step 1 of the multi-modal capsule network dynamic routing algorithm: the linear transformation of formula (4) is applied to the three modal feature vectors x_i^k of video v_i to obtain the new modal feature vectors ~x_i^k, where x_i^k is the visual feature vector, the audio feature vector or the text feature vector.
Optionally, in this embodiment, the weighted visual feature vector is obtained by performing an outer product operation on the visual weight parameter of the capsule network fusion model used in the d-th round of fusion and the linear visual feature vector, the weighted audio feature vector is obtained by performing an outer product operation on the audio weight parameter of the capsule network fusion model used in the d-th round of fusion and the linear audio feature vector, and the weighted text feature vector is obtained by performing an outer product operation on the text weight parameter of the capsule network fusion model used in the d-th round of fusion and the linear text feature vector; this corresponds to Step 2, Step 3 and Step 4 of the multi-modal capsule network dynamic routing algorithm. Taking the calculation of the weighted visual feature vector as an example: given the visual weight parameter b_i^v of the capsule network fusion model used in the d-th round of fusion, the normalized coupling coefficient c_i^v is calculated, and the outer product operation on c_i^v and the linear visual feature vector ~x_i^v gives the weighted visual feature vector. The weighted audio feature vector and the weighted text feature vector are obtained similarly.
Optionally, in this embodiment, performing the accumulation-and-sum calculation on the weighted visual feature vector, the weighted audio feature vector and the weighted text feature vector to obtain the weighted fusion feature vector corresponds to formula (7), where s_i represents the weighted fusion feature vector.
Optionally, in this embodiment, converting the weighted fusion feature vector into the reference fusion feature vector by using the nonlinear activation function corresponds to formula (9), where h_i represents the reference fusion feature vector and squash(s_i) represents the nonlinear operation performed on the weighted fusion feature vector s_i.
In one exemplary embodiment, the weight parameters of the capsule network fusion model used for the d-th round of fusion may be adjusted by, but not limited to, using the reference fusion feature vector to obtain the capsule network fusion model to be used for the d+1-th round of fusion: obtaining adjustment parameters, wherein the adjustment parameters comprise visual adjustment parameters, audio adjustment parameters and text adjustment parameters, the visual adjustment parameters are obtained by performing outer product operation on the linear visual feature vector and the weighted fusion feature vector, the audio adjustment parameters are obtained by performing outer product operation on the linear audio feature vector and the weighted fusion feature vector, and the text adjustment parameters are obtained by performing outer product operation on the linear text feature vector and the weighted fusion feature vector; and performing addition operation on the visual weight parameters of the capsule network fusion model used in the d-th round of fusion and the visual adjustment parameters to obtain the visual weight parameters of the capsule network fusion model used in the d+1-th round of fusion, performing addition operation on the audio weight parameters of the capsule network fusion model used in the d-th round of fusion and the audio adjustment parameters to obtain the audio weight parameters of the capsule network fusion model used in the d+1-th round of fusion, and performing addition operation on the text weight parameters of the capsule network fusion model used in the d-th round of fusion and the text adjustment parameters to obtain the text weight parameters of the capsule network fusion model used in the d+1-th round of fusion.
Optionally, in this embodiment, adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the (d+1)-th round of fusion corresponds to Step 6 of the multi-modal capsule network dynamic routing algorithm, where ~x_i^k · s_i represents the adjustment parameter and b_i^k represents the visual weight parameter, the audio weight parameter or the text weight parameter of the capsule network fusion model used in the d-th round of fusion.
In one exemplary embodiment, the determining the target video to be recommended to the target user in the video set according to the user characteristic information of the target user and the video characteristic information set may include, but is not limited to, the following steps: determining the similarity between each piece of video characteristic information in the video characteristic information set and the user characteristic information; and determining the video corresponding to the video characteristic information with the similarity larger than the target similarity threshold as the target video.
In an exemplary embodiment, before the determining, in the video set, the target video to be recommended to the target user according to the user feature information of the target user and the video feature information set, the following manner may be included, but is not limited to: acquiring nth user characteristic information in the user characteristic information set, wherein the nth user characteristic information is used for indicating the characteristics of videos preferred by the nth user in the user set; acquiring an nth video viewing sequence corresponding to the nth user, wherein the video viewing sequence records the playing sequence of the video of the played video of the corresponding user, and the user set comprises N users, wherein N is a positive integer which is greater than or equal to 1 and less than or equal to N; acquiring the video characteristic information corresponding to each video in the nth video watching sequence from the video characteristic information set to obtain an nth reference video characteristic information set; and merging all the reference video feature information in the nth reference video feature information set into a feature vector to obtain the nth user feature information.
Optionally, in this embodiment, the generation manner of each piece of user feature information in the user feature information set may be, but is not limited to, the following: in a first manner, the one-hot encoded user identity can simply be used as the input feature; in a second manner, the video viewing sequence S_u of each user u and the previously learned video multi-modal semantic-enhanced embeddings (which may be understood as the video feature information set described above) can be used to learn a multi-modal semantic-enhanced user embedding (which may be understood as the user feature information described above). Denote the set of embeddings of the videos in S_u as H_u; all of them are obtained directly, by index on the video number, from the learned video multi-modal semantic-enhanced embeddings (the video feature information set).
Optionally, in this embodiment, merging all the reference video feature information in the nth reference video feature information set into one feature vector to obtain the nth user feature information may be, but is not limited to, implemented by a user vertex capsule:
First, a user vertex capsule function is designed to perform truncation or padding on the video viewing sequence S_u of user u, so that the video viewing sequence length of every user is the same, namely n_s. Specifically, when |S_u| >= n_s, the most recent n_s videos are taken from S_u as the viewing sequence of user u; when |S_u| < n_s, the sequence is padded on the left by repeating the earliest video in S_u (n_s - |S_u|) times and inserting the repetitions to the left of S_u. For the processed user video viewing sequence, the user vertex capsule performs a weighted combination of the corresponding video embeddings and outputs the user embedding e_u, which represents one piece of user feature information; performing this transformation for every user finally yields the embedding set E_U, which represents the user feature information set.
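As a rough illustration of this truncation/padding and weighted combination, consider the following sketch; the uniform fallback weights stand in for the learned user vertex capsule weights and are an assumption.

```python
import numpy as np


def user_embedding(view_sequence, video_embeddings, seq_len=20, weights=None):
    """Truncate or left-pad the viewing sequence to a fixed length, look up the learned
    video embeddings, and combine them into one user embedding."""
    seq = list(view_sequence)
    if len(seq) >= seq_len:
        seq = seq[-seq_len:]                           # keep the most recent seq_len videos
    else:
        seq = [seq[0]] * (seq_len - len(seq)) + seq    # left-pad by repeating the earliest video

    E = np.stack([video_embeddings[v] for v in seq])   # (seq_len, d) multimodal embeddings
    w = np.full(seq_len, 1.0 / seq_len) if weights is None else weights
    return w @ E                                       # weighted combination -> user embedding
```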
In one exemplary embodiment, the video feature information set may be obtained by, but is not limited to, adding the relationship feature to the fused feature information set by: inputting a target video adjacency matrix and the fusion characteristic information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relationship characteristics among video vertices on video types, video labels and watching users and the association degree among videos connected by each relationship characteristic, and the target fusion network is used for updating each fusion characteristic information in the input fusion characteristic information set into corresponding video characteristic information according to the relationship characteristics represented by the target video adjacency matrix to obtain the video characteristic information set.
Optionally, in this embodiment, the semantic relationships may be, but are not limited to, added to the fusion feature information set through a target fusion network to obtain the video feature information set. Fig. 6 is a schematic diagram of a target fusion network according to an embodiment of the present application; as shown in Fig. 6, the target video adjacency matrix (A) and the fusion feature information set (H) are input into the target fusion network to obtain the video feature information set output by the target fusion network.
In one exemplary embodiment, the target fusion network may include, but is not limited to: an input layer and L graph capsule convolution layers, where the first graph capsule convolution layer of the L graph capsule convolution layers includes basic video vertex capsules, the remaining graph capsule convolution layers include advanced video vertex capsules, and the L-th (last) graph capsule convolution layer further includes final video vertex capsules. The basic video vertex capsules are used for performing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix, to obtain the convolution fusion feature vector set output by the first graph capsule convolution layer; the advanced video vertex capsules in the l-th graph capsule convolution layer are used for performing a second convolution operation on the received convolution fusion feature vectors according to the target video adjacency matrix, to obtain the convolution fusion feature vectors output by the l-th graph capsule convolution layer and input to the advanced video vertex capsules in the (l+1)-th graph capsule convolution layer; and the final video vertex capsules are used for performing a third convolution operation on the convolution fusion feature vectors output by the advanced video vertex capsules of the L-th graph capsule convolution layer, the third convolution operation being used for aggregating the convolution fusion feature vectors and outputting the video feature information set.
Optionally, in this embodiment, as shown in Fig. 6, the target fusion network is the graph capsule neural network MSGCN, which is composed of an input layer and L graph capsule convolution layers. The first graph capsule convolution layer includes the basic video vertex capsules and applies a GNN to extract local vertex features with different receptive fields; all graph capsule convolution layers except the first include advanced video vertex capsules, which ensure that the feature dimension of the output of the l-th graph capsule convolution layer stays fixed. In addition, the last graph capsule convolution layer also includes the final video vertex capsules, which reduce the output of the L-th graph capsule convolution layer to the final embedding dimension. The specific designs of the three kinds of capsules are as follows:
(1) Basic video vertex capsule. In the video multi-modal semantic graph, consider a video vertex v_i and its neighbor vertex set N(v_i); to simplify the discussion, it is specified that v_i is contained in its own neighbor vertex set, i.e., v_i ∈ N(v_i). For a conventional graph neural network with L graph convolution layers, the l-th graph convolution layer performs the graph convolution operation on the features of v_i. The function takes the f-th dimension feature values of the multi-modal fusion features of v_i and all of its neighbors N(v_i) as input, and outputs a new scalar after the graph convolution calculation:
$z_{i,f} = \sum_{v_j \in N(v_i)} a_{ij}\, h_{j,f}$    (11)

where a_ij represents the weight of the connecting edge between videos v_i and v_j, i.e., the similarity between v_i and v_j, obtained by the meta-path-based VMG adjacency matrix random-walk construction algorithm. After the graph convolution operation defined by formula (11) is performed on the f-th dimension input feature value of video v_i, the f-th new multi-modal fusion feature value z_{i,f} is obtained; then, after a linear transformation and a nonlinear activation operation, the output is:

$h'_{i,f} = \sigma\!\left(W_f\, z_{i,f}\right)$    (12)

where W_f is a weight parameter to be learned and σ is a nonlinear activation function such as the ReLU function.
In order to capture more local information between a video and its neighbors, on the basis of the conventional graph convolution operation, this application designs a basic video vertex capsule based on higher-order statistical moments of the video multi-modal feature random variables, packaging this local information into so-called instantiation parameters to form an informative basic video vertex capsule:
$h_{i,f} = \big[\,\mu_{i,f},\ \sigma^{2}_{i,f},\ \ldots,\ M^{(p)}_{i,f}\,\big]$    (13)

where p represents the highest order of the statistical moments of the video multi-modal feature random variables; μ_{i,f} and σ²_{i,f} represent the mean and the variance of the f-th dimension feature values of the multi-modal fusion features of video v_i and all of its neighbors N(v_i), and M^(p)_{i,f} denotes the p-th order moment. Similarly, after the graph capsule convolution operation defined by formula (13) is performed on the f-th dimension input feature value of v_i, the f-th new multi-modal fusion feature is obtained. Thus, for the vertex feature matrix H composed of the multi-modal fusion features of all videos, the first layer of the designed graph capsule network produces the output H^(1) = GCaps^(1)(H, A), where GCaps^(1) denotes the graph capsule convolution operation of the first layer, which takes the video multi-modal fusion features H and the adjacency matrix A as input and is the matrix form of the graph capsule convolution operation defined by formula (13). It is not difficult to see that, as p increases, the output feature dimension of the subsequent graph capsule convolution layers grows rapidly and may even become too large to handle.
(2) Advanced video vertex capsule. To prevent this growth, it is proposed to keep the feature dimension of the output of the l-th graph capsule convolution layer fixed. This can be achieved as follows: at the l-th layer, the graph capsule network receives the input H^(l-1); for each video v_i, its features are vectorized and output through the basic video vertex capsule function. An advanced video vertex capsule function is designed which fixes the first two dimensions (which respectively denote the video v_i and the f-th dimensional feature) and performs a weighted combination of the q capsules in the last two dimensions, producing the output:

$s_{i,f} = \sum_{q'=1}^{q} c_{q'}\, W_{q'}\, u_{i,f,q'}$    (14)

$\hat{h}_{i,f} = \operatorname{squash}(s_{i,f})$    (15)

where c_{q'} denotes the coupling coefficient between the lower-level capsule u_{i,f,q'} and the higher-level capsule, calculated by the dynamic routing algorithm, and W_{q'} is a weight parameter to be learned. Each dimensional feature of every video v_i is subjected to the transformations defined by formulas (14)-(15), and the output H^(l) is finally obtained. In short, the graph capsule network receives two inputs, H^(l-1) and A, at the l-th layer, and generates the output H^(l). When l = L, the video multi-modal features at that layer are output.
(3) Final video vertex capsule. Similarly, a video vertex capsule function is designed in the manner specified by formulas (14)-(15): it fixes the first dimension and performs a weighted combination of the capsules in the last two dimensions, producing the final embedding of each video; performing this transformation for every video v_i finally yields the multi-modal semantic-enhanced embeddings of all videos.
In an exemplary embodiment, before the target video adjacency matrix and the fusion feature information set are input into a target fusion network to obtain the video feature information set output by the target fusion network, the method may, but is not limited to, further include the following manners: acquiring an initial fusion network; performing X-round video classification training on the initial fusion network to obtain a target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, and the accuracy of the target pre-training fusion network on video classification is greater than the target accuracy; and performing Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network, wherein Y is a positive integer greater than or equal to 1.
Optionally, in this embodiment, the training method of the target fusion network includes a pre-training stage and a fine-tuning stage, and considering that a part of videos have category label information, a video classification task is designed as an auxiliary task to implement pre-training on the target fusion network.
In one exemplary embodiment, the initial fusion network may be, but is not limited to, subjected to X rounds of video classification training to obtain the target pre-training fusion network in the following manner: the x-th round of video classification training among the X rounds of video classification training is performed on the initial fusion network through the following steps: in the x-th round of video classification training, classifying the video samples labeled with video type labels by using the pre-training fusion network obtained from the (x-1)-th round of video classification training, to obtain classification results; generating a first target loss value according to the classification results and the video type labels; and, in the case that the first target loss value does not satisfy a first preset convergence condition, adjusting the network parameters of the pre-training fusion network used in the x-th round and determining the adjusted pre-training fusion network as the pre-training fusion network to be used in the (x+1)-th round, and in the case that the first target loss value satisfies the first preset convergence condition, determining the pre-training fusion network used in the x-th round as the target pre-training fusion network, where X is a positive integer greater than or equal to 1, x is a positive integer greater than or equal to 1 and less than or equal to X, and when x takes the value 1, the pre-training fusion network used in the x-th round is the initial fusion network.
Optionally, in this embodiment, the pre-training phase, specifically, includes the following procedure:
The multi-modal semantic-enhanced embedding h_i of video v_i (a video sample labeled with a video type label) is input to a classifier to predict the probability distribution of its class label:

$\hat{y}_i = \operatorname{softmax}\!\left(W_c\, h_i + b_c\right)$    (17)

By constraining the learned category label probability distribution of v_i to be similar to its true label y_i, a supervised pre-training loss function L_pre is designed as follows:

$L_{pre} = -\sum_{i}\sum_{c} y_{i,c}\,\log \hat{y}_{i,c}$    (18)
The proposed MSGCN network is pre-trained by optimizing the loss function value according to a specific strategy such as stochastic gradient descent (SGD), momentum gradient descent (MGD), Nesterov Momentum, AdaGrad, RMSprop, Adam (Adaptive Moment Estimation) or batch gradient descent (BGD), until the loss function reaches a minimum or the number of training iterations reaches the specified maximum, at which point pre-training ends and the network parameters are frozen. Video recommendation is then taken as the main task to fine-tune the pre-trained MSGCN network parameters.
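A condensed PyTorch-style sketch of this pre-training loop is shown below; the `msgcn` and `classifier` interfaces are placeholders for the modules described above, and the choice of Adam and the freezing step follow the text only loosely.

```python
import torch


def pretrain(msgcn, classifier, A, H, labels, epochs=100, lr=1e-3):
    """Predict category labels from the multi-modal semantic-enhanced embeddings and
    minimize a cross-entropy loss in the spirit of formulas (17)-(18)."""
    params = list(msgcn.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=lr)             # any of SGD / Momentum / Adam etc. may be used
    for _ in range(epochs):
        opt.zero_grad()
        video_emb = msgcn(A, H)                       # multi-modal semantic-enhanced embeddings
        logits = classifier(video_emb)                # class-label scores (softmax inside the loss)
        loss = torch.nn.functional.cross_entropy(logits, labels)
        loss.backward()
        opt.step()
    for p in msgcn.parameters():                      # freeze the pre-trained parameters
        p.requires_grad_(False)
    return msgcn
```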
In one exemplary embodiment, the target fusion network may be obtained by, but is not limited to, performing a Y-round video recommendation training on the target pre-training fusion network by: executing a Y-th round of video recommendation training in the Y-round of video recommendation training on the target pre-training fusion network through the following steps: in the video recommendation training of the y-th round, a reference fusion network obtained by the pre-training of the y-1-th round is used for generating the S+1st predicted video of a video viewing sequence sample based on the first S videos of the video viewing sequence sample, wherein the video viewing sequence sample is a known video viewing sequence and is used for recording the playing sequence of videos which are played in the video set by a corresponding user, the video viewing sequence sample comprises W videos, S is a positive integer which is greater than or equal to 1 and less than or equal to W, W is a positive integer which is greater than or equal to 1, and the reference fusion network used by the y-th round is the target pre-training fusion network under the condition that y takes the value of 1; generating a second target loss value according to the S+1st predicted video and the S+1st real video of the video watching sequence sample; and under the condition that the second target loss value does not meet a second preset convergence condition, adjusting network parameters of a reference fusion network used by the y-th round of video recommendation training, determining the adjusted reference fusion network as the reference fusion network used by the y+1th round of video recommendation training, and under the condition that the second target loss value meets the second preset convergence condition, determining the reference fusion network obtained by the y-th round of pre-training as the target fusion network.
Optionally, in this embodiment, the fine-tuning stage specifically includes the following process:
Taking the video recommendation task as the main task, a negative log-likelihood function is used to define the loss function of the MSGCN network for the video recommendation task, as follows:
(19)
where the candidate videos are the videos in the video set that have not yet been played by the target user, and the loss is computed over the target user's video viewing sequence.
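The image for formula (19) is likewise missing from the extracted text. A plausible form, assuming a standard negative log-likelihood over next-video prediction along the target user's viewing sequence (notation illustrative):

```latex
\mathcal{L}_{\mathrm{rec}} = -\sum_{s=1}^{|\mathcal{S}_u|-1} \log p\left(v_{s+1} \mid v_1, \ldots, v_s\right) \qquad (19)
```

where S_u is the target user's video viewing sequence and p(·) is the viewing probability of formula (16), computed over the videos in the video set that the user has not yet played.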
The proposed pre-trained MSGCN network parameters are modified and updated to optimize the loss function value according to a specific strategy such as stochastic gradient descent (Stochastic Gradient Descent, SGD), momentum gradient descent (Momentum Gradient Descent, MGD), Nesterov momentum, AdaGrad, RMSprop, Adam (Adaptive Moment Estimation) or batch gradient descent (Batch Gradient Descent, BGD), until the loss function reaches a minimum or the number of training iterations reaches the specified maximum. The trained target fusion network thus makes the S+1st predicted video, recommended on the basis of the first S videos, consistent with the S+1st real video of the video viewing sequence sample.
In the technical solution provided in step S204, in the case of recommending a video for a target user in the user set, for example user A, the target user feature information A corresponding to the target user is obtained from the user feature information set, and the to-be-played video feature information corresponding to each to-be-played video is obtained from the video feature information set, so as to obtain a to-be-played video feature information set, where a to-be-played video is a video in the video set that has not been played by user A.
In the technical solution provided in step S206, according to the video viewing sequence of the user, i.e., the videos the user watched at each time step up to the current one, one video with the highest viewing probability is selected from the set of videos in the video set that the user has not yet watched, as the video the user is most likely to watch at the next time step. For such a candidate video, the probability that the user views it at the next time step is:
(16)
where the probability is computed from the target user feature information and the to-be-played video feature information of the candidate video, the latter being any element of the to-be-played video feature information set.
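The image for formula (16) is also not reproduced. A common choice consistent with the surrounding description is a softmax over inner products between the user feature vector and the candidate video feature vectors (notation illustrative, not the patent's own):

```latex
p\left(v \mid u, s+1\right) = \frac{\exp\left(\mathbf{e}_u^{\top} \mathbf{e}_v\right)}{\sum_{v' \in \mathcal{C}_u} \exp\left(\mathbf{e}_u^{\top} \mathbf{e}_{v'}\right)} \qquad (16)
```

where e_u is the target user feature information, e_v is the to-be-played video feature information of candidate v, and C_u is the set of videos in the video set that user u has not yet played.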
In order to better understand the process of recommending the video, the following description is given in connection with an alternative embodiment, which, however, does not limit the technical solution of the embodiments of the present application.
In this embodiment, a video recommendation method is provided, and fig. 7 is a schematic diagram of a video recommendation flow according to an embodiment of the present application, as shown in fig. 7, mainly including the following steps:
step S701: Video datasets are collected and collated. The video recommendation dataset Ω was collected and pre-processed, containing 40049 micro videos and 1935 different tags. It is divided into a training set Train, a validation set Valid and a test set Vtest in the proportion of 60% (i.e., 24029 videos), 20% (i.e., 8010 videos) and 20% (i.e., 8010 videos);
Step S702: Multimodal information preprocessing. The visual feature vector, audio feature vector and text feature vector of each video Vj are extracted according to the visual, audio and text feature extraction methods introduced in the multi-modal information preprocessing module;
step S703: video recommendation system graph modeling. Extracting entities and relations thereof from the video recommendation system, and constructing a heterogeneous information network of the video recommendation system;
step S704: and constructing a video multi-mode semantic graph. Designing a meta-path, extracting rich semantic relations among videos from a heterogeneous information network according to a BVMG algorithm, and constructing a video multi-mode semantic graph;
step S705: Multi-modal feature fusion for a single video. A multi-modal capsule network is designed to fuse the multi-modal features of a single video;
step S706: Multi-modal feature aggregation across different videos. A graph capsule neural network is designed to aggregate the multi-modal features of different videos;
step S707: Learning multi-modal semantic-enhanced user embeddings. A user vertex capsule network is designed to extract multi-modal semantic-enhanced user embeddings;
step S708: Constructing the network model and designing the network loss function. The multi-modal capsule network, the graph capsule neural network and the user vertex capsule network are constructed respectively to form the multi-modal semantic enhancement graph capsule neural network (MSGCN). The network loss function is designed according to formulas (17) to (19);
Step S709: The network model is initialized and trained. The parameters of each layer of the MSGCN network are initialized according to a specific strategy such as normal-distribution random initialization, Xavier initialization or He initialization. The network is then pre-trained and fine-tuned;
step S710: video recommendation. For each user u, from the set of videos that the user u has not watched, one video v with the highest viewing probability is calculated and selected according to the formula (16), and recommended to the user.
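A minimal sketch of step S710, assuming the user and video feature vectors have already been produced by the trained network (user_vecs, video_vecs and watched are illustrative names):

```python
import numpy as np

def recommend_next(user_vecs, video_vecs, watched):
    """For each user, pick the unwatched video with the highest viewing probability (cf. formula (16))."""
    recommendations = {}
    for u, e_u in user_vecs.items():
        candidates = [v for v in video_vecs if v not in watched[u]]
        if not candidates:
            continue
        scores = np.array([e_u @ video_vecs[v] for v in candidates])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                           # softmax over the unwatched candidates
        recommendations[u] = candidates[int(np.argmax(probs))]
    return recommendations
```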
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present application.
FIG. 8 is a block diagram of a video recommender in accordance with an embodiment of the present application; as shown in fig. 8, includes:
a first obtaining module 802, configured to obtain a set of video feature information, where the set of video feature information includes video feature information corresponding to each video in the set of videos, where the video feature information is used to characterize a multimodal fusion feature of the corresponding video and a relationship feature between the corresponding video and other videos in the set of videos, where the relationship feature includes features between videos in multiple video viewing dimensions, and the multimodal fusion feature includes features of the video itself in multiple modalities;
a determining module 804, configured to determine, in a case where a video is recommended to a target user in a user set, a target video to be recommended to the target user in the video set according to user feature information of the target user and the video feature information set;
and a recommending module 806, configured to recommend the target video to the target user.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Through the embodiment, when the video is required to be recommended to the target user in the user set, the video feature information set is obtained, the video feature information set includes video feature information corresponding to each video in the video set, wherein each video feature information can represent a multi-mode fusion feature of the corresponding video and a relationship feature between the corresponding video and other videos in the video set, the relationship feature includes features of the videos in multiple video watching dimensions, the multi-mode fusion feature includes features of the video itself in multiple modes, and then the target video to be recommended to the target user is determined from the video set according to the user feature information and the video feature information set of the target user, and is recommended to the target user. The target video recommended by the method refers to the multimodal fusion characteristics of the target video and the relation characteristics between the target video and other videos in the video set, so that the matching degree of the recommended target video and a target user is higher. By adopting the technical scheme, the problems of low matching degree between the recommended video and the user and the like in the related technology are solved, and the technical effect of improving the matching degree between the recommended video and the user is realized.
In an exemplary embodiment, the first acquisition module includes:
the extraction unit is used for extracting features of semantic edges from a video multi-modal semantic graph of the video set to serve as the relation features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for displaying the relation features between the videos in the video set in a form of video vertexes and the semantic edges, each video vertex represents one video, and each semantic edge represents one relation feature;
and the adding unit is used for adding the relation features to the fusion feature information set to obtain the video feature information set.
In an exemplary embodiment, the extraction unit is further configured to:
converting the video multi-mode semantic graph into a target video adjacency matrix, and obtaining the relation features according to the similarity degree of features between any two video vertexes in the video multi-mode semantic graph represented by the target video adjacency matrix in a plurality of video watching dimensions;
and fusing the characteristics of each video in the video set on a plurality of modes into fused characteristic information to obtain the fused characteristic information set, wherein the fused characteristic information is used for representing the multi-mode fused characteristics of the corresponding video.
In an exemplary embodiment, the extraction unit is further configured to:
acquiring a transition probability matrix corresponding to the video multi-mode semantic graph, wherein the transition probability matrix is used for indicating the transition probability of a random walker from one video vertex to each adjacent video vertex in the process of using a random walk construction algorithm to walk the video multi-mode semantic graph;
taking an ith video vertex in M video vertices in the video multi-mode semantic graph as a root vertex, taking the root vertex as a starting point of a random walk construction algorithm, expanding random walks with preset path length for P times according to a meta-path set, preset restart probability and the transition probability matrix to obtain Q long walk paths corresponding to the ith video vertex, wherein the Q long walk paths form a context of the ith video vertex, the meta-path set comprises meta-paths used for representing relation characteristics between each video and other videos in the video set, the restart probability is used for indicating the probability of each step of the random walk to jump back to the starting point in the process of each random walk, i is a positive integer greater than or equal to 1 and less than or equal to M, and Q is a positive integer less than or equal to P;
Sampling the size of a preset window for the Q long-distance walking paths in sequence to obtain an ith vertex pair list of the ith video vertex, wherein when each sampling is recorded in the ith vertex pair list, a pair of video vertices at two ends of the sampling are sampled, and the length of the size of the preset window is more than 2 and less than the length of the preset path;
and generating the target video adjacency matrix according to M vertex pair lists corresponding to the M video vertices.
In an exemplary embodiment, the extraction unit is further configured to:
randomly taking a meta-path which does not participate in the random walk from the meta-path set as a walk meta-path, taking the root vertex as a starting point of a random walk construction algorithm, and expanding the random walk with a preset path length for P times according to the walk meta-path, the preset restart probability and the transition probability matrix which are taken out from the meta-path set until all meta-paths in the meta-path set participate in the random walk, wherein the meta-paths in the meta-path set comprise: the video multi-mode semantic graph comprises a first meta-path, a second meta-path, a third meta-path and a fourth meta-path, wherein the first meta-path, the second meta-path, the third meta-path and the fourth meta-path sequentially represent the same type of relation, the same label relation, the same watching relation and the good friend watching relation in the video multi-mode semantic graph respectively, the same type of relation represents that 2 videos are the same, the same label relation represents that the labels of 2 videos are the same, the same watching relation represents that 2 videos are watched by the same user in the user set, and the friend watching relation represents that 2 videos are watched by 1 pair of friends in the user set.
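As a rough illustration of this walk-and-sample procedure (a sketch under assumed data structures: the neighbors callable stands in for the transition probability matrix restricted by the current meta-path, and uniform sampling replaces the actual transition probabilities):

```python
import random

def sample_vertex_pairs(neighbors, start, meta_paths, P, path_len, restart_p, window):
    """Run P meta-path-guided random walks with restart from the root vertex `start`,
    then slide a window over each walk and record the vertex pair at the window's two ends."""
    pairs = []
    for p in range(P):
        meta_path = meta_paths[p % len(meta_paths)]    # cycle so every meta-path takes part
        walk, cur = [start], start
        while len(walk) < path_len:
            if random.random() < restart_p:            # restart: jump back to the root vertex
                cur = start
            else:
                cands = neighbors(cur, meta_path)      # neighbors reachable under this meta-path
                if not cands:
                    break
                cur = random.choice(cands)             # uniform stand-in for transition probabilities
            walk.append(cur)
        for s in range(len(walk) - window):            # window sampling over the walk
            pairs.append((walk[s], walk[s + window]))  # pair of vertices at the window's two ends
    return pairs
```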
In an exemplary embodiment, the extraction unit is further configured to:
updating an initial context co-occurrence matrix according to M vertex pair lists to obtain a target context co-occurrence matrix, wherein the element o_mq in the m-th row and q-th column of the target context co-occurrence matrix represents the number of times that an m-th video and a q-th video co-occur in the same context, the m-th video is the video corresponding to the m-th video vertex, the q-th video is the video corresponding to the q-th video vertex, the target context co-occurrence matrix is a symmetric square matrix of order M, and m and q are positive integers which are greater than or equal to 1 and less than or equal to M;
and generating the target video adjacency matrix according to the target context co-occurrence matrix.
In an exemplary embodiment, the extraction unit is further configured to:
obtaining, from the M vertex pair lists, the number of vertex pairs consisting of the r-th video vertex and the t-th video vertex, wherein r and t are positive integers which are greater than or equal to 1 and less than or equal to M, and r is not equal to t;
increasing the values of the elements o_rt and o_tr in the initial context co-occurrence matrix by this number of vertex pairs respectively, to obtain the target context co-occurrence matrix, wherein all elements of the initial context co-occurrence matrix are initially 0.
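A minimal sketch of this update, assuming the M vertex pair lists produced above use 0-based vertex indices (pair_lists is an illustrative name):

```python
import numpy as np

def build_cooccurrence(pair_lists, M):
    """Accumulate the symmetric M x M context co-occurrence matrix from the M vertex pair lists."""
    O = np.zeros((M, M), dtype=np.int64)   # initial co-occurrence matrix: all elements 0
    for pairs in pair_lists:               # one vertex pair list per root video vertex
        for r, t in pairs:
            if r != t:
                O[r, t] += 1               # o_rt and o_tr each grow by the number of (r, t) pairs
                O[t, r] += 1
    return O
```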
In an exemplary embodiment, the extraction unit is further configured to:
the c fusion characteristic information of a c-th video in M videos included in the video set is obtained through the following steps, wherein c is a positive integer greater than or equal to 1 and less than or equal to M:
extracting a visual feature vector of the c-th video, wherein the visual feature vector is used for representing the features of the c-th video under the self visual mode;
extracting an audio feature vector of the c-th video, wherein the audio feature vector is used for representing the feature of the c-th video under the audio mode of the c-th video;
extracting a text feature vector of the c-th video, wherein the text feature vector is used for representing the feature of the c-th video under the own text mode;
and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information.
In an exemplary embodiment, the extraction unit is further configured to:
sampling the c-th video by adopting a first preset time interval sampling mode to obtain a number of frame pictures corresponding to the c-th video;
inputting each of the frame pictures into an image feature extraction model to obtain the picture feature vectors output by the image feature extraction model;
generating the visual feature vector of the c-th video according to the picture feature vectors.
In an exemplary embodiment, the extraction unit is further configured to:
extracting audio mode data of the c-th video;
dividing the audio modality data, based on the time dimension, into a number of audio segments according to a second preset time interval;
inputting each of the audio segments into an audio feature extraction model to obtain the audio segment feature vectors;
generating the audio feature vector of the c-th video according to the audio segment feature vectors.
In an exemplary embodiment, the extraction unit is further configured to:
extracting a number of video texts from the text associated with the c-th video;
inputting each of the video texts into a text feature extraction model to obtain the text segment feature vectors;
generating the text feature vector of the c-th video according to the text segment feature vectors.
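As an illustration of this three-branch extraction (a sketch only: image_model, audio_model and text_model stand in for whatever image, audio and text feature extraction models are used, and the mean-pooling aggregation is an assumption):

```python
import numpy as np

def extract_video_features(frames, audio_segments, texts, image_model, audio_model, text_model):
    """Produce the visual, audio and text feature vectors of one video from its sampled
    frame pictures, audio segments and associated video texts."""
    frame_vecs = [image_model(f) for f in frames]           # one picture feature vector per frame
    audio_vecs = [audio_model(a) for a in audio_segments]   # one feature vector per audio segment
    text_vecs = [text_model(t) for t in texts]              # one feature vector per video text
    visual = np.mean(frame_vecs, axis=0)                    # aggregate per-frame vectors
    audio = np.mean(audio_vecs, axis=0)                     # aggregate per-segment vectors
    text = np.mean(text_vecs, axis=0)                       # aggregate per-text vectors
    return visual, audio, text
```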
In an exemplary embodiment, the extraction unit is further configured to:
the method comprises the steps that a weight parameter of a capsule network fusion model is subjected to D round adjustment by using a visual feature vector, an audio feature vector and a text feature vector corresponding to a c-th video to obtain a target capsule network fusion model, wherein the weight parameter comprises a visual weight parameter, an audio weight parameter and a text weight parameter, the visual weight parameter is used for indicating the weight of the feature of the video in the visual mode of the video in the process of fusion of the feature by the capsule network fusion model, the audio weight parameter is used for indicating the weight of the feature of the video in the audio mode of the video in the process of fusion of the feature by the capsule network fusion model, the text weight parameter is used for indicating the weight of the feature of the video in the text mode of the video in the process of fusion of the feature by the capsule network fusion model, the capsule network fusion model used by the d-th round is the capsule network fusion model obtained after the d-1-th round of adjustment of the weight parameters, D is a preset positive integer greater than or equal to 1, d is a positive integer less than or equal to D, and under the condition that d takes the value of 1, the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not been adjusted;
And fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector.
In an exemplary embodiment, the extraction unit is further configured to:
the method comprises the following steps of carrying out d-th round adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video:
determining a capsule network fusion model obtained after the d-1 th round of adjustment of the weight parameters is completed as a capsule network fusion model used for the d round of fusion;
inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the d-th wheel fusion used capsule network fusion model to obtain a reference fusion feature vector output by the d-th wheel fusion used capsule network fusion model;
and adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the d+1-th round of fusion.
In an exemplary embodiment, the extraction unit is further configured to:
the capsule network fusion model for the d-th round fusion uses the following steps to fuse the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a reference fusion feature vector output by the capsule network fusion model for the d-th round fusion:
respectively carrying out linear transformation on the visual feature vector, the audio feature vector and the text feature vector to obtain corresponding linear visual feature vector, linear audio feature vector and linear text feature vector;
performing accumulation and calculation on a weighted visual feature vector, a weighted audio feature vector and a weighted text feature vector to obtain a weighted fusion feature vector, wherein the weighted visual feature vector is obtained by performing outer product operation on a visual weight parameter and a linear visual feature vector of a capsule network fusion model used by the d-th round fusion, the weighted audio feature vector is obtained by performing outer product operation on an audio weight parameter and a linear audio feature vector of a capsule network fusion model used by the d-th round fusion, and the weighted text feature vector is obtained by performing outer product operation on a text weight parameter and a linear text feature vector of a capsule network fusion model used by the d-th round fusion;
And converting the weighted fusion feature vector into the reference fusion feature vector by adopting a nonlinear activation function.
In an exemplary embodiment, the extraction unit is further configured to:
obtaining adjustment parameters, wherein the adjustment parameters comprise visual adjustment parameters, audio adjustment parameters and text adjustment parameters, the visual adjustment parameters are obtained by performing outer product operation on the linear visual feature vector and the weighted fusion feature vector, the audio adjustment parameters are obtained by performing outer product operation on the linear audio feature vector and the weighted fusion feature vector, and the text adjustment parameters are obtained by performing outer product operation on the linear text feature vector and the weighted fusion feature vector;
and performing addition operation on the visual weight parameters of the capsule network fusion model used in the d-th round of fusion and the visual adjustment parameters to obtain the visual weight parameters of the capsule network fusion model used in the d+1-th round of fusion, performing addition operation on the audio weight parameters of the capsule network fusion model used in the d-th round of fusion and the audio adjustment parameters to obtain the audio weight parameters of the capsule network fusion model used in the d+1-th round of fusion, and performing addition operation on the text weight parameters of the capsule network fusion model used in the d-th round of fusion and the text adjustment parameters to obtain the text weight parameters of the capsule network fusion model used in the d+1-th round of fusion.
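A compact sketch of this D-round fusion. It follows the standard dynamic-routing pattern, so the softmax normalization of the modality weights and the squash nonlinearity are assumptions (the patent only specifies linear transforms, weighted accumulation, a nonlinear activation, and agreement-based weight updates):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Capsule-style nonlinearity: keeps direction, maps the norm into (0, 1)."""
    n2 = float(np.sum(s * s))
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def fuse_modalities(x_vis, x_aud, x_txt, W_vis, W_aud, W_txt, rounds=3):
    """Fuse visual/audio/text feature vectors of one video into a target fusion feature vector."""
    u = np.stack([W_vis @ x_vis, W_aud @ x_aud, W_txt @ x_txt])  # linear transforms of each modality
    b = np.zeros(3)                                              # visual/audio/text weight parameters
    for _ in range(rounds):                                      # D rounds of adjustment
        c = np.exp(b) / np.exp(b).sum()                          # normalized modality weights
        s = (c[:, None] * u).sum(axis=0)                         # weighted accumulation
        v = squash(s)                                            # nonlinear activation
        b = b + u @ v                                            # agreement-based weight update
    return v                                                     # c-th target fusion feature vector
```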
In one exemplary embodiment, the determining module includes:
a first determining unit, configured to determine a similarity between each piece of video feature information in the video feature information set and the user feature information;
and the second determining unit is used for determining the video corresponding to the video characteristic information with the similarity larger than the target similarity threshold as the target video.
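A minimal sketch of this selection step, assuming cosine similarity (the patent does not fix the similarity measure) and an illustrative threshold value:

```python
import numpy as np

def select_target_videos(user_vec, video_feats, threshold=0.8):
    """Return the videos whose feature information is sufficiently similar to the user feature information."""
    targets = []
    for vid, vec in video_feats.items():
        sim = (vec @ user_vec) / (np.linalg.norm(vec) * np.linalg.norm(user_vec) + 1e-9)
        if sim > threshold:             # keep videos above the target similarity threshold
            targets.append(vid)
    return targets
```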
In an exemplary embodiment, the apparatus further comprises:
the second obtaining module is used for obtaining nth user characteristic information in the user characteristic information set before the target video to be recommended to the target user is determined in the video set according to the user characteristic information of the target user and the video characteristic information set, wherein the nth user characteristic information is used for indicating the characteristics of videos preferred by the nth user in the user set;
acquiring an nth video viewing sequence corresponding to the nth user, wherein the video viewing sequence records the playing sequence of the videos that have been played by the corresponding user, and the user set comprises N users, wherein n is a positive integer which is greater than or equal to 1 and less than or equal to N;
Acquiring the video characteristic information corresponding to each video in the nth video watching sequence from the video characteristic information set to obtain an nth reference video characteristic information set;
and merging all the reference video feature information in the nth reference video feature information set into a feature vector to obtain the nth user feature information.
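A simple sketch of this merge; the mean-pooling aggregation is an assumption, since the patent only states that the reference video feature information is merged into one feature vector:

```python
import numpy as np

def build_user_feature(viewing_sequence, video_feature_set):
    """Merge the feature vectors of the videos in the n-th user's viewing sequence into one vector."""
    ref_vecs = [video_feature_set[v] for v in viewing_sequence]  # n-th reference video feature set
    return np.mean(ref_vecs, axis=0)                             # n-th user feature information
```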
In an exemplary embodiment, the adding unit is further configured to:
inputting a target video adjacency matrix and the fusion characteristic information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relationship characteristics among video vertices on video types, video labels and watching users and the association degree among videos connected by each relationship characteristic, and the target fusion network is used for updating each fusion characteristic information in the input fusion characteristic information set into corresponding video characteristic information according to the relationship characteristics represented by the target video adjacency matrix to obtain the video characteristic information set.
In one exemplary embodiment, the target fusion network includes:
The input layer and L graph capsule convolution layers, wherein the first graph capsule convolution layer of the L graph capsule convolution layers comprises basic video vertex capsules, the remaining graph capsule convolution layers of the L graph capsule convolution layers comprise advanced video vertex capsules, and the L-th graph capsule convolution layer further comprises final video vertex capsules; the basic video vertex capsules are used for executing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix to obtain the convolution fusion feature vector set output by the first graph capsule convolution layer; the advanced video vertex capsules in each graph capsule convolution layer are used for executing a second convolution operation on the received convolution fusion feature vectors according to the target video adjacency matrix to obtain the convolution fusion feature vectors input to the advanced video vertex capsules in the next graph capsule convolution layer; and the final video vertex capsules are used for executing a third convolution operation on the convolution fusion feature vectors output by the advanced video vertex capsules in the L-th graph capsule convolution layer, wherein the third convolution operation is used for aggregating the convolution fusion feature vectors and outputting the video feature information set.
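A highly simplified sketch of this layered aggregation, using plain adjacency-matrix propagation with generic weight matrices; it illustrates only the layer structure and omits the capsule routing details, so it is not the patent's exact operations:

```python
import numpy as np

def graph_capsule_forward(A, X, weights):
    """Propagate fusion feature vectors through L graph capsule convolution layers.

    A: (M, M) target video adjacency matrix; X: (M, d) fusion feature information set;
    weights: list of L per-layer weight matrices (stand-ins for the vertex capsules)."""
    H = X
    for W in weights:
        H = np.tanh(A @ H @ W)   # aggregate neighbor features according to the adjacency matrix
    return H                     # output of the final layer: the video feature information set
```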
In an exemplary embodiment, the apparatus further comprises:
the third acquisition module is used for acquiring an initial fusion network before the target video adjacency matrix and the fusion characteristic information set are input into a target fusion network to obtain the video characteristic information set output by the target fusion network;
the first training module is used for performing X-round video classification training on the initial fusion network to obtain a target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, and the accuracy rate of the target pre-training fusion network on video classification is greater than the target accuracy rate;
and the second training module is used for executing Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network, wherein Y is a positive integer greater than or equal to 1.
In one exemplary embodiment, the first training module includes:
the first training unit is used for executing an x-th round of video classification training in the X-round of video classification training on the initial fusion network through the following steps:
in the x-th round of video classification training, classifying the video samples marked with the video type labels by using a pre-training fusion network obtained by the x-1-th round of video classification training to obtain classification results;
Generating a first target loss value according to the classification result and the video type label;
and under the condition that the first target loss value does not meet a first preset convergence condition, adjusting network parameters of a pre-training fusion network used by the x-th round, determining the adjusted pre-training fusion network as a pre-training fusion network used by an x+1th round, and under the condition that the first target loss value meets the first preset convergence condition, determining the pre-training fusion network used by the x-th round as the target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, x is a positive integer greater than or equal to 1 and less than or equal to X, and under the condition that x takes the value of 1, the pre-training fusion network used by the x-th round is the initial fusion network.
In one exemplary embodiment, the second training module includes:
the second training unit is used for executing a y-th round of video recommendation training in the Y-round of video recommendation training on the target pre-training fusion network through the following steps:
in the video recommendation training of the y-th round, a reference fusion network obtained by the y-1-th round of video recommendation training is used for generating the S+1st predicted video of a video viewing sequence sample based on the first S videos of the video viewing sequence sample, wherein the video viewing sequence sample is a known video viewing sequence and is used for recording the playing sequence of videos which are played in the video set by a corresponding user, the video viewing sequence sample comprises W videos, S is a positive integer which is greater than or equal to 1 and less than or equal to W, W is a positive integer which is greater than or equal to 1, and the reference fusion network used by the y-th round is the target pre-training fusion network under the condition that y takes the value of 1;
Generating a second target loss value according to the S+1st predicted video and the S+1st real video of the video watching sequence sample;
and under the condition that the second target loss value does not meet a second preset convergence condition, adjusting network parameters of a reference fusion network used by the y-th round of video recommendation training, determining the adjusted reference fusion network as the reference fusion network used by the y+1th round of video recommendation training, and under the condition that the second target loss value meets the second preset convergence condition, determining the reference fusion network obtained by the y-th round of video recommendation training as the target fusion network.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.
Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and the exemplary implementation, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.

Claims (25)

1. A method for recommending video, comprising:
acquiring a video feature information set, wherein the video feature information set comprises video feature information corresponding to each video in the video set, the video feature information is used for representing multi-modal fusion features of the corresponding video and relationship features between the corresponding video and other videos in the video set, the relationship features comprise features of the videos in multiple video watching dimensions, and the multi-modal fusion features comprise features of the video itself in multiple modalities;
under the condition that video is recommended to a target user in a user set, determining a target video to be recommended to the target user in the video set according to user characteristic information of the target user and the video characteristic information set;
Recommending the target video to the target user.
2. The method of claim 1, wherein the acquiring a set of video feature information comprises:
extracting features of semantic edges from a video multi-modal semantic graph of the video set as the relationship features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set, wherein the video multi-modal semantic graph is used for displaying the relationship features between videos in the video set in the form of video vertices and the semantic edges, each video vertex represents one video, and each semantic edge represents one relationship feature;
and adding the relation features to the fusion feature information set to obtain the video feature information set.
3. The method according to claim 2, wherein the extracting features of semantic edges from the multi-modal semantic graphs of videos in the video set as the relational features, and acquiring the multi-modal fusion features of each video in the video set to obtain a fusion feature information set includes:
converting the video multi-mode semantic graph into a target video adjacency matrix, and obtaining the relation features according to the similarity degree of features between any two video vertexes in the video multi-mode semantic graph represented by the target video adjacency matrix in a plurality of video watching dimensions;
And fusing the characteristics of each video in the video set on a plurality of modes into fused characteristic information to obtain the fused characteristic information set, wherein the fused characteristic information is used for representing the multi-mode fused characteristics of the corresponding video.
4. The method of claim 3, wherein said converting the video multimodal semantic graph into a target video adjacency matrix comprises:
acquiring a transition probability matrix corresponding to the video multi-mode semantic graph, wherein the transition probability matrix is used for indicating the transition probability of a random walker from one video vertex to each adjacent video vertex in the process of using a random walk construction algorithm to walk the video multi-mode semantic graph;
taking an ith video vertex in M video vertices in the video multi-mode semantic graph as a root vertex, taking the root vertex as a starting point of a random walk construction algorithm, expanding random walks with preset path length for P times according to a meta-path set, preset restart probability and the transition probability matrix to obtain Q long walk paths corresponding to the ith video vertex, wherein the Q long walk paths form a context of the ith video vertex, the meta-path set comprises meta-paths used for representing relation characteristics between each video and other videos in the video set, the restart probability is used for indicating the probability of each step of the random walk to jump back to the starting point in the process of each random walk, i is a positive integer greater than or equal to 1 and less than or equal to M, and Q is a positive integer less than or equal to P;
Sampling the size of a preset window for the Q long-distance walking paths in sequence to obtain an ith vertex pair list of the ith video vertex, wherein when each sampling is recorded in the ith vertex pair list, a pair of video vertices at two ends of the sampling are sampled, and the length of the size of the preset window is more than 2 and less than the length of the preset path;
and generating the target video adjacency matrix according to M vertex pair lists corresponding to the M video vertices.
5. The method of claim 4, wherein the expanding the random walk of the preset path length P times according to the set of meta-paths, the preset restart probability, and the transition probability matrix using the root vertices as a starting point of the random walk construction algorithm comprises:
randomly taking a meta-path which does not participate in the random walk from the meta-path set as a walk meta-path, taking the root vertex as a starting point of a random walk construction algorithm, and expanding the random walk with a preset path length for P times according to the walk meta-path, the preset restart probability and the transition probability matrix which are taken out from the meta-path set until all meta-paths in the meta-path set participate in the random walk, wherein the meta-paths in the meta-path set comprise: the video multi-mode semantic graph comprises a first meta-path, a second meta-path, a third meta-path and a fourth meta-path, wherein the first meta-path, the second meta-path, the third meta-path and the fourth meta-path sequentially represent the same type of relation, the same label relation, the same watching relation and the good friend watching relation in the video multi-mode semantic graph respectively, the same type of relation represents that 2 videos are the same, the same label relation represents that the labels of 2 videos are the same, the same watching relation represents that 2 videos are watched by the same user in the user set, and the friend watching relation represents that 2 videos are watched by 1 pair of friends in the user set.
6. The method of claim 4, wherein generating the target video adjacency matrix from M vertex pair lists corresponding to M video vertices comprises:
updating an initial context co-occurrence matrix according to M vertex pair lists to obtain a target context co-occurrence matrix, wherein the element o_mq in the m-th row and q-th column of the target context co-occurrence matrix represents the number of times that an m-th video and a q-th video co-occur in the same context, the m-th video is the video corresponding to the m-th video vertex, the q-th video is the video corresponding to the q-th video vertex, the target context co-occurrence matrix is a symmetric square matrix of order M, and m and q are positive integers which are greater than or equal to 1 and less than or equal to M;
and generating the target video adjacency matrix according to the target context co-occurrence matrix.
7. The method of claim 6, wherein updating the initial context co-occurrence matrix from the M vertex pair list to obtain the target context co-occurrence matrix comprises:
obtaining, from the M vertex pair lists, the number of vertex pairs consisting of the r-th video vertex and the t-th video vertex, wherein r and t are positive integers which are greater than or equal to 1 and less than or equal to M, and r is not equal to t;
increasing the values of the elements o_rt and o_tr in the initial context co-occurrence matrix by this number of vertex pairs respectively, to obtain the target context co-occurrence matrix, wherein all elements of the initial context co-occurrence matrix are initially 0.
8. A method according to claim 3, wherein the fusing features of each video of the video set itself over multiple modalities into fused feature information comprises:
the c fusion characteristic information of a c-th video in M videos included in the video set is obtained through the following steps, wherein c is a positive integer greater than or equal to 1 and less than or equal to M:
extracting a visual feature vector of the c-th video, wherein the visual feature vector is used for representing the features of the c-th video under the self visual mode;
extracting an audio feature vector of the c-th video, wherein the audio feature vector is used for representing the feature of the c-th video under the audio mode of the c-th video;
extracting a text feature vector of the c-th video, wherein the text feature vector is used for representing the feature of the c-th video under the own text mode;
and fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into a c-th target fusion feature vector, and taking the c-th target fusion feature vector as the c-th fusion feature information.
9. The method of claim 8, wherein the extracting the visual feature vector of the c-th video comprises:
sampling the c-th video by adopting a first preset time interval sampling mode to obtain a number of frame pictures corresponding to the c-th video;
inputting each of the frame pictures into an image feature extraction model to obtain the picture feature vectors output by the image feature extraction model;
generating the visual feature vector of the c-th video according to the picture feature vectors.
10. The method of claim 8, wherein the extracting the audio feature vector of the c-th video comprises:
extracting audio mode data of the c-th video;
dividing the audio modality data, based on the time dimension, into a number of audio segments according to a second preset time interval;
inputting each of the audio segments into an audio feature extraction model to obtain the audio segment feature vectors;
generating the audio feature vector of the c-th video according to the audio segment feature vectors.
11. The method of claim 8, wherein the extracting the text feature vector of the c-th video comprises:
extracting a number of video texts from the text associated with the c-th video;
inputting each of the video texts into a text feature extraction model to obtain the text segment feature vectors;
generating the text feature vector of the c-th video according to the text segment feature vectors.
12. The method of claim 8, wherein the fusing the visual feature vector, the audio feature vector, and the text feature vector for the c-th video to a c-th target fusion feature vector comprises:
the method comprises the steps that a weight parameter of a capsule network fusion model is subjected to D round adjustment by using a visual feature vector, an audio feature vector and a text feature vector corresponding to a c-th video to obtain a target capsule network fusion model, wherein the weight parameter comprises a visual weight parameter, an audio weight parameter and a text weight parameter, the visual weight parameter is used for indicating the weight of the feature of the video in the visual mode of the video in the process of fusion of the feature by the capsule network fusion model, the audio weight parameter is used for indicating the weight of the feature of the video in the audio mode of the video in the process of fusion of the feature by the capsule network fusion model, the text weight parameter is used for indicating the weight of the feature of the video in the text mode of the video in the process of fusion of the feature by the capsule network fusion model, the capsule network fusion model used by the d-th round is the capsule network fusion model obtained after the d-1-th round of adjustment of the weight parameters, D is a preset positive integer greater than or equal to 1, d is a positive integer less than or equal to D, and under the condition that d takes the value of 1, the capsule network fusion model used is the initial capsule network fusion model whose weight parameters have not been adjusted;
And fusing the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video by using the target capsule network fusion model to obtain the c-th target fusion feature vector.
13. The method of claim 12, wherein the D-wheel adjustment of the weight parameters of the capsule network fusion model using the visual feature vector, the audio feature vector, and the text feature vector corresponding to the c-th video comprises:
the method comprises the following steps of carrying out d-th round adjustment on the weight parameters of a capsule network fusion model by using the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video:
determining a capsule network fusion model obtained after the d-1 th round of adjustment of the weight parameters is completed as a capsule network fusion model used for the d round of fusion;
inputting the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video into the d-th wheel fusion used capsule network fusion model to obtain a reference fusion feature vector output by the d-th wheel fusion used capsule network fusion model;
And adjusting the weight parameters of the capsule network fusion model used in the d-th round of fusion by using the reference fusion feature vector to obtain the capsule network fusion model to be used in the d+1-th round of fusion.
14. The method of claim 13, wherein inputting the visual feature vector, the audio feature vector, and the text feature vector corresponding to the c-th video to the d-th round of fusion used capsule network fusion model to obtain a reference fusion feature vector output by the d-th round of fusion used capsule network fusion model comprises:
the capsule network fusion model for the d-th round fusion uses the following steps to fuse the visual feature vector, the audio feature vector and the text feature vector corresponding to the c-th video to obtain a reference fusion feature vector output by the capsule network fusion model for the d-th round fusion:
respectively carrying out linear transformation on the visual feature vector, the audio feature vector and the text feature vector to obtain corresponding linear visual feature vector, linear audio feature vector and linear text feature vector;
performing accumulation and calculation on a weighted visual feature vector, a weighted audio feature vector and a weighted text feature vector to obtain a weighted fusion feature vector, wherein the weighted visual feature vector is obtained by performing outer product operation on a visual weight parameter and a linear visual feature vector of a capsule network fusion model used by the d-th round fusion, the weighted audio feature vector is obtained by performing outer product operation on an audio weight parameter and a linear audio feature vector of a capsule network fusion model used by the d-th round fusion, and the weighted text feature vector is obtained by performing outer product operation on a text weight parameter and a linear text feature vector of a capsule network fusion model used by the d-th round fusion;
And converting the weighted fusion feature vector into the reference fusion feature vector by adopting a nonlinear activation function.
15. The method of claim 14, wherein the adjusting the weight parameter of the capsule network fusion model for the d-th round of fusion using the reference fusion feature vector to obtain the capsule network fusion model for the d+1-th round of fusion comprises:
obtaining adjustment parameters, wherein the adjustment parameters comprise visual adjustment parameters, audio adjustment parameters and text adjustment parameters, the visual adjustment parameters are obtained by performing outer product operation on the linear visual feature vector and the weighted fusion feature vector, the audio adjustment parameters are obtained by performing outer product operation on the linear audio feature vector and the weighted fusion feature vector, and the text adjustment parameters are obtained by performing outer product operation on the linear text feature vector and the weighted fusion feature vector;
and performing addition operation on the visual weight parameters of the capsule network fusion model used in the d-th round of fusion and the visual adjustment parameters to obtain the visual weight parameters of the capsule network fusion model used in the d+1-th round of fusion, performing addition operation on the audio weight parameters of the capsule network fusion model used in the d-th round of fusion and the audio adjustment parameters to obtain the audio weight parameters of the capsule network fusion model used in the d+1-th round of fusion, and performing addition operation on the text weight parameters of the capsule network fusion model used in the d-th round of fusion and the text adjustment parameters to obtain the text weight parameters of the capsule network fusion model used in the d+1-th round of fusion.
16. The method of claim 1, wherein the determining a target video in the video set to be recommended to the target user based on the user characteristic information of the target user and the video characteristic information set comprises:
determining the similarity between each piece of video characteristic information in the video characteristic information set and the user characteristic information;
and determining the video corresponding to the video characteristic information with the similarity larger than the target similarity threshold as the target video.
17. The method of claim 1, wherein prior to the determining a target video in the video set to be recommended to the target user based on the user characteristic information of the target user and the video characteristic information set, the method further comprises:
acquiring nth user characteristic information in the user characteristic information set, wherein the nth user characteristic information is used for indicating the characteristics of videos preferred by the nth user in the user set;
acquiring an nth video viewing sequence corresponding to the nth user, wherein the video viewing sequence records the playing sequence of the videos that have been played by the corresponding user, and the user set comprises N users, wherein n is a positive integer which is greater than or equal to 1 and less than or equal to N;
Acquiring the video characteristic information corresponding to each video in the nth video watching sequence from the video characteristic information set to obtain an nth reference video characteristic information set;
and merging all the reference video feature information in the nth reference video feature information set into a feature vector to obtain the nth user feature information.
18. The method of claim 2, wherein adding the relationship feature to the set of fusion feature information results in the set of video feature information, comprising:
inputting a target video adjacency matrix and the fusion characteristic information set into a target fusion network to obtain the video characteristic information set output by the target fusion network, wherein the target video adjacency matrix is used for representing the relationship characteristics among video vertices on video types, video labels and watching users and the association degree among videos connected by each relationship characteristic, and the target fusion network is used for updating each fusion characteristic information in the input fusion characteristic information set into corresponding video characteristic information according to the relationship characteristics represented by the target video adjacency matrix to obtain the video characteristic information set.
19. The method of claim 18, wherein the target fusion network comprises:
the input layer and L graph capsule convolution layers, wherein the first graph capsule convolution layer of the L graph capsule convolution layers comprises basic video vertex capsules, the remaining graph capsule convolution layers of the L graph capsule convolution layers comprise advanced video vertex capsules, and the L-th graph capsule convolution layer further comprises final video vertex capsules; the basic video vertex capsules are used for executing a first convolution operation on each piece of fusion feature information in the fusion feature information set according to the target video adjacency matrix to obtain the convolution fusion feature vector set output by the first graph capsule convolution layer; the advanced video vertex capsules in each graph capsule convolution layer are used for executing a second convolution operation on the received convolution fusion feature vectors according to the target video adjacency matrix to obtain the convolution fusion feature vectors input to the advanced video vertex capsules in the next graph capsule convolution layer; and the final video vertex capsules are used for executing a third convolution operation on the convolution fusion feature vectors output by the advanced video vertex capsules in the L-th graph capsule convolution layer, wherein the third convolution operation is used for aggregating the convolution fusion feature vectors and outputting the video feature information set.
20. The method of claim 18, wherein prior to said inputting the target video adjacency matrix and the set of fusion feature information into a target fusion network to obtain the set of video feature information output by the target fusion network, the method further comprises:
acquiring an initial fusion network;
performing X-round video classification training on the initial fusion network to obtain a target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, and the accuracy of the target pre-training fusion network on video classification is greater than the target accuracy;
and performing Y-round video recommendation training on the target pre-training fusion network to obtain the target fusion network, wherein Y is a positive integer greater than or equal to 1.
21. The method of claim 20, wherein performing the X rounds of video classification training on the initial fusion network to obtain the target pre-training fusion network comprises:
executing an x-th round of video classification training in the X rounds of video classification training on the initial fusion network through the following steps:
in the x-th round of video classification training, classifying the video samples marked with video type labels by using the pre-training fusion network obtained by the (x-1)-th round of video classification training to obtain a classification result;
generating a first target loss value according to the classification result and the video type labels;
and under the condition that the first target loss value does not meet a first preset convergence condition, adjusting network parameters of the pre-training fusion network used by the x-th round, and determining the adjusted pre-training fusion network as the pre-training fusion network used by the (x+1)-th round; and under the condition that the first target loss value meets the first preset convergence condition, determining the pre-training fusion network used by the x-th round as the target pre-training fusion network, wherein X is a positive integer greater than or equal to 1, x is a positive integer greater than or equal to 1 and less than or equal to X, and, in the case that x takes the value of 1, the pre-training fusion network used by the x-th round is the initial fusion network.
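One such classification round can be sketched as follows, assuming a PyTorch fusion network whose output doubles as class logits (e.g. the earlier sketch with out_dim set to the number of video types), cross-entropy as the first target loss, and a simple loss threshold as the first preset convergence condition; none of these choices is fixed by the claim:

```python
import torch
import torch.nn.functional as F

def run_classification_round(network, adjacency, fusion_features, type_labels,
                             optimizer, loss_threshold=0.1):
    """One round of video classification pre-training (sketch)."""
    logits = network(adjacency, fusion_features)        # classification result per video sample
    loss = F.cross_entropy(logits, type_labels)         # first target loss value
    if loss.item() < loss_threshold:                    # first preset convergence condition
        return network, True                            # -> target pre-training fusion network
    optimizer.zero_grad()
    loss.backward()                                     # adjust the network parameters
    optimizer.step()
    return network, False                               # adjusted network, used by round x+1
```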
22. The method of claim 20, wherein performing the Y rounds of video recommendation training on the target pre-training fusion network to obtain the target fusion network comprises:
executing a y-th round of video recommendation training in the Y rounds of video recommendation training on the target pre-training fusion network through the following steps:
in the y-th round of video recommendation training, generating the (S+1)-th predicted video of a video watching sequence sample based on the first S videos of the video watching sequence sample by using the reference fusion network obtained by the (y-1)-th round of video recommendation training, wherein the video watching sequence sample is a known video watching sequence used for recording the playing order in which a corresponding user has played videos in the video set, the video watching sequence sample comprises W videos, S is a positive integer greater than or equal to 1 and less than or equal to W, W is a positive integer greater than or equal to 1, and, in the case that y takes the value of 1, the reference fusion network used by the y-th round is the target pre-training fusion network;
generating a second target loss value according to the (S+1)-th predicted video and the (S+1)-th real video of the video watching sequence sample;
and under the condition that the second target loss value does not meet a second preset convergence condition, adjusting network parameters of the reference fusion network used by the y-th round of video recommendation training, and determining the adjusted reference fusion network as the reference fusion network used by the (y+1)-th round of video recommendation training; and under the condition that the second target loss value meets the second preset convergence condition, determining the reference fusion network used by the y-th round of video recommendation training as the target fusion network.
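A matching sketch of one recommendation round, predicting the (S+1)-th video from the first S watched videos; the dot-product scorer and the cross-entropy loss over the video set are assumptions, since the claim does not specify the predictor or the loss:

```python
import torch
import torch.nn.functional as F

def run_recommendation_round(network, adjacency, fusion_features, watch_sequence,
                             s, optimizer, loss_threshold=0.1):
    """One round of video recommendation training on one watching-sequence sample (sketch)."""
    video_features = network(adjacency, fusion_features)              # per-video feature vectors
    user_vec = video_features[watch_sequence[:s]].mean(dim=0)         # profile from the first S videos
    scores = video_features @ user_vec                                # score every candidate video
    target = torch.tensor(watch_sequence[s])                          # the (S+1)-th real video
    loss = F.cross_entropy(scores.unsqueeze(0), target.unsqueeze(0))  # second target loss value
    if loss.item() < loss_threshold:                                  # second preset convergence condition
        return network, True                                          # -> target fusion network
    optimizer.zero_grad()
    loss.backward()                                                   # adjust the network parameters
    optimizer.step()
    return network, False
```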
23. A video recommendation device, comprising:
a first acquisition module, configured to acquire a video feature information set, wherein the video feature information set comprises video feature information corresponding to each video in the video set, the video feature information is used for representing multi-modal fusion features of the corresponding video and relationship features between the corresponding video and other videos in the video set, the relationship features comprise features of the videos in multiple video watching dimensions, and the multi-modal fusion features comprise features of the videos in multiple modalities;
a determining module, configured to determine, in the case of recommending a video to a target user in a user set, a target video to be recommended to the target user in the video set according to user feature information of the target user and the video feature information set;
and a recommending module, configured to recommend the target video to the target user.
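For orientation only, the three modules map onto a small class as follows; the dot-product scoring inside the determining module is a placeholder, not the patented logic:

```python
import numpy as np

class VideoRecommendationDevice:
    """First acquisition module, determining module and recommending module (sketch)."""

    def __init__(self, video_feature_set):
        # First acquisition module: holds the acquired video feature information set.
        self.video_feature_set = video_feature_set        # dict: video id -> numpy vector

    def determine_target_video(self, user_feature):
        # Determining module: pick the video whose feature best matches the user
        # feature information (dot-product similarity is an assumption).
        return max(self.video_feature_set,
                   key=lambda vid: float(np.dot(self.video_feature_set[vid], user_feature)))

    def recommend(self, user_feature):
        # Recommending module: deliver the target video to the target user.
        target_video = self.determine_target_video(user_feature)
        return f"recommend video {target_video}"
```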
24. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run performs the method of any one of claims 1 to 22.
25. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to perform the method of any of claims 1 to 22 by means of the computer program.
CN202311384218.7A 2023-10-24 2023-10-24 Video recommendation method and device, storage medium and electronic device Active CN117112834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311384218.7A CN117112834B (en) 2023-10-24 2023-10-24 Video recommendation method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311384218.7A CN117112834B (en) 2023-10-24 2023-10-24 Video recommendation method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN117112834A true CN117112834A (en) 2023-11-24
CN117112834B CN117112834B (en) 2024-02-02

Family

ID=88809583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311384218.7A Active CN117112834B (en) 2023-10-24 2023-10-24 Video recommendation method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN117112834B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382309A (en) * 2020-03-10 2020-07-07 深圳大学 Short video recommendation method based on graph model, intelligent terminal and storage medium
CN111984824A (en) * 2020-07-31 2020-11-24 河海大学 Multi-mode-based video recommendation method
CN112256916A (en) * 2020-11-12 2021-01-22 中国计量大学 Short video click rate prediction method based on graph capsule network
CN115329127A (en) * 2022-07-22 2022-11-11 华中科技大学 Multi-mode short video tag recommendation method integrating emotional information
CN116502181A (en) * 2023-05-19 2023-07-28 西安理工大学 Channel expansion and fusion-based cyclic capsule network multi-modal emotion recognition method
CN116741411A (en) * 2023-06-19 2023-09-12 天津医科大学朱宪彝纪念医院(天津医科大学代谢病医院、天津代谢病防治中心) Intelligent health science popularization recommendation method and system based on medical big data analysis


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BINQIANG WANG等: "Hierarchically stacked graph convolution for emotion recognition in conversation", KNOWLEDGE-BASED SYSTEMS, vol. 263, pages 1 - 11 *
JUNHUI CHEN等: "MSGCN: Multi-Subgraph Based Heterogeneous Graph Convolution Network Embedding", APPLIED SCIENCES, vol. 11, no. 21, pages 1 - 17 *
郑晖 (ZHENG Hui): "A hybrid recommendation model based on graph convolutional network and capsule network", China Master's Theses Full-text Database (Information Science and Technology), no. 06 *

Also Published As

Publication number Publication date
CN117112834B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN112199375B (en) Cross-modal data processing method and device, storage medium and electronic device
Ricci et al. Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks
CN110147711B (en) Video scene recognition method and device, storage medium and electronic device
CN111523047B (en) Multi-relation collaborative filtering algorithm based on graph neural network
CN110012356B (en) Video recommendation method, device and equipment and computer storage medium
CN110046656B (en) Multi-mode scene recognition method based on deep learning
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN111708823B (en) Abnormal social account identification method and device, computer equipment and storage medium
CN112307351A (en) Model training and recommending method, device and equipment for user behavior
CN112380453B (en) Article recommendation method and device, storage medium and equipment
CN111159485A (en) Tail entity linking method, device, server and storage medium
CN114780831A (en) Sequence recommendation method and system based on Transformer
CN112199600A (en) Target object identification method and device
CN115982467A (en) Multi-interest recommendation method and device for depolarized user and storage medium
CN113761359A (en) Data packet recommendation method and device, electronic equipment and storage medium
CN112527993A (en) Cross-media hierarchical deep video question-answer reasoning framework
CN114817712A (en) Project recommendation method based on multitask learning and knowledge graph enhancement
CN112364236A (en) Target object recommendation system, method and device, and data processing method and device
CN117112834B (en) Video recommendation method and device, storage medium and electronic device
CN116452293A (en) Deep learning recommendation method and system integrating audience characteristics of articles
CN114494809A (en) Feature extraction model optimization method and device and electronic equipment
CN114821188A (en) Image processing method, training method of scene graph generation model and electronic equipment
Meshu Welde et al. Counting-based visual question answering with serial cascaded attention deep learning
CN113821610A (en) Information matching method, device, equipment and storage medium
CN117093732B (en) Multimedia resource recommendation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant