CN111400601A - Video recommendation method and related equipment - Google Patents

Video recommendation method and related equipment

Info

Publication number
CN111400601A
CN111400601A
Authority
CN
China
Prior art keywords
feature vector
video
target
semantic
conversion model
Prior art date
Legal status
Granted
Application number
CN202010193915.4A
Other languages
Chinese (zh)
Other versions
CN111400601B (en)
Inventor
屈冰欣
郑茂
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010193915.4A
Publication of CN111400601A
Application granted
Publication of CN111400601B
Legal status: Active
Anticipated expiration

Classifications

    • G06F: Electric digital data processing (G Physics; G06 Computing; Calculating or Counting)
    • G06F 16/9535: Information retrieval; Retrieval from the web; Querying, e.g. by the use of web search engines; Search customisation based on user profiles and personalisation
    • G06F 16/735: Information retrieval of video data; Querying; Filtering based on additional data, e.g. user or group profiles
    • G06F 16/7834: Information retrieval of video data; Retrieval characterised by using metadata automatically derived from the content, using audio features
    • G06F 16/7844: Information retrieval of video data; Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F 16/7847: Information retrieval of video data; Retrieval characterised by using metadata automatically derived from the content, using low-level visual features of the video content

Abstract

The embodiments of the application provide a video recommendation method and related equipment, which can improve the semantic representation of videos, so that videos that better fit user interests are recommended to users. The method includes: determining a target image feature vector; determining a target text feature vector; determining a target audio feature vector; obtaining a semantic representation of the target video through a target semantic conversion model based on the target image feature vector, the target text feature vector and the target audio feature vector, where the target semantic conversion model corresponds to the target image feature, the target text feature and the target audio feature; and when the semantic representation of the target video matches the interest portrait of the target object, pushing the target video to the target object.

Description

Video recommendation method and related equipment
This application is a divisional application of the patent application with application number 201910872376.4, entitled "A video recommendation method and related equipment", filed on 16/09/2019.
Technical Field
The present application relates to the field of information processing, and in particular, to a video recommendation method and related device.
Background
The internet era has driven the creation and rapid expansion of video content. With the massive increase in the amount of video information, users can hardly obtain the content they are truly interested in from such a large number of videos.
In the prior art, video representations of the individual single modalities are obtained by training the image, audio, text and other single modalities of a video independently, and these single-modality representations are then simply concatenated to obtain the final video representation, according to which videos are recommended to users.
However, because each single modality is trained in isolation, information interaction and exchange among the modalities is not considered. As a result, the final video representation cannot fit well the content the video is meant to express, and the videos recommended to the user consequently do not fit the user's interests.
Disclosure of Invention
The application provides a video recommendation method and related equipment, which can improve the semantic representation of videos, so that videos that better fit user interests can be recommended to users.
A first aspect of an embodiment of the present application provides a method for video recommendation, where the method includes:
determining a target image feature vector, wherein the target image feature vector is an image feature vector corresponding to a target video, the target video is a video to be recommended to a target object, and the target image feature vector comprises channel information and optical flow information of a video image frame corresponding to the target video;
determining a target text feature vector, wherein the target text feature vector is a text feature vector corresponding to the target video, and the target text feature vector comprises title information of the target video and attribute information of audio associated with the target video;
determining a target audio feature vector, wherein the target audio feature vector is an audio feature vector corresponding to the target video;
obtaining semantic representation of the target video through a target semantic conversion model based on the target image feature vector, the target text feature vector and the target audio feature vector, wherein the target semantic conversion model corresponds to the target image feature, the target text feature and the target audio feature;
and when the semantic representation of the target video is matched with the interest portrait of the target object, pushing the target video to the target object.
Optionally, the method further comprises:
step 1) determining an image feature vector, a text feature vector and an audio feature vector of each video in a plurality of videos;
step 2) initializing a semantic conversion model;
step 3) obtaining a semantic representation of a first video through the initialized semantic conversion model based on a first image feature vector, a first text feature vector and a first audio feature vector of the first video, wherein the first video is any one of the videos;
step 4), updating a loss function of the semantic conversion model;
and iteratively performing step 3) to step 4) until a preset iteration termination condition is reached, and determining the semantic conversion model reaching the preset iteration termination condition as the target semantic conversion model.
Optionally, before obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector, and the first audio feature vector of the first video, the method further includes:
determining a probability distribution of the first image feature vector, a probability distribution of the first text feature vector, and a probability distribution of the first audio feature vector;
modifying the probability distribution of a first feature vector so as to minimize the relative entropy of the modified probability distribution of the first feature vector and the probability distributions of other feature vectors, wherein the first feature vector is any one of the first image feature vector, the first text feature vector and the first audio feature vector, and the other feature vectors are the feature vectors except the first feature vector in the first image feature vector, the first text feature vector and the first audio feature vector;
the obtaining of the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first feature vector after the probability distribution is changed.
Optionally, before obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector, and the first audio feature vector of the first video, the method further includes:
determining a vector distance between a second feature vector and a third feature vector, wherein the second feature vector and the third feature vector are any two feature vectors of the first image feature vector, the first text feature vector and the first audio feature vector;
migrating the second feature vector to the vector space of the third feature vector such that the vector distance between the second feature vector and the third feature vector is minimized;
the obtaining of the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the second feature vector and the third feature vector after the migration.
Optionally, before obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector, and the first audio feature vector of the first video, the method further includes:
establishing a target semantic space;
migrating a fourth feature vector and a fifth feature vector to the target semantic space, wherein the fourth feature vector and the fifth feature vector are any two feature vectors of the first image feature vector, the first text feature vector and the first audio feature vector;
determining a target feature vector based on the fourth feature vector and the fifth feature vector after migration, wherein the target feature vector is a feature vector in the target semantic space, and the vector distances between the target feature vector and the fourth feature vector and between the target feature vector and the fifth feature vector are both smaller than a preset value;
the obtaining of the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the target feature vector.
Optionally, the method further comprises:
step 1) determining an image feature vector, a text feature vector and an audio feature vector of each of the plurality of videos;
step 2) based on a second image feature vector, obtaining a classification result of a second video through a first conversion model, wherein the first conversion model is a model corresponding to the image features of the second video, the second image feature vector is an image feature vector corresponding to the video image frames of the second video, and the second video is any one of the videos;
step 3) obtaining semantic representation of the second video through a second conversion model based on the classification result of the second video, the second image feature vector, a second text feature vector and a second audio feature vector, wherein the second text feature vector is a text feature vector corresponding to the second video, and the second audio feature vector is an audio feature vector corresponding to the second video;
step 4) updating the weight of the loss function of the second conversion model;
and iteratively executing step 2) to step 4) until the preset iteration termination condition is reached, and determining the second conversion model reaching the preset iteration termination condition as the target semantic conversion model.
Optionally, the updating the loss function of the semantic conversion model includes:
updating the weight corresponding to a fourth feature vector in the loss function through a back propagation algorithm, wherein the fourth feature vector is any one feature vector in the first image feature vector, the first text feature vector and the first audio feature vector;
and updating the weight corresponding to the first image feature vector, the weight corresponding to the first text feature vector and the weight corresponding to the first audio feature vector in the loss function through the back propagation algorithm.
Optionally, the method further comprises:
judging whether the number of iterations reaches a preset value, and if so, determining that the preset iteration termination condition is reached;
or,
judging whether the loss function of the semantic conversion model has converged, and if so, determining that the preset iteration termination condition is reached.
A second aspect of the embodiments of the present application provides a video recommendation apparatus, including:
the device comprises a first determining unit, a second determining unit and a third determining unit, wherein the first determining unit is used for determining a target image feature vector, the target image feature vector is an image feature vector corresponding to a target video, the target video is a video to be recommended to a target object, and the target image feature vector comprises channel information and optical flow information of a video image frame corresponding to the target video;
the first determining unit is further configured to determine a target text feature vector, where the target text feature vector is a text feature vector corresponding to the target video, and the target text feature vector includes title information of the target video and attribute information of audio associated with the target video;
the first determining unit is further configured to determine a target audio feature vector, where the target audio feature vector is an audio feature vector corresponding to the target video;
a processing unit, configured to obtain a semantic representation of the target video through a target semantic conversion model based on the target image feature vector, the target text feature vector, and the target audio feature vector, where the target semantic conversion model corresponds to the target image feature, the target text feature, and the target audio feature;
and the pushing unit is used for pushing the target video to the target object when the semantic representation of the target video is matched with the interest portrait of the target object.
Optionally, the video recommendation apparatus further includes:
a model training unit specifically configured to:
step 1) determining an image feature vector, a text feature vector and an audio feature vector of each video in a plurality of videos;
step 2) initializing a semantic conversion model;
step 3) obtaining a semantic representation of a first video through the initialized semantic conversion model based on a first image feature vector, a first text feature vector and a first audio feature vector of the first video, wherein the first video is any one of the videos;
step 4), updating a loss function of the semantic conversion model;
and iteratively performing step 3) to step 4) until a preset iteration termination condition is reached, and determining the semantic conversion model reaching the preset iteration termination condition as the target semantic conversion model.
Optionally, the video recommendation apparatus further includes:
a second determining unit configured to determine a probability distribution of the first image feature vector, a probability distribution of the first text feature vector, and a probability distribution of the first audio feature vector;
a feature conversion unit, configured to modify a probability distribution of a first feature vector so as to minimize a relative entropy of the modified probability distribution of the first feature vector and probability distributions of other feature vectors, where the first feature vector is any one of the first image feature vector, the first text feature vector, and the first audio feature vector, and the other feature vectors are feature vectors other than the first feature vector in the first image feature vector, the first text feature vector, and the first audio feature vector;
the model training unit obtains semantic representation of the first video through the initialized semantic conversion model based on a first image feature vector, a first text feature vector and a first audio feature vector of the first video, and the model training unit comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first feature vector after the probability distribution is changed.
Optionally, the second determining unit is further configured to determine a vector distance between a second feature vector and a third feature vector, where the second feature vector and the third feature vector are any two feature vectors of the first image feature vector, the first text feature vector, and the first audio feature vector;
the feature conversion unit is further configured to migrate the second feature vector to the vector space of the third feature vector, so that the vector distance between the second feature vector and the third feature vector is minimized;
the model training unit obtains semantic representation of the first video through the initialized semantic conversion model based on a first image feature vector, a first text feature vector and a first audio feature vector of the first video, and the model training unit comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the second feature vector and the third feature vector after the migration.
Optionally, the video recommendation apparatus further includes:
the establishing unit is used for establishing a target semantic space;
the feature conversion unit is further configured to migrate a fourth feature vector and a fifth feature vector to the target semantic space, where the fourth feature vector and the fifth feature vector are any two feature vectors of the first image feature vector, the first text feature vector, and the first audio feature vector;
a third determining unit, configured to determine a target feature vector based on the fourth feature vector and the fifth feature vector after migration, where the target feature vector is a feature vector in the target semantic space, and vector distances between the target feature vector and the fourth feature vector and between the target feature vector and the fifth feature vector are both smaller than a preset value;
the model training unit obtains semantic representation of the first video through the initialized semantic conversion model based on a first image feature vector, a first text feature vector and a first audio feature vector of the first video, and the model training unit comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the target feature vector.
Optionally, the model training unit is further configured to:
step 1) determining an image feature vector, a text feature vector and an audio feature vector of each of the plurality of videos;
step 2) based on a second image feature vector, obtaining a classification result of a second video through a first conversion model, wherein the first conversion model is a model corresponding to the image features of the second video, the second image feature vector is an image feature vector corresponding to the video image frames of the second video, and the second video is any one of the videos;
step 3) obtaining semantic representation of the second video through a second conversion model based on the classification result of the second video, the second image feature vector, a second text feature vector and a second audio feature vector, wherein the second text feature vector is a text feature vector corresponding to the second video, and the second audio feature vector is an audio feature vector corresponding to the second video;
step 4) updating the weight of the loss function of the second conversion model;
and iteratively executing step 2) to step 4) until the preset iteration termination condition is reached, and determining the second conversion model reaching the preset iteration termination condition as the target semantic conversion model.
Optionally, the updating, by the model training unit, the loss function of the semantic conversion model includes:
updating the weight corresponding to a fourth feature vector in the loss function through a back propagation algorithm, wherein the fourth feature vector is any one feature vector in the first image feature vector, the first text feature vector and the first audio feature vector;
and updating the weight corresponding to the first image feature vector, the weight corresponding to the first text feature vector and the weight corresponding to the first audio feature vector in the loss function through the back propagation algorithm.
Optionally, the model training unit is further configured to:
judging whether the number of iterations reaches a preset value, and if so, determining that the preset iteration termination condition is reached;
or,
judging whether the loss function of the semantic conversion model has converged, and if so, determining that the preset iteration termination condition is reached.
A third aspect of the embodiments of the present application provides a computer apparatus, which includes at least one processor, a memory and a transceiver that are connected to one another, wherein the memory is configured to store program code, and the processor is configured to call the program code in the memory to perform the steps of the video recommendation method in the above aspects.
A fourth aspect of the embodiments of the present application provides a computer storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the steps of the method for video recommendation described in the above aspects.
In summary, it can be seen that, when a target video is recommended to a user, the target video can be pushed to the user according to whether the semantic representation of the target video matches the user's interest portrait, and the semantic representation of the target video is obtained through a target semantic conversion model based on the image feature vector, the text feature vector and the audio feature vector of the target video. Therefore, the method and the device integrate the feature information of each single modality of the video and improve the richness of the final video semantic representation, so that the obtained video semantic representation fits the theme of the target video more closely, and videos that better match the user's interests are recommended to the user.
Drawings
Fig. 1 is a schematic diagram of a network architecture provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of an algorithm of multi-modal joint learning provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for video recommendation according to an embodiment of the present application;
fig. 4 is a schematic diagram of feature extraction of a target video according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training process of a target semantic conversion model according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of feature migration provided by embodiments of the present application;
FIG. 7 is another schematic flow chart diagram illustrating feature migration provided by embodiments of the present application;
FIG. 8 is a schematic diagram of an algorithm for model training based on KD concepts according to an embodiment of the present disclosure;
FIG. 9 is another schematic diagram of a training process of a target semantic conversion model according to an embodiment of the present disclosure;
fig. 10 is a schematic virtual structure diagram of an apparatus for video recommendation according to an embodiment of the present application;
fig. 11 is a schematic hardware structure diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
The terms "first," "second," and the like in the description, the claims and the drawings of the present application are used to distinguish between similar elements and are not necessarily used to describe a particular sequence or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. Furthermore, the terms "comprise," "include," and "have," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such a process, method, article, or apparatus. The division of modules presented herein is merely a logical division and may be implemented differently in practical applications; for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not implemented. The couplings, direct couplings or communication connections between modules shown or discussed may be indirect couplings or communication connections through interfaces, and may be electrical or in other similar forms; this application is not limited in this respect. The modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, and may be distributed over a plurality of circuit modules; some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the present application.
The present application relates to the field of artificial intelligence, and first illustrates some concepts of artificial intelligence, as follows:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level technologies and software-level technologies. The basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to identify, track and measure targets, and further performs image processing so that the processed image is more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Key technologies of Speech Technology include automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes in the future.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence; it studies various theories and methods that enable effective communication between humans and computers using natural language.
A network structure diagram of a video recommendation method provided in an embodiment of the present application is described below with reference to fig. 1:
As shown in fig. 1, in the present application, the server 103 may be one server or multiple servers. The server 103 establishes a communication connection with the server 101 through the network 102, and the server 103 acquires data of the server 101 through the network. Specifically, the server 103 may obtain the video information of a target video on the server 101 through the network 102, and then determine a target image feature vector, a target text feature vector and a target audio feature vector according to the video information; obtain a semantic representation of the target video through a target semantic conversion model based on the target image feature vector, the target text feature vector and the target audio feature vector; and when the semantic representation of the target video matches the interest portrait of the target object, push the target video to the target object.
Referring to fig. 2, in the present application, the feature vectors of the various modalities of a video (the audio feature vector, the image feature vector and the text feature vector) are extracted, information interaction among the modalities is realized through multimodal joint learning, and the loss function of the model is adjusted, so that the trained model can output richer semantic representations. A video is then recommended to the user based on the semantic representation of the video and the interest portrait of the user, so that the video recommended to the user better fits the user's interests and hobbies. In addition, the single-modality features of the video complement each other, and after multimodal joint learning each single-modality feature representation also contains the advantageous information of the other modalities, which ultimately improves the semantic representation of the video.
The method for video recommendation in the present invention will be described in detail below from the perspective of a video recommendation device, which may be a server or a service unit in the server, and is not limited specifically.
Referring to fig. 3, fig. 3 is a flowchart illustrating a video recommendation method according to an embodiment of the present application, including:
301. and determining a target image feature vector.
In this embodiment, the video recommendation apparatus may first determine a target video, where the target video is a video to be recommended to a target object, and then determine a target image feature vector corresponding to the target video, where the target image feature vector is the image feature vector corresponding to the target video and includes channel information and optical flow information of the video image frames corresponding to the target video. Specifically, the video image frames corresponding to the target video may be several frames arbitrarily selected from the target video, or one frame randomly selected from the video image frames corresponding to each second of the target video; the selected video image frames are then subjected to feature extraction by an Inception-ResNet-v2 model, and the target image features are output, where the target image feature vector includes a channel feature vector and an optical flow feature vector.
When the image features are extracted from the video image frames, the extraction may be performed by an Inception-ResNet-v2 model, or by other deep learning network structures, which is not specifically limited.
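For illustration only, the following sketch (using Python with OpenCV; the per-second random sampling strategy, the function names and the parameter values are assumptions of this note rather than the application's actual implementation) shows how per-second frames and dense optical flow between them could be prepared as the channel and optical-flow inputs of an image feature extractor:

```python
import cv2
import numpy as np

def sample_one_frame_per_second(video_path):
    """Randomly pick one frame from each second of the video (assumed strategy)."""
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 25   # fall back to 25 fps if unknown
    frames, buffer = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        buffer.append(frame)
        if len(buffer) == fps:                   # one second of frames collected
            frames.append(buffer[np.random.randint(fps)])
            buffer = []
    cap.release()
    return frames

def optical_flow_between_samples(frames):
    """Dense optical flow (Farneback) between consecutive sampled frames."""
    if len(frames) < 2:
        return []
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        cur = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flows.append(cv2.calcOpticalFlowFarneback(prev, cur, None,
                                                  0.5, 3, 15, 3, 5, 1.2, 0))
        prev = cur
    return flows

# The sampled frames give the "channel" (RGB) input and the flows give the optical-flow
# input; both would then be fed through an Inception-ResNet-v2-style backbone (omitted
# here) and pooled over time to obtain the channel and optical-flow feature vectors.
```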
302. And determining a target text feature vector.
Specifically, the video recommendation apparatus may perform preprocessing operations on text data such as the title of the target video and the audio associated with the target video (e.g., background music), for example, word segmentation and stop-word removal; the text data after word segmentation and stop-word removal is then input into a bidirectional Long Short-Term Memory network (Bi-LSTM), and the target text feature vector corresponding to the target video is output through a self-attention mechanism.
It should be noted that extracting the text feature vector of the text data through the Bi-LSTM model may be replaced by another deep learning model, such as a Convolutional Neural Network (CNN), which is not specifically limited, as long as the text data corresponding to the target video can be processed to obtain the target text feature vector.
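As an illustrative sketch only (PyTorch is assumed; the vocabulary size, embedding dimension and the additive form of the attention are assumptions, not the disclosed model), the Bi-LSTM plus self-attention text encoder described above could look as follows:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bi-LSTM over title / background-music text followed by additive self-attention."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # attention score per token

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        h, _ = self.bilstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden_dim)
        weights = torch.softmax(self.attn(h), dim=1)
        return (weights * h).sum(dim=1)            # (batch, 2*hidden_dim) text feature
```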
303. And determining a target audio feature vector.
In this embodiment, the video recommendation apparatus may determine a target audio feature vector, where the target audio feature vector is the audio feature vector corresponding to the target video. Specifically, the video recommendation apparatus may extract Mel-Frequency Cepstral Coefficient (MFCC) features corresponding to the audio of the target video, and compute the audio features through a VGGish model based on the MFCC features. It is understood that the video recommendation apparatus may train a general model and a specific model for extracting audio features in advance according to the distribution characteristics of the audio data, where the general model is suitable for extracting audio features of all types of videos, and the specific model is only suitable for extracting audio features of a certain type of video; for example, for music or crosstalk (comedy) videos, audio features may be identified by training a specific model.
It should be noted that the audio features corresponding to the video are extracted here through the VGGish model; of course, the audio features of the video may also be extracted through other network structures, which is not specifically limited.
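The following is a minimal illustrative sketch (assuming Python with librosa; the sampling rate, the number of MFCC coefficients and the time-pooling stand-in for the VGGish network are assumptions of this note):

```python
import librosa
import numpy as np

def audio_feature_vector(audio_path, n_mfcc=64):
    """MFCC extraction followed by a simple time pooling standing in for VGGish."""
    y, sr = librosa.load(audio_path, sr=16000)                # mono audio at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, frames)
    # A VGGish-style network would consume time-frequency patches of these features;
    # mean-pooling over time is used here only as a placeholder for that model.
    return np.mean(mfcc, axis=1)                               # shape: (n_mfcc,)
```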
It should be noted that the target image feature vector may be determined through step 301, the target text feature vector may be determined through step 302, and the target audio feature vector may be determined through step 303; however, there is no restriction on the execution order of these three steps: step 301, step 302 or step 303 may be performed first, or they may be performed simultaneously, which is not specifically limited.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating feature extraction of each modality of a video according to an embodiment of the present application.
The video recommendation device can acquire image data, audio data and text data of a target video, wherein the image data comprises channel data and optical flow data, and then perform feature extraction through a feature extraction model corresponding to each data to obtain a channel feature vector 401, an optical flow feature vector 402, an audio feature vector 403 and a text feature vector 404.
304. And obtaining the semantic representation of the target video through a target semantic conversion model based on the target image feature vector, the target text feature vector and the target audio feature vector.
In this embodiment, after obtaining the target image feature vector, the target text feature vector, and the target audio feature vector, the video recommending apparatus may input the target image feature vector, the target text feature vector, and the target audio feature vector into a target semantic conversion model trained in advance, and output a semantic representation of the target video, where the target semantic conversion model corresponds to the target image feature, the target text feature, and the target audio feature. That is to say, the image feature vector, the text feature vector and the audio feature vector of each of the plurality of videos may be trained in advance to obtain a target semantic conversion model, and then, when semantic representation identification needs to be performed on the target video, the semantic representation of the target video may be output only by inputting the image feature vector, the text feature vector and the audio feature vector corresponding to the target video into the target semantic conversion model.
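The application does not disclose the internal architecture of the target semantic conversion model; as a purely illustrative stand-in (PyTorch assumed, all dimensions hypothetical), a minimal fusion model that takes the three modality vectors and produces a semantic representation could be sketched as:

```python
import torch
import torch.nn as nn

class SemanticConversionModel(nn.Module):
    """Illustrative fusion of image, text and audio feature vectors."""
    def __init__(self, img_dim, txt_dim, aud_dim, sem_dim=256, num_tags=1000):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim + aud_dim, 512),
            nn.ReLU(),
            nn.Linear(512, sem_dim),
        )
        self.classifier = nn.Linear(sem_dim, num_tags)   # supervision head for training

    def forward(self, img_vec, txt_vec, aud_vec):
        semantic = self.fuse(torch.cat([img_vec, txt_vec, aud_vec], dim=-1))
        return semantic, self.classifier(semantic)        # semantic representation, logits
```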
305. And when the semantic representation of the target video is matched with the interest portrait of the target object, pushing the target video to the target object.
In this embodiment, after obtaining the semantic representation of the target video, the video recommendation apparatus may determine whether the semantic representation of the target video matches the interest portrait of the target object, and if so, push the target video to the target object.
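How the semantic representation is matched against the interest portrait is not specified in the application; a minimal sketch, assuming the interest portrait is a vector in the same semantic space and using a cosine-similarity threshold (the threshold value is illustrative), might be:

```python
import torch
import torch.nn.functional as F

def maybe_push(video_semantic, interest_portrait, threshold=0.8):
    """Decide, for a single video, whether to push it to the target object."""
    similarity = F.cosine_similarity(video_semantic, interest_portrait, dim=-1)
    return bool(similarity > threshold)   # True -> push the target video to the target object
```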
In summary, it can be seen that, when a target video is recommended to a user, the target video can be pushed to the user according to whether the semantic representation of the target video matches the user's interest portrait, and the semantic representation of the target video is obtained through a target semantic conversion model based on the image feature vector, the text feature vector and the audio feature vector of the target video. Therefore, the method and the device integrate the feature information of each single modality of the video and improve the richness of the final video semantic representation, so that the obtained video semantic representation fits the theme of the target video more closely, and videos that better match the user's interests are recommended to the user.
The video recommendation method provided in the embodiment of the present application is explained above, and a training process of the target semantic conversion model provided in the embodiment of the present application is explained below.
Referring to fig. 5, fig. 5 is a schematic diagram of a training process of a target semantic conversion model according to an embodiment of the present application, including:
501. an image feature vector, a text feature vector, and an audio feature vector are determined for each of a plurality of videos.
In this embodiment, during training of the target semantic conversion model, a plurality of videos in the training corpus may be preprocessed to obtain the image feature vector, the text feature vector and the audio feature vector of each of the plurality of videos, where the preprocessing mainly includes: feature extraction, feature dimensionality reduction, feature null-value processing and feature normalization; null-value processing of target values, one-hot encoding of target values, and the like.
It should be noted that, the manner of determining the image feature vector, the text feature vector, and the audio feature vector of each of the plurality of videos is similar to that in step 301, step 302, and step 303 in fig. 3, which has already been described above, and details are not repeated here.
It should be noted that after the image feature vector, the text feature vector and the audio feature vector of each of the plurality of videos are obtained, all the feature vectors may be divided into two parts, for example in a 75%/25% proportion: 75% of the data may be used as training data for training the model, and the other 25% may be used as test data for testing the model. Other division manners may also be used; this is not specifically limited.
502. And initializing a semantic conversion model.
In this embodiment, the semantic conversion model is initialized, that is, parameters in the semantic conversion model are initialized.
It should be noted that, the image feature vector, the text feature vector, and the audio feature vector of each of the plurality of videos may be determined through step 501, and the semantic conversion model may be initialized through step 502, however, there is no restriction on the execution sequence between these two steps, and step 501 may be executed first, or step 502 may be executed first, or executed simultaneously, which is not limited specifically.
503. And obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video.
In this embodiment, the video recommendation apparatus may randomly select a first video from the plurality of videos, and input the first image feature vector, the first text feature vector and the first audio feature vector corresponding to the first video into the initialized semantic conversion model to obtain a semantic representation of the first video.
504. And updating a loss function of the semantic conversion model.
In this embodiment, after the semantic representation of the first video is obtained, the loss function of the semantic conversion model is updated. The update method is not specifically limited here.
505. Iteratively executing steps 503 to 504 until a preset iteration termination condition is reached, and determining the semantic conversion model reaching the preset iteration termination condition as the target semantic conversion model.
In this embodiment, in the iterative computation process, it is determined whether an iteration termination condition is currently satisfied, if so, the iteration is stopped, the semantic conversion model when the iteration is stopped is determined as the target semantic conversion model, and if not, steps 503 to 504 are repeatedly performed. That is, it may be determined whether the iteration number reaches a preset value, and if so, it is determined that a preset iteration termination condition is satisfied.
Alternatively,
whether the loss function of the semantic conversion model is converged can be judged, namely, the value of the loss function does not change greatly after multiple iterations, and if yes, the preset iteration termination condition is determined to be met.
In practical applications, other conditions may also be used as iteration termination conditions, which are not limited herein; the loss function may be a cross entropy (Cross-Entropy Loss) function or a softmax function, which is likewise not limited herein.
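As an illustrative sketch of the iteration described in steps 501 to 505 (PyTorch assumed; the optimizer, label format and hyper-parameter values are assumptions, and the model is taken to be something like the hypothetical fusion model sketched earlier), the loop with both termination conditions could look as follows:

```python
import torch

def train_semantic_model(model, dataloader, max_iters=10000, tol=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    prev_loss = float("inf")
    for step, (img_vec, txt_vec, aud_vec, label) in enumerate(dataloader):
        _, logits = model(img_vec, txt_vec, aud_vec)     # step 503: forward pass
        loss = loss_fn(logits, label)
        optimizer.zero_grad()
        loss.backward()                                  # step 504: update the loss/weights
        optimizer.step()
        # step 505: preset iteration termination conditions (iteration count or convergence)
        if step + 1 >= max_iters or abs(prev_loss - loss.item()) < tol:
            break
        prev_loss = loss.item()
    return model                                         # target semantic conversion model
```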
It should be noted that, after the training of the target semantic conversion model is completed, the target semantic conversion model may be tested with the test data. Specifically, whether the target semantic conversion model meets the expected requirements may be determined through indexes such as the accuracy (i.e., the proportion of results the model outputs correctly), the precision, the recall and the F1 score of the target semantic conversion model.
In summary, it can be seen that, in the embodiment of the application, in the training process of the target semantic model, feature information of each single mode of a video is integrated, and meanwhile, information interaction is performed between the single modes, so that the richness of the final video semantic representation is improved, and thus the trained target semantic conversion model can improve the semantic representation of the video when in use.
In the training process of fig. 5 for the target semantic conversion model, an optimization operation for the target semantic model may also be added, and the optimization may be performed from two aspects, i.e., the training corpus and the back propagation, respectively, and the optimization of the training corpus includes feature migration and feature conversion between feature vectors of each modality, which are described below:
1. Feature conversion between feature vectors in the training corpus:
in one embodiment, before obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector, and the first audio feature vector of the first video, the method further includes:
determining a probability distribution of the first image feature vector, a probability distribution of the first text feature vector and a probability distribution of the first audio feature vector;
modifying the probability distribution of the first feature vector so as to minimize the relative entropy of the probability distribution of the modified first feature vector and the probability distributions of other feature vectors, wherein the first feature vector is any one of a first image feature vector, a first text feature vector and a first audio feature vector, and the other feature vectors are the feature vectors except the first feature vector in the first image feature vector, the first text feature vector and the first audio feature vector;
obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first feature vector after the probability distribution is changed.
In this embodiment, the training corpus may be optimized through feature conversion. Specifically, before the first image feature vector, the first text feature vector and the first audio feature vector of the first video are trained through the semantic conversion model, the probability distribution of the first feature vector may be modified so that the relative entropy between the modified probability distribution of the first feature vector and the probability distributions of the other feature vectors is minimized; that is, the probability distribution of the feature vector of a certain modality of the first video is modified so that the relative entropy between that modality's feature vector and the probability distributions of the feature vectors of the other two modalities is minimized. The feature vector whose probability distribution has been modified may then be input into the initialized semantic conversion model to obtain the semantic representation of the first video.
It should be noted that feature conversion may be performed on feature vectors of all three modalities in the first video, or feature conversion may be performed only on feature vectors of one or two modalities of the three modalities, which is not limited specifically.
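A minimal sketch of this feature conversion is given below, assuming PyTorch, equal modality dimensions and a simple linear mapping as the learnable conversion (all assumptions of this note); minimizing kl_alignment_loss drives the converted modality's distribution toward the distributions of the other modalities:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistributionConverter(nn.Module):
    """Learnable mapping that modifies one modality's feature distribution."""
    def __init__(self, dim):
        super().__init__()
        self.map = nn.Linear(dim, dim)

    def forward(self, vec):
        return self.map(vec)

def kl_alignment_loss(converted_vec, other_vecs):
    """Relative entropy between the converted modality and each remaining modality."""
    log_p = F.log_softmax(converted_vec, dim=-1)          # converted distribution (log)
    loss = converted_vec.new_zeros(())
    for other in other_vecs:
        q = F.softmax(other, dim=-1)                      # reference distribution
        loss = loss + F.kl_div(log_p, q, reduction="batchmean")
    return loss                                            # minimized during training
```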
2. A feature migration approach between feature vectors in a corpus:
in one embodiment, before obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video, the method further includes:
determining a vector distance between a second feature vector and a third feature vector, wherein the second feature vector and the third feature vector are any two feature vectors of the first image feature vector, the first text feature vector and the first audio feature vector;
migrating the second feature vector to the vector space of the third feature vector so that the vector distance between the second feature vector and the third feature vector is minimized;
obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the second feature vector and the third feature vector after the migration.
In this embodiment, feature transfer (migration) between modalities is learned at a feature level, and information between single modalities is exchanged with each other to achieve a complementary effect. Specifically, a vector distance between any two feature vectors among the image feature vector, the text feature vector, and the audio feature vector may be determined, and another one of the two feature vectors may be migrated to a vector space of another one of the feature vectors, where after the migration, the vector distance between the two feature vectors is the smallest. Therefore, information interaction among all single modes of the first video can be improved, and the richness of semantic representation of the first video which is finally output is improved.
Referring to fig. 6, fig. 6 is a schematic flow chart of feature migration according to an embodiment of the present application: after the features of each modality of the first video are extracted by the feature extraction model, the features between any two modalities may be migrated, and here, taking the migration of the channel features and the optical flow features as an example, the channel features are migrated into the vector space where the optical flow features are located until the vector distance between the channel features and the optical flow features is minimum.
It is understood that the above description is only for the channel feature and the optical flow feature as an example, but it is also possible to perform feature migration between two other features, for example, feature migration between an audio feature and a text feature; in addition, only one of the image feature, the text feature and the audio feature may be migrated, or the feature migration operation may be performed on any two features, which is not limited specifically.
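For illustration, the following sketch shows such feature migration under the assumption that the migration is a learnable linear projection trained to minimize the distance between the migrated channel feature and the optical-flow feature (the projection form and the loss are assumptions of this note, not the disclosed method):

```python
import torch
import torch.nn as nn

class FeatureMigration(nn.Module):
    """Projects a source-modality feature into the target modality's vector space."""
    def __init__(self, src_dim, dst_dim):
        super().__init__()
        self.project = nn.Linear(src_dim, dst_dim)

    def forward(self, src_vec):
        return self.project(src_vec)

def migration_loss(migrated_src, dst_vec):
    # Squared Euclidean distance; minimizing it pulls the migrated channel feature
    # toward the optical-flow feature so the two modalities complement each other.
    return ((migrated_src - dst_vec) ** 2).sum(dim=-1).mean()
```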
3. Another feature migration method between feature vectors in the corpus:
in one embodiment, before obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector, and the first audio feature vector of the first video, the method further includes:
establishing a target semantic space;
migrating a fourth feature vector and a fifth feature vector to a target semantic space, wherein the fourth feature vector and the fifth feature vector are any two feature vectors of the first image feature vector, the first text feature vector and the first audio feature vector;
determining a target feature vector based on the fourth feature vector and the fifth feature vector after the migration, wherein the target feature vector is a feature vector in a target semantic space, and the vector distances between the target feature vector and the fourth feature vector and between the target feature vector and the fifth feature vector are smaller than a preset value;
obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the target feature vector.
In this embodiment, when learning feature transfer (migration) between modalities at the feature level, information between the individual modalities is exchanged so that the features of the individual modalities complement each other. As explained in connection with fig. 7, after the features of the respective modalities are extracted by the respective feature extraction models, a target semantic space can be established, and two feature vectors are arbitrarily selected from among the image feature vector, the text feature vector and the audio feature vector, where the image feature vector includes a channel feature vector and an optical flow feature vector. Taking the channel feature vector and the optical flow feature vector as the two selected feature vectors for illustration, the channel feature and the optical flow feature are migrated to the target semantic space. After the migration is completed, a target feature vector is determined in the target semantic space such that, in the target semantic space, the vector distances between the target feature vector and the channel feature vector and between the target feature vector and the optical flow feature vector are both smaller than a preset value; the target feature vector can then be input into the semantic conversion model to obtain the semantic representation of the first video.
It should be noted that the above description only takes the migration of the channel feature vector and the optical flow feature vector into the new semantic space as an example; two other features may also be used, such as the channel feature vector and the audio feature vector, or the audio feature vector and the text feature vector. In addition, when performing model training, feature migration may be performed on only one pair of modalities among the image feature vector, the text feature vector and the audio feature vector of a video, or on every pair of modalities, which is not specifically limited.
It should be further noted that the feature space transformation may also be replaced by other feature transformation manners, such as function transformation by a linear rectification function (Rectified Linear Unit, ReLU), Softmax, etc., adding hidden layers, adding convolutional layers, etc., and is not specifically limited.
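For illustration only, the following is a minimal sketch of the target semantic space described above, assuming two learned linear projections into a shared space, with the midpoint of the two projections taken as the candidate target feature vector once its distances to both projections fall below the preset value; the dimensions, the midpoint choice and all other settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

target_dim = 512                               # assumed dimension of the target semantic space
to_target_a = nn.Linear(2048, target_dim)      # projects the channel feature vector
to_target_b = nn.Linear(1024, target_dim)      # projects the optical flow feature vector
params = list(to_target_a.parameters()) + list(to_target_b.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

def build_target_vector(channel_feat, flow_feat, preset=0.1, max_steps=500):
    """Migrate both features into the target space and return a vector whose
    distance to each migrated feature is smaller than the preset value."""
    target = None
    for _ in range(max_steps):
        a = to_target_a(channel_feat)
        b = to_target_b(flow_feat)
        target = (a + b) / 2                   # candidate target feature vector
        dist_a = torch.norm(target - a, dim=-1).mean()
        dist_b = torch.norm(target - b, dim=-1).mean()
        if dist_a < preset and dist_b < preset:
            break
        loss = dist_a + dist_b                 # pull the two projections together
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return target.detach()

# usage: one batch of channel features and one batch of optical flow features
channel_feat = torch.randn(8, 2048)
flow_feat = torch.randn(8, 1024)
target_vec = build_target_vector(channel_feat, flow_feat)
```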
4. The manner of optimizing back propagation:
in one embodiment, updating the penalty function of the semantic conversion model comprises:
updating the weight corresponding to a fourth feature vector in the loss function through a back propagation algorithm, wherein the fourth feature vector is any one feature vector in the first image feature vector, the first text feature vector and the first audio feature vector;
and updating the weight corresponding to the first image feature vector, the weight corresponding to the first text feature vector and the weight corresponding to the first audio feature vector in the loss function through a back propagation algorithm.
In this embodiment, 4 back propagation passes may be performed through a back propagation algorithm to update the loss function of the semantic conversion model: in each of the first 3 passes, the weights corresponding to the features of 2 of the modalities are fixed and only the weight corresponding to the feature of the remaining 1 modality is modified, and in the 4th pass the weights corresponding to the features of all 3 modalities are modified. For example, in the first back propagation pass only the weight corresponding to the image feature vector is modified, while the weight corresponding to the text feature vector and the weight corresponding to the audio feature vector are fixed; in the second pass only the weight corresponding to the text feature vector is modified, while the weight corresponding to the image feature vector and the weight corresponding to the audio feature vector are fixed; in the third pass only the weight corresponding to the audio feature vector is modified, while the weight corresponding to the image feature vector and the weight corresponding to the text feature vector are fixed; and in the fourth pass the weight corresponding to the image feature vector, the weight corresponding to the text feature vector and the weight corresponding to the audio feature vector are all modified.
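For illustration only, the four back propagation passes may be sketched as follows, assuming the loss function is a weighted sum of per-modality losses with one learnable weight per modality (w_img, w_txt and w_aud are hypothetical names, and the per-modality loss values are placeholders); this sketch is not the specific implementation of the present application.

```python
import torch
import torch.nn as nn

class WeightedLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.w_img = nn.Parameter(torch.tensor(1.0))
        self.w_txt = nn.Parameter(torch.tensor(1.0))
        self.w_aud = nn.Parameter(torch.tensor(1.0))

    def forward(self, loss_img, loss_txt, loss_aud):
        # total loss as a weighted sum of the per-modality losses
        return self.w_img * loss_img + self.w_txt * loss_txt + self.w_aud * loss_aud

criterion = WeightedLoss()
optimizer = torch.optim.SGD(criterion.parameters(), lr=1e-2)
weights = [criterion.w_img, criterion.w_txt, criterion.w_aud]

def train_step(loss_img, loss_txt, loss_aud):
    # passes 1-3: fix two modality weights, update only the remaining one
    for free_idx in range(3):
        for i, w in enumerate(weights):
            w.requires_grad_(i == free_idx)
        loss = criterion(loss_img, loss_txt, loss_aud)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # pass 4: update the weights of all three modalities together
    for w in weights:
        w.requires_grad_(True)
    loss = criterion(loss_img, loss_txt, loss_aud)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# usage with placeholder per-modality loss values
train_step(torch.tensor(0.9), torch.tensor(0.7), torch.tensor(0.5))
```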
In the above, the training process of the target semantic model is illustrated by fig. 5. The target semantic model may also be trained by adaptive learning based on Knowledge Distillation (KD): using the idea of KD, a soft target is added during optimization, so as to obtain a better-optimized target semantic model.
Referring to fig. 8, 801 shows a well-performing complex model (i.e., the first conversion model), and the soft target is the output of the complex model (the output being a classification result); 802 is a simple model including two loss functions, one being the cross entropy loss function of the soft target output by 801, and the other being the cross entropy loss function of the hard target (i.e., of the model output against the ground-truth label); the simple model in 802 adjusts the weight of the cross entropy loss function of the soft target (i.e., λ in 802) and the weight of the cross entropy loss function of the hard target (i.e., 1 − λ in 802) to obtain a total loss function. The trained model is the simple model shown as 803 in fig. 8, and the feature vector is input into the simple model of 803 to output a prediction result.
Referring to fig. 9, fig. 9 is another schematic diagram of a training process of a target semantic model according to an embodiment of the present application, including:
901. an image feature vector, a text feature vector, and an audio feature vector are determined for each of a plurality of videos.
It should be noted that the determination in step 901 of the image feature vector, the text feature vector, and the audio feature vector of each of the plurality of videos is similar to steps 301, 302, and 303 in fig. 3, which have been described in detail above and are not repeated here.
902. And obtaining a classification result of the second video through the first conversion model based on the second image characteristics.
In this embodiment, a model corresponding to an image feature may be selected as a first conversion model, where the first conversion model is a model corresponding to an image feature of a second video, the second image feature is an image feature vector corresponding to a video image frame of the second video, and the second video is any one of multiple videos, and a specific training process of the first conversion model may refer to the training process shown in fig. 5, which is not described herein again. After the first conversion model is obtained, the second image feature may be input into the first conversion model, and then the classification result of the second video may be output.
903. And obtaining semantic representation of the second video through a second conversion model based on the classification result of the second video, the second image feature vector, the second text feature vector and the second audio feature vector.
In this embodiment, an additional soft target loss function may be added on the basis of the second conversion model, and the original loss function of the second conversion model and the soft target loss function may be weighted through the weight λ.
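For illustration only, the second conversion model may, for example, concatenate the classification result of the first conversion model with the image, text and audio feature vectors before producing the semantic representation, as in the following sketch; the fusion strategy, the class name SecondConversionModel and all dimensions are assumptions rather than details of the present application.

```python
import torch
import torch.nn as nn

class SecondConversionModel(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, aud_dim=128, num_classes=10, sem_dim=256):
        super().__init__()
        in_dim = img_dim + txt_dim + aud_dim + num_classes
        self.fuse = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, sem_dim))
        self.classifier = nn.Linear(sem_dim, num_classes)

    def forward(self, img_feat, txt_feat, aud_feat, teacher_probs):
        # concatenate the three modality features with the first model's classification result
        fused = torch.cat([img_feat, txt_feat, aud_feat, teacher_probs], dim=-1)
        semantic = self.fuse(fused)          # semantic representation of the second video
        logits = self.classifier(semantic)   # compared against both soft and hard targets
        return semantic, logits

# usage with random placeholder inputs
model = SecondConversionModel()
img, txt, aud = torch.randn(4, 2048), torch.randn(4, 768), torch.randn(4, 128)
teacher_probs = torch.softmax(torch.randn(4, 10), dim=-1)   # output of the first conversion model
semantic, logits = model(img, txt, aud, teacher_probs)
```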
904. The weights of the loss functions of the second conversion model are updated.
In this embodiment, the weight of the loss function of the second conversion model may be updated. Specifically, in the training process of the model in the present application, an additional soft target is added on the basis of the hard target for calculation, so that the total loss function of the model may be expressed by the following formula:
L = λL_soft + (1 − λ)L_hard
where L is the final loss function, λ is the weight of the soft target loss function, L_soft is the loss function of the soft target, and L_hard is the loss function of the hard target.
The final loss function L is obtained by adjusting the weights of the loss function of the soft target and the loss function of the hard target.
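For illustration only, the total loss function above may be computed as in the following sketch, assuming the soft target is the class distribution predicted by the first conversion model and the hard target is the ground-truth label; the temperature T used to soften the distributions is an additional assumption commonly used in knowledge distillation and is not taken from the present application.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, lam=0.5, T=2.0):
    # cross entropy against the soft target (softened output of the first conversion model)
    soft_p = F.softmax(teacher_logits / T, dim=-1)
    loss_soft = -(soft_p * F.log_softmax(student_logits / T, dim=-1)).sum(dim=-1).mean()
    # cross entropy against the hard target (ground-truth label)
    loss_hard = F.cross_entropy(student_logits, labels)
    return lam * loss_soft + (1.0 - lam) * loss_hard   # L = λ·L_soft + (1 − λ)·L_hard

# usage with random placeholder tensors
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
total = distillation_loss(student, teacher, labels, lam=0.3)
```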
905. And iteratively executing steps 902 to 904 until a preset iteration termination condition is reached, and determining the second conversion model reaching the preset iteration termination condition as the target semantic conversion model.
In this embodiment, in the iterative computation process, it is determined whether an iteration termination condition is currently satisfied; if so, the iteration is stopped and the second conversion model when the iteration is stopped is determined as the target semantic conversion model, and if not, steps 902 to 904 are repeatedly performed. Specifically, it may be determined whether the number of iterations reaches a preset value, and if so, it is determined that the preset iteration termination condition is satisfied.
Alternatively,
it can be judged whether the loss function of the second conversion model has converged, that is, whether the value of the loss function no longer changes greatly after multiple iterations; if so, it is determined that the preset iteration termination condition is met.
In practical applications, other conditions may also be used as iteration termination conditions, which are not limited herein, and the loss function may be a Cross Entropy (Cross Entropy Loss) function or a softmax function, which is not limited herein.
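For illustration only, the iteration termination check may be sketched as follows, assuming either a preset maximum number of iterations or convergence of the loss (its value no longer changing greatly over recent iterations); the threshold values and the helper name should_stop are assumptions.

```python
def should_stop(loss_history, iteration, max_iters=10000, tol=1e-4, window=5):
    """Return True when the preset iteration count is reached or the loss has converged."""
    if iteration >= max_iters:                      # preset number of iterations reached
        return True
    if len(loss_history) >= window:
        recent = loss_history[-window:]
        if max(recent) - min(recent) < tol:         # loss value no longer changes greatly
            return True
    return False

# usage: stop once the last five recorded losses differ by less than the tolerance
print(should_stop([0.512, 0.5119, 0.5119, 0.5118, 0.5118], iteration=120))
```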
It should be noted that the above optimization method for the target semantic conversion model in fig. 5 may also be applied to the optimization of the model shown in fig. 9, which has been specifically described above and will not be described herein again.
In summary, in this embodiment, in the process of training the target semantic model, the soft target is added to the loss function, so that the feature information of each single mode of the video is integrated and information interaction is performed between the single modes, which improves the richness of the final video semantic representation; therefore, the trained target semantic conversion model can improve the effect of the video semantic representation when in use.
The embodiments of the present application are described above from the perspective of a method of video recommendation, and are described below from the perspective of an apparatus of video recommendation.
Referring to fig. 10, fig. 10 is a schematic view of a virtual structure of a video recommendation apparatus according to an embodiment of the present application, including:
a first determining unit 1001, configured to determine a target image feature vector, where the target image feature vector is an image feature vector corresponding to a target video, the target video is a video to be recommended to a target object, and the target image feature vector includes channel information and optical flow information of a video image frame corresponding to the target video;
the first determining unit 1001 is further configured to determine a target text feature vector, where the target text feature vector is a text feature vector corresponding to the target video, and the target text feature vector includes title information of the target video and attribute information of audio associated with the target video;
the first determining unit 1001 further determines a target audio feature vector, where the target audio feature vector is an audio feature vector corresponding to the target video;
a processing unit 1002, configured to obtain a semantic representation of the target video through a target semantic conversion model based on the target image feature vector, the target text feature vector, and the target audio feature vector, where the target semantic conversion model corresponds to the target image feature, the target text feature, and the target audio feature;
a pushing unit 1003, configured to push the target video to the target object when the semantic representation of the target video matches the interest profile of the target object.
Optionally, the video recommendation apparatus further includes:
the model training unit 1004 is specifically configured to:
step 1) determining an image feature vector, a text feature vector and an audio feature vector of each video in a plurality of videos;
step 2) initializing a semantic conversion model;
step 3) obtaining a semantic representation of a first video through the initialized semantic conversion model based on a first image feature vector, a first text feature vector and a first audio feature vector of the first video, wherein the first video is any one of the videos;
step 4), updating a loss function of the semantic conversion model;
and (4) performing iteration from the step 3) to the step 4) until a preset iteration termination condition is reached, and determining the semantic conversion model reaching the preset iteration termination condition as the target semantic conversion model.
Optionally, the video recommendation apparatus further includes:
a second determining unit 1005 for determining a probability distribution of the first image feature vector, a probability distribution of the first text feature vector, and a probability distribution of the first audio feature vector;
a feature conversion unit 1006, configured to modify a probability distribution of a first feature vector so as to minimize a relative entropy of the modified probability distribution of the first feature vector and probability distributions of other feature vectors, where the first feature vector is any one of the first image feature vector, the first text feature vector, and the first audio feature vector, and the other feature vectors are feature vectors other than the first feature vector in the first image feature vector, the first text feature vector, and the first audio feature vector;
the model training unit 1004 obtains the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video, and includes:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first feature vector after the probability distribution is changed.
Optionally, the second determining unit 1005 is further configured to determine a vector distance between a second feature vector and a third feature vector, where the second feature vector and the third feature vector are any two feature vectors of the first image feature vector, the first text feature vector, and the first audio feature vector;
the feature conversion unit 1006 is further configured to migrate the second feature vector to the vector space of the third feature vector, so that the vector distance between the second feature vector and the third feature vector is minimum;
the model training unit 1004 obtains the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video, and includes:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the second feature vector and the third feature vector after the migration.
Optionally, the video recommendation apparatus further includes:
an establishing unit 1007 for establishing a target semantic space;
the feature conversion unit 1006 is further configured to migrate a fourth feature vector and a fifth feature vector to the target semantic space, where the fourth feature vector and the fifth feature vector are any two feature vectors of the first image feature vector, the first text feature vector, and the first audio feature vector;
a third determining unit 1008, configured to determine a target feature vector based on the fourth feature vector and the fifth feature vector after migration, where the target feature vector is a feature vector in the target semantic space whose vector distances from the fourth feature vector and the fifth feature vector are both smaller than a preset value;
the model training unit 1004 obtains the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video, and includes:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the target feature vector.
Optionally, the model training unit 1004 is further configured to:
step 1) determining an image feature vector, a text feature vector and an audio feature vector of each of the plurality of videos;
step 2) based on second image characteristics, obtaining a classification result of a second video through a first conversion model, wherein the first conversion model is a model corresponding to the image characteristics of the second video, the second image characteristics are image characteristic vectors corresponding to video image frames of the second video, and the second video is any one of the videos;
step 3) obtaining semantic representation of the second video through a second conversion model based on the classification result of the second video, the second image feature vector, a second text feature vector and a second audio feature vector, wherein the second text feature vector is a text feature vector corresponding to the second video, and the second audio feature vector is an audio feature vector corresponding to the second video;
step 4) updating the weight of the loss function of the second conversion model;
and (4) iteratively executing the step 2) to the step 4) until the preset iteration termination condition is reached, and determining the second conversion model reaching the preset iteration termination condition as the target semantic conversion model.
Optionally, the model training unit 1004 updating the loss function of the semantic conversion model includes:
updating the weight corresponding to a fourth feature vector in the loss function through a back propagation algorithm, wherein the fourth feature vector is any one feature vector in the first image feature vector, the first text feature vector and the first audio feature vector;
and updating the weight corresponding to the first image feature vector, the weight corresponding to the first text feature vector and the weight corresponding to the first audio feature vector in the loss function through the back propagation algorithm.
Optionally, the model training unit 1004 is further configured to:
judging whether the iteration times reach a preset value, if so, determining that the preset iteration termination condition is reached;
or alternatively,
and judging whether the loss function of the semantic conversion model is converged, if so, determining that the preset iteration termination condition is reached.
In summary, it can be seen that, when a target video is recommended to a user, the target video can be pushed to the user according to whether the semantic representation of the target video matches the user's interest profile, and the semantic representation of the target video is obtained through the target semantic conversion model based on the image feature vector, the text feature vector and the audio feature vector of the target video. Therefore, in the present application, the feature information of each single mode of the video is integrated, the richness of the final video semantic representation is improved, the obtained video semantic representation better fits the theme of the target video, and videos that better match the user's interests are recommended.
Referring to fig. 11, fig. 11 is a schematic diagram of a hardware structure of a server according to an embodiment of the present application. The server 1100 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 1122 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing an application program 1142 or data 1144. The memory 1132 and the storage medium 1130 may be transient storage or persistent storage. The program stored on the storage medium 1130 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Still further, the central processor 1122 may be configured to communicate with the storage medium 1130 and execute, on the server 1100, the series of instruction operations in the storage medium 1130.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input-output interfaces 1158, and/or one or more operating systems 1141, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and the like.
The steps performed by the video recommendation apparatus in the above-described embodiment may be based on the server structure shown in fig. 11.
The embodiment of the present application further provides a computer storage medium, on which a program is stored, and the program, when executed by a processor, implements the steps of the video recommendation method described above.
The embodiment of the application further provides a processor, wherein the processor is used for executing a program, and the program executes the steps of the video recommendation method when running.
The embodiment of the application also provides terminal equipment, the equipment comprises a processor, a memory and a program which is stored on the memory and can be run on the processor, and the steps of the video recommendation method are realized when the processor executes the program.
The present application further provides a computer program product adapted to perform the steps of the above-described video recommendation method when executed on a video recommendation device.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable video recommendation device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable video recommendation device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable video recommendation device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable video recommendation device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer implemented process such that the instructions which execute on the computer or other programmable device provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for video recommendation, comprising:
determining an image feature vector, a text feature vector, and an audio feature vector for each of a plurality of videos;
initializing a semantic conversion model;
obtaining a semantic representation of a first video through the initialized semantic conversion model based on a first image feature vector, a first text feature vector and a first audio feature vector of the first video, wherein the first video is any one of the videos;
updating a loss function of the semantic conversion model;
when a preset iteration termination condition is reached, determining the semantic conversion model reaching the preset iteration termination condition as the target semantic conversion model;
determining semantic features of a target video according to the target semantic conversion model, wherein the target video is a video to be recommended to a target object;
and when the semantic representation of the target video matches the interest profile of the target object, pushing the target video to the target object.
2. The method of claim 1, wherein before the semantic representation of the first video is obtained by the semantic conversion model after initialization based on the first image feature vector, the first text feature vector, and the first audio feature vector of the first video, the method further comprises:
determining a probability distribution of the first image feature vector, a probability distribution of the first text feature vector, and a probability distribution of the first audio feature vector;
modifying the probability distribution of a first feature vector so as to minimize the relative entropy of the modified probability distribution of the first feature vector and the probability distributions of other feature vectors, wherein the first feature vector is any one of the first image feature vector, the first text feature vector and the first audio feature vector, and the other feature vectors are the feature vectors except the first feature vector in the first image feature vector, the first text feature vector and the first audio feature vector;
the obtaining of the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the first feature vector after the probability distribution is changed.
3. The method of claim 1, wherein before the semantic representation of the first video is obtained by the semantic conversion model after initialization based on the first image feature vector, the first text feature vector, and the first audio feature vector of the first video, the method further comprises:
determining a vector distance between a second feature vector and a third feature vector, wherein the second feature vector and the third feature vector are any two feature vectors of the first image feature vector, the first text feature vector and the first audio feature vector;
migrating the second feature vector to the vector space of the third feature vector such that the vector distance between the second feature vector and the third feature vector is minimal;
the obtaining of the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the second feature vector and the third feature vector after the migration.
4. The method of claim 1, wherein before the semantic representation of the first video is obtained by the semantic conversion model after initialization based on the first image feature vector, the first text feature vector, and the first audio feature vector of the first video, the method further comprises:
establishing a target semantic space;
migrating a fourth feature vector and a fifth feature vector to the target semantic space, wherein the fourth feature vector and the fifth feature vector are any two feature vectors of the first image feature vector, the first text feature vector and the first audio feature vector;
determining a target feature vector based on the fourth feature vector and the fifth feature vector after migration, wherein the target feature vector is a feature vector in the target semantic space, and the vector distances between the target feature vector and the fourth feature vector and between the target feature vector and the fifth feature vector are both smaller than a preset value;
the obtaining of the semantic representation of the first video through the initialized semantic conversion model based on the first image feature vector, the first text feature vector and the first audio feature vector of the first video comprises:
and obtaining the semantic representation of the first video through the initialized semantic conversion model based on the target feature vector.
5. The method according to any one of claims 1 to 4, further comprising:
step 1) determining an image feature vector, a text feature vector and an audio feature vector of each of the plurality of videos;
step 2) based on second image characteristics, obtaining a classification result of a second video through a first conversion model, wherein the first conversion model is a model corresponding to the image characteristics of the second video, the second image characteristics are image characteristic vectors corresponding to video image frames of the second video, and the second video is any one of the videos;
step 3) obtaining semantic representation of the second video through a second conversion model based on the classification result of the second video, the second image feature vector, a second text feature vector and a second audio feature vector, wherein the second text feature vector is a text feature vector corresponding to the second video, and the second audio feature vector is an audio feature vector corresponding to the second video;
step 4) updating the weight of the loss function of the second conversion model;
and (4) iteratively executing the step 2) to the step 4) until the preset iteration termination condition is reached, and determining the second conversion model reaching the preset iteration termination condition as the target semantic conversion model.
6. The method according to any one of claims 1 to 4, wherein the updating the loss function of the semantic conversion model comprises:
updating the weight corresponding to a fourth feature vector in the loss function through a back propagation algorithm, wherein the fourth feature vector is any one feature vector in the first image feature vector, the first text feature vector and the first audio feature vector;
and updating the weight corresponding to the first image feature vector, the weight corresponding to the first text feature vector and the weight corresponding to the first audio feature vector in the loss function through the back propagation algorithm.
7. The method according to any one of claims 1 to 4, further comprising:
judging whether the iteration times reach a preset value, if so, determining that the preset iteration termination condition is reached;
or alternatively,
and judging whether the loss function of the semantic conversion model is converged, if so, determining that the preset iteration termination condition is reached.
8. The method according to any one of claims 1 to 4, wherein the determining semantic features of a target video according to the target semantic conversion model comprises:
determining a target image feature vector, wherein the target image feature vector is an image feature vector corresponding to the target video, and the target image feature vector comprises channel information and optical flow information of a video image frame corresponding to the target video;
determining a target text feature vector, wherein the target text feature vector is a text feature vector corresponding to the target video, and the target text feature vector comprises title information of the target video and attribute information of audio associated with the target video;
determining a target audio feature vector, wherein the target audio feature vector is an audio feature vector corresponding to the target video;
and obtaining semantic representation of the target video through a target semantic conversion model based on the target image feature vector, the target text feature vector and the target audio feature vector, wherein the target semantic conversion model corresponds to the target image feature, the target text feature and the target audio feature.
9. An apparatus for video recommendation, comprising:
a model training unit to:
determining an image feature vector, a text feature vector, and an audio feature vector for each of a plurality of videos;
initializing a semantic conversion model;
obtaining a semantic representation of a first video through the initialized semantic conversion model based on a first image feature vector, a first text feature vector and a first audio feature vector of the first video, wherein the first video is any one of the videos;
updating a loss function of the semantic conversion model;
when a preset iteration termination condition is reached, determining the semantic conversion model reaching the preset iteration termination condition as the target semantic conversion model;
the processing unit is used for determining semantic features of a target video according to the target semantic conversion model, wherein the target video is a video to be recommended to a target object;
and the recommending unit is used for pushing the target video to the target object when the semantic representation of the target video matches the interest profile of the target object.
10. A computer storage medium characterized in that it comprises instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-8.
CN202010193915.4A 2019-09-16 2019-09-16 Video recommendation method and related equipment Active CN111400601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010193915.4A CN111400601B (en) 2019-09-16 2019-09-16 Video recommendation method and related equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010193915.4A CN111400601B (en) 2019-09-16 2019-09-16 Video recommendation method and related equipment
CN201910872376.4A CN110609955B (en) 2019-09-16 2019-09-16 Video recommendation method and related equipment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910872376.4A Division CN110609955B (en) 2019-09-16 2019-09-16 Video recommendation method and related equipment

Publications (2)

Publication Number Publication Date
CN111400601A true CN111400601A (en) 2020-07-10
CN111400601B CN111400601B (en) 2023-03-10

Family

ID=68891356

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010193915.4A Active CN111400601B (en) 2019-09-16 2019-09-16 Video recommendation method and related equipment
CN201910872376.4A Active CN110609955B (en) 2019-09-16 2019-09-16 Video recommendation method and related equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910872376.4A Active CN110609955B (en) 2019-09-16 2019-09-16 Video recommendation method and related equipment

Country Status (1)

Country Link
CN (2) CN111400601B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749297B (en) * 2020-03-03 2023-07-21 腾讯科技(深圳)有限公司 Video recommendation method, device, computer equipment and computer readable storage medium
CN111491187B (en) * 2020-04-15 2023-10-31 腾讯科技(深圳)有限公司 Video recommendation method, device, equipment and storage medium
CN111524612B (en) * 2020-04-26 2023-04-07 腾讯科技(深圳)有限公司 Infectious disease tracing method and device, computer equipment and storage medium
CN111984867B (en) * 2020-08-20 2023-06-06 北京奇艺世纪科技有限公司 Network resource determining method and device
CN112015949B (en) * 2020-08-26 2023-08-29 腾讯科技(上海)有限公司 Video generation method and device, storage medium and electronic equipment
CN111984825A (en) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 Method and apparatus for searching video
CN112347791B (en) * 2020-11-06 2023-10-13 北京奇艺世纪科技有限公司 Method, system, computer equipment and storage medium for constructing text matching model
CN112738557A (en) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 Video processing method and device


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281853B (en) * 2014-09-02 2017-11-17 电子科技大学 A kind of Activity recognition method based on 3D convolutional neural networks
CN106610969A (en) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 Multimodal information-based video content auditing system and method
CN105404698A (en) * 2015-12-31 2016-03-16 海信集团有限公司 Education video recommendation method and device
CN105847996A (en) * 2016-05-25 2016-08-10 腾讯科技(深圳)有限公司 Video playing method and apparatus
CN106294783A (en) * 2016-08-12 2017-01-04 乐视控股(北京)有限公司 A kind of video recommendation method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101299241A (en) * 2008-01-14 2008-11-05 浙江大学 Method for detecting multi-mode video semantic conception based on tensor representation
CN103503463A (en) * 2011-11-23 2014-01-08 华为技术有限公司 Video advertisement broadcasting method, device and system
CN104899273A (en) * 2015-05-27 2015-09-09 东南大学 Personalized webpage recommendation method based on topic and relative entropy
CN110175323A (en) * 2018-05-31 2019-08-27 腾讯科技(深圳)有限公司 Method and device for generating message abstract
CN109710800A (en) * 2018-11-08 2019-05-03 北京奇艺世纪科技有限公司 Model generating method, video classification methods, device, terminal and storage medium
CN109711430A (en) * 2018-11-23 2019-05-03 北京三快在线科技有限公司 A kind of migration knowledge determines method, apparatus, equipment and readable storage medium storing program for executing
CN109814831A (en) * 2019-01-16 2019-05-28 平安普惠企业管理有限公司 Intelligent dialogue method, electronic device and storage medium
CN110149541A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Video recommendation method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Baocheng: "Research and Implementation of a Text Summarization Method Based on Shallow Semantic Analysis" *

Also Published As

Publication number Publication date
CN110609955B (en) 2022-04-05
CN111400601B (en) 2023-03-10
CN110609955A (en) 2019-12-24

Similar Documents

Publication Publication Date Title
CN110609955B (en) Video recommendation method and related equipment
CN113762322A (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN111597779B (en) Text generation method, device, equipment and storage medium
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN114329029A (en) Object retrieval method, device, equipment and computer storage medium
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN113836303A (en) Text type identification method and device, computer equipment and medium
CN114358109A (en) Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN114281935A (en) Training method, device, medium and equipment for search result classification model
CN113761282B (en) Video duplicate checking method and device, electronic equipment and storage medium
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN116935261A (en) Data processing method and related device
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN111222011B (en) Video vector determining method and device
CN113821687A (en) Content retrieval method and device and computer readable storage medium
CN115705705A (en) Video identification method, device, server and storage medium based on machine learning
CN113704528A (en) Clustering center determination method, device and equipment and computer storage medium
KR102564182B1 (en) Method, apparatus and system for extracting facial expression images based on image data using artificial intelligence models and creating contents using them
KR102590388B1 (en) Apparatus and method for video content recommendation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40026274

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant