CN113343936B - Training method and training device for video characterization model - Google Patents


Publication number
CN113343936B
Authority
CN
China
Prior art keywords
video
task
model
training
information
Prior art date
Legal status
Active
Application number
CN202110799842.8A
Other languages
Chinese (zh)
Other versions
CN113343936A (en)
Inventor
林和政
吴翔宇
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110799842.8A
Publication of CN113343936A
Application granted
Publication of CN113343936B


Abstract

The disclosure relates to a training method of a video characterization model, comprising the following steps: acquiring a training video, annotation data of the training video about a main task, and annotation data of the training video about an auxiliary task; acquiring multiple kinds of modality information from the training video; inputting each kind of modality information into the corresponding feature extraction model and extracting its features; inputting the features of the multiple kinds of modality information into a feature fusion model to obtain a multi-modal fusion feature; inputting the multi-modal fusion feature into a main task model to obtain main task prediction data; inputting the feature of the one kind of modality information related to the auxiliary task, among the features extracted by the feature extraction models, into an auxiliary task model to obtain auxiliary task prediction data; and adjusting the parameters of the respective models based on the main task prediction data and the annotation data about the main task, as well as the auxiliary task prediction data and the annotation data about the auxiliary task, thereby training the video characterization model.

Description

Training method and training device for video characterization model
Technical Field
The disclosure relates to the technical field of videos, in particular to a training method and device for a video characterization model, a video classification method and device based on the video characterization model, a video clustering recommendation method and device based on the video characterization model, and a video searching method and device based on the video characterization model.
Background
Short video has become one of the important carriers of multimedia content. A video cannot be fully characterized by images alone, so information such as audio and text needs to be introduced, and the video is therefore characterized by multi-modal information. Moreover, the quality of the video characterization has a large impact on downstream tasks. A high-quality video characterization benefits downstream tasks such as the video classification task, the video clustering recommendation task, and the video search task. Conversely, if the video characterization is of low quality or the modality information is unbalanced, the results of the video classification task, the video clustering recommendation task, the video search task, and the like are affected, so that the results provided to the user are inaccurate and the user experience is degraded.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a video characterization model, a video classification method and apparatus based on a video characterization model, a video clustering recommendation method and apparatus based on a video characterization model, a video search method and apparatus based on a video characterization model, an electronic device, a computer-readable storage medium, and a computer program product to solve at least the problems in the related art described above.
According to a first aspect of embodiments of the present disclosure, there is provided a training method of a video characterization model, the video characterization model including a plurality of feature extraction models and feature fusion models respectively corresponding to a plurality of modality information, the training method including: acquiring a training video, annotation data of the training video about a main task and annotation data of the training video about an auxiliary task, wherein the main task is a task based on characteristics of the multi-modal information, and the auxiliary task is a task based on characteristics of one modal information in the multi-modal information; acquiring the multi-modal information contained in the training video from the training video; respectively inputting the multiple modal information into corresponding feature extraction models, and extracting the features of the multiple modal information of the training video by the multiple feature extraction models; inputting the characteristics of the multi-mode information into the characteristic fusion model, and obtaining multi-mode fusion characteristics of the training video by the characteristic fusion model; inputting the multi-mode fusion characteristics into a main task model, and obtaining main task prediction data of the training video by the main task model; inputting the characteristic of one mode information related to the auxiliary task in the characteristics of the multiple mode information of the training video extracted by the multiple characteristic extraction models into an auxiliary task model, and obtaining auxiliary task prediction data of the training video by the auxiliary task model; and adjusting parameters of the feature extraction models, the feature fusion model, the main task model and the auxiliary task model based on the main task prediction data and the annotation data of the training video about the main task, the auxiliary task prediction data and the annotation data of the training video about the auxiliary task, and training the video characterization model.
Optionally, adjusting parameters of the plurality of feature extraction models, the feature fusion model, the primary task model, the auxiliary task model, and the training video based on the primary task prediction data and the annotation data for the primary task, the auxiliary task prediction data, and the annotation data for the auxiliary task of the training video, the step of training the video characterization model includes: calculating main task loss between the main task prediction data and the labeling data of the training video about main tasks by using a main task loss function, and calculating auxiliary task loss between the auxiliary task prediction data and the labeling data of the training video about auxiliary tasks by using an auxiliary task loss function; and adjusting parameters of the feature extraction models, the feature fusion model, the main task model and the auxiliary task model according to a value obtained by summing the main task loss and the auxiliary task loss according to a preset weight ratio, and training the video characterization model.
Optionally, the plurality of modal information includes any two or more of image information, text information, audio information, geographical location information, and time information.
Optionally, the main task is one of a video classification task, a video clustering recommendation task and a video search task.
Optionally, the auxiliary task is one of a video classification task, a video clustering recommendation task, a video search task, a task of classifying audio information in video, and a task of predicting a context by using text information.
Optionally, the auxiliary tasks include n auxiliary tasks, where 1 ≤ n ≤ the number of kinds of modality information in the plurality of modality information.
Optionally, before inputting the features of the multiple modal information into the feature fusion model, the features of the multiple modal information are mapped to the same feature space respectively, and then the mapped features of the multiple modal information are input into the feature fusion model.
Optionally, the plurality of modal information at least includes image information, and the auxiliary task is a task based on characteristics of the image information.
Optionally, the multi-modal information at least includes audio information, and the auxiliary task is a task based on characteristics of the audio information.
According to a second aspect of embodiments of the present disclosure, there is provided a video classification method based on a video characterization model, including: acquiring videos to be classified; inputting the video to be classified into a video characterization model trained according to the training method, and obtaining multi-mode fusion characteristics of the video to be classified by the video characterization model; and classifying the videos to be classified based on the multi-mode fusion features.
Optionally, the step of classifying the video to be classified based on the multimodal fusion feature includes: and inputting the multi-mode fusion features into a video classification model to obtain a classification result of the video to be classified.
According to a third aspect of embodiments of the present disclosure, there is provided a video clustering recommendation method based on a video characterization model, including: acquiring a video to be recommended; inputting the video to be recommended into a video characterization model trained according to the training method, and obtaining multi-mode fusion characteristics of the video to be recommended by the video characterization model; and clustering and recommending the video to be recommended based on the multi-mode fusion characteristics.
According to a fourth aspect of embodiments of the present disclosure, there is provided a video searching method based on a video characterization model, including: acquiring text information for video searching, and calculating characteristics of the text information; acquiring multi-mode fusion characteristics of the video by utilizing the video characterization model obtained through training according to the training method; and comparing the multimodal fusion characteristics with the characteristics of the text information, and determining the video with high similarity as a search result.
According to a fifth aspect of embodiments of the present disclosure, there is provided a training apparatus for a video characterization model including a plurality of feature extraction models and feature fusion models respectively corresponding to a plurality of modality information, the training apparatus including: a first acquisition unit configured to: acquiring a training video, annotation data of the training video about a main task and annotation data of the training video about an auxiliary task, wherein the main task is a task based on characteristics of the multi-modal information, and the auxiliary task is a task based on characteristics of one modal information in the multi-modal information; a second acquisition unit configured to: acquiring the multi-modal information contained in the training video from the training video; a feature extraction unit configured to: respectively inputting the multiple modal information into corresponding feature extraction models, and extracting the features of the multiple modal information of the training video by the multiple feature extraction models; a feature fusion unit configured to: inputting the characteristics of the multi-mode information into the characteristic fusion model, and obtaining multi-mode fusion characteristics of the training video by the characteristic fusion model; a primary task prediction unit configured to: inputting the multi-mode fusion characteristics into a main task model, and obtaining main task prediction data of the training video by the main task model; an auxiliary task prediction unit configured to: inputting the characteristic of one mode information related to the auxiliary task in the characteristics of the multiple mode information of the training video extracted by the multiple characteristic extraction models into an auxiliary task model, and obtaining auxiliary task prediction data of the training video by the auxiliary task model; a training unit configured to: and adjusting parameters of the feature extraction models, the feature fusion model, the main task model and the auxiliary task model based on the main task prediction data and the annotation data of the training video about the main task, the auxiliary task prediction data and the annotation data of the training video about the auxiliary task, and training the video characterization model.
Optionally, the training unit is configured to: calculating main task loss between the main task prediction data and the labeling data of the training video about main tasks by using a main task loss function, and calculating auxiliary task loss between the auxiliary task prediction data and the labeling data of the training video about auxiliary tasks by using an auxiliary task loss function; and adjusting parameters of the feature extraction models, the feature fusion model, the main task model and the auxiliary task model according to a value obtained by summing the main task loss and the auxiliary task loss according to a preset weight ratio, and training the video characterization model.
Optionally, the plurality of modal information includes any two or more of image information, text information, audio information, geographical location information, and time information.
Optionally, the main task is one of a video classification task, a video clustering recommendation task and a video search task.
Optionally, the auxiliary task is one of a video classification task, a video clustering recommendation task, a video search task, a task of classifying audio information in video, and a task of predicting a context by using text information.
Optionally, the auxiliary tasks include n auxiliary tasks, where 1 ≤ n ≤ the number of kinds of modality information in the plurality of modality information.
Optionally, the feature fusion unit is configured to: before the features of the multi-modal information are input into the feature fusion model, the features of the multi-modal information are respectively mapped into the same feature space, and then the mapped features of the multi-modal information are input into the feature fusion model.
Optionally, the plurality of modal information at least includes image information, and the auxiliary task is a task based on characteristics of the image information.
Optionally, the multi-modal information at least includes audio information, and the auxiliary task is a task based on characteristics of the audio information.
According to a sixth aspect of embodiments of the present disclosure, there is provided a video classification apparatus based on a video characterization model, including: an acquisition unit configured to: acquiring videos to be classified; a video characterization unit configured to: inputting the video to be classified into a video characterization model trained according to the training method, and obtaining multi-mode fusion characteristics of the video to be classified by the video characterization model; a video classification unit configured to: and classifying the videos to be classified based on the multi-mode fusion features.
Optionally, the video classification unit is configured to: and inputting the multi-mode fusion features into a video classification model to obtain a classification result of the video to be classified.
According to a seventh aspect of embodiments of the present disclosure, there is provided a video clustering recommendation device based on a video characterization model, including: an acquisition unit configured to: acquiring a video to be recommended; a video characterization unit configured to: inputting the video to be recommended into a video characterization model trained according to the training method, and obtaining multi-mode fusion characteristics of the video to be recommended by the video characterization model; a recommendation unit configured to: and clustering and recommending the video to be recommended based on the multi-mode fusion characteristics.
According to an eighth aspect of embodiments of the present disclosure, there is provided a video searching apparatus based on a video characterization model, including: a first acquisition unit configured to: acquiring text information for video searching, and calculating characteristics of the text information; a second acquisition unit configured to: acquiring multi-mode fusion characteristics of the video by utilizing the video characterization model obtained through training according to the training method; a comparison determination unit configured to: and comparing the multimodal fusion characteristics with the characteristics of the text information, and determining the video with high similarity as a search result.
According to a ninth aspect of embodiments of the present disclosure, there is provided an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a training method of a video characterization model as described above or a video classification method as described above or a video cluster recommendation method as described above or a video search method as described above.
According to a tenth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the training method of the video characterization model as described above, or the video classification method as described above, or the video cluster recommendation method as described above, or the video search method as described above.
According to an eleventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a training method of a video characterization model as described above or a video classification method as described above or a video cluster recommendation method as described above or a video search method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the training method and training device for the video characterization model, an end-to-end video characterization model that characterizes video through multiple modalities can be achieved. In addition, because the training method and training device use an auxiliary task to assist the training of the video characterization model, the trained video characterization model can characterize video features more completely and fully, and when some modality information of a video is severely missing, the model can rely more on the modality information related to the auxiliary task to generate reliable video features. The business capability of downstream tasks that require video features is thereby effectively improved.
In addition, according to the video classification method and device based on the video characterization model, the video clustering recommendation method and device based on the video characterization model and the video searching method and device based on the video characterization model, as the video characterization model can generate more reliable video features, more accurate results can be provided for users, and user experience is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is an implementation scene diagram illustrating a training method of a video characterization model, a video classification method based on the video characterization model, a video cluster recommendation method based on the video characterization model, and a video search method based on the video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a training method of a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 3 is a schematic diagram for explaining the structure of a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a method of training a video characterization model according to another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a video classification method based on a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a video clustering recommendation method based on a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 7 is a flowchart illustrating a video search method based on a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram illustrating a training apparatus of a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram illustrating a video classification device based on a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 10 is a block diagram illustrating a video clustering recommendation device based on a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 11 is a block diagram illustrating a video search device based on a video characterization model according to an exemplary embodiment of the present disclosure.
Fig. 12 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any of the items", and "all of the items". For example, "including at least one of A and B" covers the following three cases: (1) including A; (2) including B; (3) including A and B. Similarly, "at least one of step one and step two is executed" covers the following three cases: (1) executing step one; (2) executing step two; (3) executing step one and step two.
"Modality (modality)" refers to the manner in which information is represented and exchanged on a particular physical medium. In the case where the representation of the information is performed by only one way (e.g., an image), the image information may be referred to as single-mode information. Where the representation of information is performed in common in multiple ways (e.g., images, audio, text, etc.), then the multiple modality information that is commonly present (i.e., images, audio, text, etc.) may be collectively referred to as multimodal information.
Because of the wide range of applications of short video, characterizing video features by only single-modality information has not been satisfactory, and thus, characterizing video by multi-modality information is desirable. However, when the video is represented by the multi-modal information, if the representation quality of the video is low or the modal information of the video is unbalanced, the results of the video classification task, the video clustering recommendation task, the video search task and the like, which are downstream tasks of video representation, are affected, so that the results provided for the user are not accurate enough, and the user experience is reduced.
Therefore, the present disclosure provides a training method and a training device for a video characterization model, which perform model training by performing appropriate primary tasks and auxiliary tasks on the downstream of the video characterization model, so as to obtain a better video characterization model. By utilizing the video characterization model, the video characteristics can be more completely and fully characterized, and when the modal information of the video is seriously lost, the video characteristics can be generated more reliably by more depending on the modal information related to the auxiliary task, so that the business capability of the downstream task can be effectively improved.
Fig. 1 is an implementation scene diagram illustrating a training method of a video characterization model, a video classification method based on the video characterization model, a video cluster recommendation method based on the video characterization model, and a video search method based on the video characterization model according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the implementation scenario includes a server 100, a user terminal 110, and a user terminal 120. The user terminals are not limited to the number and types shown in the figures, and include electronic devices such as smartphones, personal computers, tablet computers, and the like, but may also include any other electronic devices that need to perform video classification, video clustering recommendation, video searching, and the like. The server 100 may be a single server, a server cluster formed by a plurality of servers, or a cloud computing platform or a virtualization center.
According to an exemplary embodiment of the present disclosure, the server 100 may be used for training of video characterization models, as well as performing video classification based on video characterization models, video cluster recommendation based on video characterization models, video search based on video characterization models, and the like. Before the video classification, the video clustering recommendation and the video search are performed, the server 100 needs to train the video characterization model, and then characterize the features of the video by using the trained video characterization model, so that the features of the video are used for the video classification, the video clustering recommendation, the video search and the like. In training the video characterization model, the server 100 may acquire or set a data set for training through the user terminals 110 and 120, or may acquire corresponding data through a network for training. When used for performing video classification based on a video characterization model, video cluster recommendation based on a video characterization model, video search based on a video characterization model, and the like, the server 100 may acquire video files through the user terminals 110, 120 or the network, then characterize the video files using the video characterization model, and further perform video classification, video cluster recommendation, video search, and the like using the characterized video features.
According to an exemplary embodiment of the present disclosure, the user terminal 110, 120 may be used to perform video classification based on a video characterization model, video cluster recommendation based on a video characterization model, video search based on a video characterization model, etc., and may also be used to provide video files to the server 100 for the server 100 to perform video classification based on a video characterization model, video cluster recommendation based on a video characterization model, video search based on a video characterization model, etc. When the user terminals 110, 120 are configured to perform video classification based on the video characterization model, video clustering recommendation based on the video characterization model, video search based on the video characterization model, and the like, the video files may be sent to the server 100, the video characterization model in the server 100 is invoked to perform video characterization and perform video classification, video clustering recommendation, video search, and the like based on the result of the characterization fed back, or the video characterization model may be downloaded to the local area from the server 100 and then processed locally, which is not limited by the present disclosure. The server 100, the user terminals 110, 120 implement the corresponding apparatuses by performing the training method, the video classification method, the video cluster recommendation method, and the video search method of the video characterization model of the exemplary embodiments of the present disclosure. By training, a video characterization model with more reliable video features can be generated, so that more accurate results can be provided for users in downstream tasks needing to utilize the video features, such as a video classification method, a video clustering recommendation method and a video searching method, and user experience is improved.
Next, a training method of a video characterization model according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 2.
Fig. 2 is a flowchart illustrating a training method of a video characterization model according to an exemplary embodiment of the present disclosure. Here, the video characterization model may include a plurality of feature extraction models, and a feature fusion model, which correspond to the plurality of modality information, respectively. According to exemplary embodiments of the present disclosure, the plurality of modality information may include, but is not limited to, any two or more of image information extracted from video, text information, audio information, geographical location information, time information, and the like. Here, the more the types of the modal information are, the more sufficient the information is obtained by the downstream task of the video characterization model, and accordingly, the service capability of the downstream task can be improved. The image information may include a cover image of the video, a plurality of key frame images in the video, and the like, the text information may include characters such as subtitles in the video, and the audio information may include audio such as various dubbing in the video. The feature extraction model is used to extract features from the modality information. For example, the feature extraction model may extract features from image information. According to an embodiment of the present disclosure, in a case where the plurality of modality information includes image information and audio information, the plurality of feature extraction models may include a feature extraction model corresponding to the image information and a feature extraction model corresponding to the audio information. The feature fusion model is used for combining a plurality of features into one feature with more discrimination capability than the input feature. It will be appreciated that the feature output by the feature fusion model is the output of the video characterization model. The video characterization model of the present disclosure may use various feature extraction models and feature fusion models in the related art.
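To make the composition described above concrete, the following is a minimal sketch (not the patent's exact implementation) of a video characterization model built from one feature extraction model per modality plus a feature fusion model; the class and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VideoCharacterizationModel(nn.Module):
    """One feature extractor per modality, followed by a fusion model."""
    def __init__(self, extractors: nn.ModuleDict, fusion: nn.Module):
        super().__init__()
        self.extractors = extractors  # e.g. {"image": ..., "audio": ..., "text": ...}
        self.fusion = fusion          # combines per-modality features into one vector

    def forward(self, modality_inputs: dict) -> torch.Tensor:
        # Extract one feature per modality, then fuse them into a single
        # multi-modal fusion feature that represents the video.
        feats = [self.extractors[name](x) for name, x in modality_inputs.items()]
        return self.fusion(feats)
```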
As shown in fig. 2, in step S210, a training video, annotation data for a primary task of the training video, and annotation data for a secondary task of the training video are acquired. Here, the training video, the annotation data for the main task of the training video, and the annotation data for the auxiliary task of the training video are acquired as a training data set. The training data set may be acquired by various methods. Training videos may be acquired, for example, by the user terminals 110, 120 in fig. 1, or may be acquired via a network. The main task is a task based on the characteristics of the multi-mode information, the auxiliary task is a task based on the characteristics of one mode information in the multi-mode information, namely the auxiliary task is a task based on the characteristics of the single mode information.
According to an exemplary embodiment of the present disclosure, the main task may be one of a video classification task, a video cluster recommendation task, and a video search task, whereby the trained video characterization model can be better used for the video classification task, the video cluster recommendation task, or the video search task based on the characteristics of the multi-modality information. The auxiliary task may be one of a video classification task, a video clustering recommendation task, a video search task, a task of classifying audio information in video, and a task of inferring context using text information. The trained video characterization model can be better used for video classification tasks based on features of single-mode information, video clustering recommendation tasks, video searching tasks, tasks for classifying audio information in video or tasks for predicting context by using text information. In this way, in the case where the main task is a video classification task, the annotation data on the main task of the training video may be data obtained by manually classifying (tagging) the training video. In the case that the auxiliary task is a video classification task, the annotation data of the training video about the auxiliary task may also be data obtained by manually classifying the training video. That is, when the main task is the same as the auxiliary task, the labeling data may be the same or different from each other. When the main task and the auxiliary task are different, the labeling data of the two tasks may be different or the same. In addition, under the condition that the main task is a video clustering recommendation task or a video searching task, the annotation data of the training video about the main task can be data recommended by manually clustering the training video or data of the training video meeting the user requirement can be manually searched. Similarly, the annotation data of the training video about the auxiliary task can be the data obtained by manually realizing the auxiliary task.
In step S220, the plurality of modality information contained in the training video acquired in step S210 is acquired from the training video. Here, according to an embodiment of the present disclosure, in the case where the training video contains image information, audio information, and text information, the image information, the audio information, and the text information may each be acquired from the training video. In another embodiment of the present disclosure, the image information may include a multi-frame image (e.g., 4 frames taken at equal intervals) and a cover image taken from the training video. Here, the multi-frame image may be a plurality of key frames of the training video. The number of image frames taken from the training video may be selected based on the computing power, latency requirements, etc. of the server 100. It can be understood that the more modality information is acquired, the more sufficient the information available to the downstream tasks of the video characterization model, and accordingly, the better the service capability of the downstream tasks. The multi-modal information contained in the training video may be obtained by various methods.
In step S230, the plurality of modal information acquired in step S220 is input into the corresponding feature extraction model, and features of the plurality of modal information of the training video are extracted by the plurality of feature extraction models. Here, according to the embodiment of the present disclosure, in the case where acquired multi-modality information is image information, audio information, and text information, the image information may be input into a feature extraction model for extracting image features, the audio information may be input into a feature extraction model for extracting audio features, and the text information may be input into a feature extraction model for extracting text features, whereby features of the image information, features of the audio information, and features of the text information are obtained by the three feature extraction models, respectively. Further, according to another embodiment of the present disclosure, in the case where the image information includes a multi-frame image and a cover image, the multi-frame image and the cover image may be input into two feature extraction models for extracting features of the image, respectively, and features of the multi-frame image and the cover image may be extracted, respectively. Here, the two feature extraction models for extracting the image features may be the same or different.
The feature extraction model may employ various models capable of extracting the corresponding features. As one example, the feature extraction model for extracting image features may include ResNet-50 (a deep residual network) or VGG19. Image features of a predetermined dimension can be obtained by performing convolution operations and the like on the image through ResNet-50 or VGG19.
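As a hedged illustration of such an image feature extractor, the sketch below uses a ResNet-50 backbone from torchvision; the patent only names ResNet-50/VGG19, so the weights, preprocessing, and 2048-dimensional output here are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(weights=None)   # or pretrained weights, if desired
resnet.fc = nn.Identity()                # drop the classifier; keep 2048-d features

frames = torch.randn(4, 3, 224, 224)     # e.g. 4 key frames of one training video
with torch.no_grad():
    frame_features = resnet(frames)      # shape: (4, 2048)
```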
As one example, a feature extraction model for extracting audio features may include an MLP (multi-layer perceptron). Three fully-connected operations are performed on the audio by the MLP to obtain audio features of a predetermined dimension.
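A minimal sketch of such an audio MLP with three fully-connected layers follows; the input and output dimensions are illustrative assumptions.

```python
import torch.nn as nn

class AudioMLP(nn.Module):
    """Three fully-connected layers mapping an audio descriptor to a fixed-dim feature."""
    def __init__(self, in_dim: int = 128, hidden: int = 512, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),          # third fully-connected layer
        )

    def forward(self, audio_feat):
        # audio_feat: a pre-computed audio descriptor (e.g. log-mel statistics)
        return self.net(audio_feat)
```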
As one example, the feature extraction model for extracting text features may include BERT (Bidirectional Encoder Representations from Transformers). The model may contain n Transformer layers, each of which contains a 12-head self-attention mechanism (Self-Attention), whereby text features of a predetermined dimension can be obtained.
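A hedged example of extracting a text feature with BERT via the Hugging Face transformers library is shown below; the checkpoint name and the use of the [CLS] vector are assumptions, not details taken from the patent.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("视频字幕文本", return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = bert(**inputs)
text_feature = outputs.last_hidden_state[:, 0]   # [CLS] token as a fixed-dim text feature
```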
In step S240, the features of the multiple modality information extracted in step S230 are input into a feature fusion model, and the multi-modality fusion features of the training video are obtained from the feature fusion model. Here, according to the embodiment of the present disclosure, in the case where the features of the plurality of modality information are the features of the image information, the features of the audio information, and the features of the text information, the features of the image information, the features of the audio information, and the features of the text information may all be input to the feature fusion model. Further, according to another embodiment of the present disclosure, in the case where the image information is divided into the multi-frame image and the cover image, the features of the multi-frame image and the features of the cover image may be input together with the features of the audio information and the features of the text information to the feature fusion model.
The feature fusion model may employ various models capable of feature fusion. As one example, the feature fusion model may be a multi-head cross-attention (Multi-head Cross Attention) model, which, for example, outputs the multi-modal fusion feature of the training video by fusing the features of the multi-frame image, the features of the cover image, the features of the audio information, and the features of the text information.
In addition, in the case where the features of the various modality information are extracted by feature extraction models with different properties, the resulting feature spaces may be inconsistent. Therefore, according to an exemplary embodiment of the present disclosure, before the features of the various modality information are input into the feature fusion model (step S240), the features of the various modality information may each be mapped into the same feature space, and the mapped features may then be input into the feature fusion model. Here, the features of the plurality of modality information may be mapped into the same feature space by various methods. As an example, the features of the multi-modal information may each be mapped into the same feature space through a plurality of self-attention (Self Attention) mechanisms, and the mapped features are then fused by the feature fusion model. The present disclosure is not limited thereto. In this way, the features of the plurality of modality information can be fused more effectively.
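The following sketch illustrates this fusion step under the assumption that each modality feature is first projected into a shared d-dimensional space and then mixed by multi-head attention; the projection layout, dimensions, and pooling are illustrative, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

d = 256
projections = nn.ModuleDict({          # map each modality into the same feature space
    "image": nn.Linear(2048, d),
    "audio": nn.Linear(256, d),
    "text":  nn.Linear(768, d),
})
cross_attention = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

def fuse(features: dict) -> torch.Tensor:
    # Stack the projected modality features as a short "sequence" and let
    # multi-head attention mix information across modalities.
    tokens = torch.stack([projections[k](v) for k, v in features.items()], dim=1)
    fused, _ = cross_attention(tokens, tokens, tokens)   # (batch, n_modalities, d)
    return fused.mean(dim=1)                             # one multi-modal fusion feature

feats = {"image": torch.randn(2, 2048), "audio": torch.randn(2, 256), "text": torch.randn(2, 768)}
fused_feature = fuse(feats)   # shape: (2, 256)
```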
In step S250, the multimodal fusion feature obtained in step S240 is input into a primary task model, and primary task prediction data of the training video is obtained from the primary task model. Here, in case that the primary task is a video classification task, the primary task model may be a model for video classification, and the primary task prediction data may be a result obtained by classification by the model for video classification, according to an embodiment of the present disclosure. As an example, the primary task model may be a Softmax model, and the primary task prediction data of the training video obtained from the primary task model may be a category of the training video. In the case where the primary task is a video clustering recommendation task, the primary task model may be a model for video clustering recommendation, and the primary task prediction data may be a result obtained by recommending through the model for video clustering recommendation. In the case where the primary task is a video search task, the primary task model may be a model for video search, and the primary task prediction data may be a result obtained by searching through the model for video search. The main tasks described above are tasks based on characteristics of various modal information. That is, in the case where the plurality of modality information is image information, audio information, and text information, the main task may be performed based on the features of the image information, the features of the audio information, and the features of the text information.
In step S260, the feature of one mode information related to the auxiliary task among the features of the multiple mode information of the training video extracted by the multiple feature extraction models is input into the auxiliary task model, and the auxiliary task prediction data of the training video is obtained from the auxiliary task model. Here, one type of modality information related to the auxiliary task is one type of modality information among a plurality of types of modality information required for executing the auxiliary task. According to the embodiment of the disclosure, in the case that the multi-modal information can be image information, audio information and text information, if the auxiliary task is a task based on the characteristics of the image information, the characteristics of the image information are input into an auxiliary task model, and auxiliary task prediction data of a training video is obtained by the auxiliary task model. Thus, the capability of the video characterization model to extract features of image information can be enhanced by the task (auxiliary task) based on the features of the image information. Similarly, in the case where the auxiliary task is a task based on the characteristics of the audio information, the capability of the video characterization model to extract the characteristics of the audio information can be enhanced. In addition, according to another embodiment of the present disclosure, in the case where the image information includes a multi-frame image and a cover image, the features of the multi-frame image and the features of the cover image may be input into the auxiliary task model after feature fusion, and video classification may be performed based on the features after feature fusion. Here, the feature fusion of the features of the multi-frame image and the features of the cover image may be performed by the above-described feature fusion model, and the features of the multi-frame image and the features of the cover image may be mapped to the same feature space, respectively, before the feature fusion, and then the mapped features may be input into the feature fusion model. Feature fusion may also be performed in other ways, which is not limited by the present disclosure. The features of the multi-frame image and the features of the cover image may be mapped to the same feature space in the manner described above. The mapping of features may also be performed in other ways, which the present disclosure is not limited to.
According to an embodiment of the present disclosure, in a case where the auxiliary task is a video classification task based on features of image information, the auxiliary task model may be a video classification model that performs video classification based on the features of the image information, and the auxiliary task prediction data may be a result of the auxiliary task model performing video classification based on the features of the image information. As an example, the auxiliary task model may be an MLP model, and the auxiliary task prediction data of the training video obtained by the auxiliary task model may be a category of the training video. In the case where the primary task is also a video classification task, the secondary task prediction data may be the same as or different from the primary task prediction data. As described above, the auxiliary task may be one of a video classification task based on characteristics of any one of a plurality of modality information, a video clustering recommendation task, a video search task, a task of classifying audio information in a video, and a task of inferring a context using text information. Thus, the auxiliary task model is a model for realizing corresponding tasks based on the characteristics of the corresponding modal information, and corresponding auxiliary task prediction data is obtained.
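As a sketch of such an auxiliary task head, a small MLP that classifies the video from the features of a single modality (here image features) could look as follows; the dimensions and number of classes are assumptions for illustration.

```python
import torch.nn as nn

class AuxiliaryClassifier(nn.Module):
    """Auxiliary task model: video classification from one modality's features."""
    def __init__(self, feat_dim: int = 256, num_classes: int = 100):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, single_modality_feature):
        return self.head(single_modality_feature)   # auxiliary task prediction (logits)
```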
In step S270, parameters of the plurality of feature extraction models, the feature fusion model, the primary task model, and the auxiliary task model are adjusted based on the primary task prediction data obtained in step S250, the labeling data about the primary task of the training video obtained in step S210, the auxiliary task prediction data obtained in step S260, and the labeling data about the auxiliary task of the training video obtained in step S210, and the video characterization model is trained. Here, according to an embodiment of the present disclosure, in a case where the auxiliary task is a task based on features of image information, and the image information includes a multi-frame image and a cover image, the model to be adjusted may further include a feature fusion model regarding the auxiliary task, and a model for mapping features of the multi-frame image and features of the cover image, respectively. That is, training of the video characterization model requires simultaneous training of the various models used in the training process. The models used in the training process are not necessarily all contained in the video characterization model. In contrast, in the case where the image information does not include the multi-frame image and the cover image, the auxiliary task may perform the corresponding auxiliary task directly based on the features of the image information extracted by the feature extraction model, without going through the feature fusion model and the model for mapping.
According to an exemplary embodiment of the present disclosure, step S270 may include: and calculating the main task loss between the main task prediction data and the labeling data of the training video about the main task by using the main task loss function, and calculating the auxiliary task loss between the auxiliary task prediction data and the labeling data of the training video about the auxiliary task by using the auxiliary task loss function. And then, according to the main task loss and the auxiliary task loss, adjusting parameters of a plurality of feature extraction models, a feature fusion model, a main task model and an auxiliary task model according to a value obtained by summing the main task loss and the auxiliary task loss according to a preset weight ratio, and training the video representation model.
Here, the main task loss function is a loss function for the main task and may differ depending on the main task; the auxiliary task loss function is a loss function for the auxiliary task and may differ depending on the auxiliary task. According to an embodiment of the present disclosure, in the case where the main task is a video classification task and the auxiliary task is also a video classification task, the loss functions of both may be cross-entropy loss functions, that is:
CELoss = -(1/N) × Σ_{k=1..N} y_k × log(p_k)
Here, y_k denotes the kth annotation data (i.e., the true class (label) of the kth training video), p_k denotes the kth prediction data (i.e., the predicted class probability of the kth training video), and N is the number of samples in the training dataset.
After the main task loss and the auxiliary task loss are calculated, they are summed at a predetermined weight ratio (e.g., 1:0.5) to yield the loss of the video characterization model. Here, the weight ratio may be adjusted as needed. According to an embodiment of the present disclosure, the loss of the video characterization model may be calculated as follows:
Total Loss = α × CELoss(image information) + β × CELoss(multi-modal information)
wherein CELoss(image information) is the auxiliary task loss and CELoss(multi-modal information) is the main task loss.
In this way, the loss of the video characterization model is calculated. The parameters of each model are then updated by back-propagating gradients so as to minimize the loss, until the video characterization model converges. In this manner, the weights of the main task and the auxiliary task in the training process can be set as needed, so that the desired video characterization model is obtained.
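The following is a hedged sketch of one training step that combines the two cross-entropy losses with a preset weight ratio (1:0.5 here, as in the example above) and back-propagates through all sub-models jointly; `forward_features`, `main_head`, and `aux_head` are hypothetical names standing in for the extractors, fusion model, and task models.

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
alpha, beta = 0.5, 1.0   # auxiliary-task weight, main-task weight (assumed values)

def training_step(models, optimizer, batch):
    # `models` is assumed to bundle the extractors, fusion model and task heads.
    fused_feat, image_feat = models.forward_features(batch["modalities"])
    main_logits = models.main_head(fused_feat)   # main task prediction data
    aux_logits = models.aux_head(image_feat)     # auxiliary task prediction data

    main_loss = criterion(main_logits, batch["main_labels"])
    aux_loss = criterion(aux_logits, batch["aux_labels"])
    total_loss = alpha * aux_loss + beta * main_loss   # Total Loss = α·CELoss_aux + β·CELoss_main

    optimizer.zero_grad()
    total_loss.backward()   # gradients flow into every sub-model's parameters
    optimizer.step()
    return total_loss.item()
```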
According to the training method of the video characterization model, the model training is performed by executing proper main tasks and auxiliary tasks at the downstream of the video characterization model, so that a better video characterization model can be obtained. By utilizing the video characterization model, the video characteristics can be more completely and fully characterized, and when the modal information of the video is seriously lost, the video characteristics can be generated more reliably by more depending on the modal information related to the auxiliary task, so that the business capability of the downstream task can be effectively improved.
Fig. 3 is a schematic diagram for explaining the structure of a video characterization model according to an exemplary embodiment of the present disclosure.
As shown in fig. 3, the video characterization model is the portion enclosed by the dashed box, and the main task model and the auxiliary task model outside the dashed box are added for training the video characterization model. In the video characterization model, the plurality of modality information contained in the training video (for example, a cover image, a multi-frame image, text information, audio information, and other information) is respectively input into the corresponding feature extraction models (for example, ResNet, ResNet, BERT, an MLP model, and other models), so that the features of the plurality of modality information (for example, a cover image feature, a multi-frame image feature, a text feature, an audio feature, and other information features) are respectively obtained. The features of the multi-modal information are then respectively input into self-attention models, mapped into the same feature space, and output into the feature fusion model for the main task to obtain the multi-modal fusion feature. In addition, since the auxiliary task is a task based on image information, the mapped cover image feature and multi-frame image feature are also input into a feature fusion model for the auxiliary task to obtain an image information fusion feature. Finally, the main task model (i.e., a video classification model) classifies based on the multi-modal fusion feature, and the auxiliary task model (i.e., a video classification model) classifies based on the image information fusion feature.
According to an exemplary embodiment of the present disclosure, the auxiliary tasks employed in the training method of the video characterization model shown in fig. 2 may include n auxiliary tasks, where 1 ≤ n ≤ the number of kinds of modality information among the plurality of modality information. Fig. 4 is a flowchart illustrating a training method of a video characterization model according to another exemplary embodiment of the present disclosure, in which the same steps as in fig. 2 are denoted by the same reference numerals. Here, only the steps different from fig. 2 will be described.
In the case that the auxiliary tasks include n auxiliary tasks, step S410 acquires the training video, the annotation data of the training video about the main task, and n annotation data of the training video about corresponding auxiliary tasks among the n auxiliary tasks, respectively. Here, the n auxiliary tasks may be the same auxiliary task, for example, may be video classification tasks. But the modality information associated with each auxiliary task is different. That is, the n auxiliary tasks are tasks based on features of different modality information among the plurality of modality information, respectively. For example, in the case where the multi-modality information is image information, audio information, text information, the n auxiliary tasks may be 3 auxiliary tasks, and the 3 auxiliary tasks may be a task based on a feature of the image information, a task based on a feature of the audio information, and a task based on a feature of the text information, respectively. In this case, the n annotation data of the training video regarding the corresponding auxiliary task of the n auxiliary tasks may be annotation data of the training video regarding the task based on the feature of the image information, annotation data of the training video regarding the task based on the feature of the audio information, and annotation data of the training video regarding the task based on the feature of the text information. It will be appreciated that in the case where the 3 auxiliary tasks are video classification tasks, the 3 annotation data for the 3 auxiliary tasks may be the same or different, respectively.
In step S460, the features of the n kinds of modal information related to the n kinds of auxiliary tasks among the features of the multiple kinds of modal information of the training video extracted by the multiple feature extraction models are respectively input into corresponding auxiliary task models among the n auxiliary task models, and n auxiliary task prediction data of the training video are obtained by the n auxiliary task models. According to the embodiment of the disclosure, in the case that the multi-modal information is image information, audio information, text information, and n is 3, the features of the image information, the audio information, and the text information respectively related to the 3 auxiliary tasks may be respectively input into the respectively corresponding auxiliary task models. Specifically, the features of the image information may be input into a model of the task based on the features of the image information, the features of the audio information may be input into a model of the task based on the features of the audio information, and the features of the text information may be input into a model of the task based on the features of the text information. Thus, 3 auxiliary task prediction data are obtained through the three auxiliary task models.
In step S470, parameters of the plurality of feature extraction models, the feature fusion model, the main task model, and the n auxiliary task models are adjusted based on the main task prediction data and the annotation data of the training video for the main task, the n auxiliary task prediction data obtained in step S460, and the n annotation data of the training video for the corresponding auxiliary task of the n auxiliary tasks, respectively, and the video characterization model is trained.
According to an exemplary embodiment of the present disclosure, step S470 may include: calculating a main task loss between the main task prediction data and the annotation data of the training video about the main task by using the main task loss function, and calculating, by using n auxiliary task loss functions respectively, an auxiliary task loss between the corresponding auxiliary task prediction data among the n auxiliary task prediction data and the corresponding annotation data among the n annotation data of the training video about the corresponding auxiliary tasks, thereby obtaining n auxiliary task losses. Then, parameters of the plurality of feature extraction models, the feature fusion model, the main task model and the n auxiliary task models are adjusted according to the value obtained by summing the main task loss and the n auxiliary task losses with a predetermined weight ratio, and the video characterization model is trained. According to an embodiment of the present disclosure, in the case where the multi-modal information is image information, audio information and text information and there are 3 auxiliary tasks, there is one auxiliary task loss function for each auxiliary task, and the auxiliary task loss between the auxiliary task prediction data corresponding to that auxiliary task and the annotation data about that auxiliary task is calculated using the corresponding auxiliary task loss function. For example, the auxiliary task loss between the auxiliary task prediction data obtained by the model of the task based on the features of the image information and the annotation data of the training video regarding the task based on the features of the image information is calculated using the loss function of the task based on the features of the image information. In addition, when the main task loss and the n auxiliary task losses are summed with the predetermined weight ratio, an appropriate weight ratio may be set for the main task loss and the n auxiliary task losses as needed so that the video characterization model better characterizes the video. Here, the models to be adjusted also include the n auxiliary task models. In this way, the video characterization model is trained by the auxiliary tasks and the main task simultaneously, the capability of the video characterization model to extract features of the modality information related to each auxiliary task can be enhanced, and the resulting video characterization model can characterize the video more accurately.
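The weighted summation of the main task loss and the n auxiliary task losses described above can be sketched as follows; the use of cross-entropy losses and the particular weight values are illustrative assumptions only.

```python
# Illustrative loss combination for one main task and n auxiliary tasks.
import torch.nn.functional as F

def total_loss(main_logits, main_labels, aux_logits_list, aux_labels_list,
               main_weight=1.0, aux_weights=(0.5, 0.5, 0.5)):
    # Main task loss computed on the prediction from the multimodal fusion feature.
    loss = main_weight * F.cross_entropy(main_logits, main_labels)
    # One loss term per auxiliary task, each based on a single modality's feature.
    for logits, labels, w in zip(aux_logits_list, aux_labels_list, aux_weights):
        loss = loss + w * F.cross_entropy(logits, labels)
    return loss
```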
Fig. 5 is a flowchart illustrating a video classification method based on a video characterization model according to an exemplary embodiment of the present disclosure. As shown in fig. 5, in step S510, a video to be classified is acquired. Here, the video to be classified may be any video. As one example, the video to be classified may be a video containing multi-modality information. The video to be classified may also be video having only single-mode information. The video to be classified may be acquired by various methods.
In step S520, the video to be classified is input into a video characterization model, and the multi-modal fusion feature of the video to be classified is obtained from the video characterization model. Here, the video characterization model is a model trained by the training method shown in fig. 2. According to embodiments of the present disclosure, the video characterization model may be a model trained from video classification tasks as the primary tasks. In this way, the video characterization model is a model trained by taking the video classification task as a main task, so that parameters in the video characterization model are adjusted for the video classification task, and the video classification result using the video characterization model can be more accurate. According to the embodiment of the disclosure, in the case that the video to be classified is a video only with single-mode information, the video characterization model extracts features only based on the single-mode information and performs fusion, and at this time, the single-mode fusion features of the video to be classified are used as multi-mode fusion features obtained by the video characterization model.
In step S530, the video to be classified is classified based on the multimodal fusion feature obtained in step S520. Here, the video to be classified may be classified by various methods. According to the embodiment of the disclosure, the multi-mode fusion characteristic can be input into the video classification model, so that a classification result of the video to be classified is obtained. Here, the video classification model is a pre-trained model. Various video classification models may be used to classify videos to be classified. As one example, the video classification model may be a model that is used as a primary task model in training a video characterization model, such as a Softmax model.
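A minimal inference sketch for step S530 is given below, assuming the hypothetical VideoCharacterizationModel of the earlier sketch and a softmax classification head trained as the main task model; video_inputs is a hypothetical stand-in for the pre-processed modality features of the video to be classified.

```python
# Classify a video from its multimodal fusion feature (illustrative only).
import torch

model.eval()
with torch.no_grad():
    main_logits, _, fusion_feature = model(video_inputs)   # fusion_feature: (batch, dim)
    probs = torch.softmax(main_logits, dim=-1)              # class probabilities
    predicted_class = probs.argmax(dim=-1)                  # classification result
```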
According to the video classification method based on the video characterization model, the video characterization model is obtained through joint training with the downstream main task and auxiliary task. The video characterization model can therefore represent the video features more completely and fully, which allows more accurate video classification; even when the modal information of the video is seriously missing, the video features can still be represented reliably, so that accurate video classification can be performed.
Fig. 6 is a flowchart illustrating a video clustering recommendation method based on a video characterization model according to an exemplary embodiment of the present disclosure. As shown in fig. 6, in step S610, a video to be recommended is acquired. Here, the video to be recommended may be any video. As one example, the video to be recommended may be a video containing a plurality of modality information. The video to be recommended may also be a video having only single-mode information. The video to be recommended may be acquired by various methods.
In step S620, the video to be recommended is input into the video characterization model, and the multimodal fusion feature of the video to be recommended is obtained from the video characterization model. Here, the video characterization model is a model trained by the training method shown in fig. 2. According to embodiments of the present disclosure, the video characterization model may be a model trained from video cluster recommendation tasks as the primary tasks. In this way, the video characterization model is a model obtained by training with the video clustering recommendation task as a main task, so that parameters in the video characterization model are adjusted for the video clustering recommendation task, and the video clustering recommendation result using the video characterization model can be more accurate. According to the embodiment of the disclosure, in the case that the video to be recommended is a video only with single-mode information, the video characterization model extracts features only based on the single-mode information and performs fusion, and at this time, the multi-mode fusion features obtained by the video characterization model are the single-mode fusion features of the video to be recommended.
In step S630, the videos to be recommended are clustered and recommended based on the multimodal fusion feature obtained in step S620. Here, videos to be recommended may be clustered and recommended by various methods. According to the embodiment of the disclosure, the multi-mode fusion characteristic can be input into the video clustering recommendation model, so that a clustering recommendation result of the video to be recommended is obtained. Here, the video clustering recommendation model is a pre-trained model. Various video clustering recommendation models can be used for clustering and recommending videos to be recommended. As one example, the video clustering recommendation model may be a model that is used as a primary task model in training a video characterization model.
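One possible realization of step S630 is sketched below, under the assumption that scikit-learn k-means is used as the clustering model; get_fusion_feature and candidate_videos are hypothetical helpers that return the multi-modal fusion features of candidate videos, and the cluster count is arbitrary.

```python
# Cluster videos by their multimodal fusion features and recommend
# other videos from the same cluster (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

features = np.stack([get_fusion_feature(v) for v in candidate_videos])  # (N, dim)
cluster_ids = KMeans(n_clusters=20, n_init=10).fit_predict(features)

def recommend(query_index, top_k=10):
    # Recommend videos that fall in the same cluster as the query video.
    same_cluster = np.where(cluster_ids == cluster_ids[query_index])[0]
    return [i for i in same_cluster if i != query_index][:top_k]
```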
According to the video clustering recommendation method based on the video characterization model, the video characterization model is obtained through joint training with the downstream main task and auxiliary task. The video characterization model can therefore represent the video features more completely and fully, which allows more accurate video clustering recommendation; even when the modal information of the video is seriously missing, the video features can still be represented reliably, so that accurate video clustering recommendation can be performed.
Fig. 7 is a flowchart illustrating a video search method based on a video characterization model according to an exemplary embodiment of the present disclosure. As shown in fig. 7, text information for video search is acquired and features of the text information are calculated at step S710. Here, the text information used for the video search may be any text information input by the user. The text information for video search may be acquired through various methods. And the characteristics of the text information may be calculated by various methods. According to embodiments of the present disclosure, a model that extracts features of textual information may be utilized to derive the features of the textual information.
In step S720, the multimodal fusion feature of the video is obtained using the video characterization model. Here, the video characterization model is a model trained by the training method shown in fig. 2. According to embodiments of the present disclosure, the video characterization model may be a model trained from video search tasks as the primary tasks. In this way, the video characterization model is a model trained by taking the video searching task as a main task, so that parameters in the video characterization model are adjusted for the video searching task, and the video searching result using the video characterization model can be more accurate. Further, the video from which the multimodal fusion feature is obtained may be a video in a database that is searched. The video may be a video containing a plurality of types of modal information, or may be a video having only single-mode information (text information).
In step S730, the multimodal fusion feature and the feature of the text information for video search are compared, and a video with high similarity is determined as a search result. Here, the multi-modal fusion features of all videos in the searched database can be sequentially acquired through the video characterization model, and then sequentially compared with the features of the text information used for video searching, so that videos with higher similarity are obtained and used as search results.
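The comparison of step S730 can be sketched as a cosine-similarity ranking, as below; text_encoder and video_features are hypothetical stand-ins for the text feature model and the pre-computed multi-modal fusion features of the database videos.

```python
# Rank database videos by similarity to the query text (illustrative only).
import torch
import torch.nn.functional as F

query = F.normalize(text_encoder(query_text), dim=-1)   # (dim,)
videos = F.normalize(video_features, dim=-1)             # (N, dim)
similarity = videos @ query                               # cosine similarity, (N,)
top_scores, top_indices = similarity.topk(k=10)           # most similar videos as results
```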
According to the video searching method based on the video characterization model, the video characterization model is obtained through joint training with the downstream main task and auxiliary task, so the video features can be represented more completely and fully. In addition, the parameters in the video characterization model are obtained through optimization and can extract features more accurately, which makes the result of the video searching method based on the video characterization model more accurate.
Fig. 8 is a block diagram illustrating a training apparatus of a video characterization model according to an exemplary embodiment of the present disclosure. The training device is a device for realizing the training method of the video characterization model.
Here, the video characterization model may include a plurality of feature extraction models, and a feature fusion model, which correspond to the plurality of modality information, respectively. According to exemplary embodiments of the present disclosure, the plurality of modality information may include, but is not limited to, any two or more of image information extracted from video, text information, audio information, geographical location information, time information, and the like. The image information may include a cover image of the video, a plurality of key frame images in the video, and the like, the text information may include characters such as subtitles in the video, and the audio information may include audio such as various dubbing in the video. The feature extraction model is used to extract features from the modality information. The feature fusion model is used for combining a plurality of features into one feature with more discrimination capability than the input feature. It will be appreciated that the feature output by the feature fusion model is the output of the video characterization model. The video characterization model of the present disclosure may use various feature extraction models and feature fusion models in the related art.
As shown in fig. 8, the training apparatus 800 for a video characterization model includes: a first acquisition unit 810, a second acquisition unit 820, a feature extraction unit 830, a feature fusion unit 840, a primary task prediction unit 850, a secondary task prediction unit 860, and a training unit 870. The first acquisition unit 810 is configured to: and acquiring training videos, annotation data of the training videos about main tasks and annotation data of the training videos about auxiliary tasks. Here, the training video, the annotation data for the main task of the training video, and the annotation data for the auxiliary task of the training video are acquired as a training data set. The main task is a task based on the characteristics of the multi-modal information, and the auxiliary task is a task based on the characteristics of one of the multi-modal information.
According to an exemplary embodiment of the present disclosure, the primary task may be one of a video classification task, a video clustering recommendation task, and a video search task, and the secondary task may be one of a video classification task, a video clustering recommendation task, a video search task, a task of classifying audio information in a video, a task of inferring a context using text information.
The second acquisition unit 820 is configured to: the training video acquired from the first acquisition unit 810 acquires a plurality of modality information contained therein. Here, according to an embodiment of the present disclosure, in the case where the training video contains image information, audio information, and text information, then the image information, the audio information, and the text information may be acquired from the training video, respectively.
The feature extraction unit 830 is configured to: the plurality of modal information acquired by the second acquisition unit 820 are respectively input into the corresponding feature extraction models, and features of the plurality of modal information of the training video are extracted by the plurality of feature extraction models. Here, according to the embodiment of the present disclosure, in the case where acquired multi-modality information is image information, audio information, and text information, the image information may be input into a feature extraction model for extracting image features, the audio information may be input into a feature extraction model for extracting audio features, and the text information may be input into a feature extraction model for extracting text features, whereby features of the image information, features of the audio information, and features of the text information are obtained by the three feature extraction models, respectively.
The feature fusion unit 840 is configured to: features of the multiple modal information extracted by the feature extraction unit 830 are input into a feature fusion model, and the multi-modal fusion features of the training video are obtained by the feature fusion model. Here, according to the embodiment of the present disclosure, in the case where the features of the plurality of modality information are the features of the image information, the features of the audio information, and the features of the text information, the features of the image information, the features of the audio information, and the features of the text information may all be input to the feature fusion model.
Further, when the features of the various modality information are extracted by feature extraction models with different properties, the extracted features may lie in inconsistent feature spaces. Thus, according to an exemplary embodiment of the present disclosure, the feature fusion unit 840 is configured to: before the features of the multi-modal information are input into the feature fusion model, map the features of the multi-modal information to the same feature space respectively, and then input the mapped features of the multi-modal information into the feature fusion model. As an example, the features of the multi-modal information may be mapped to the same feature space through a plurality of self-attention (Self Attention) mechanisms, respectively, and the mapped features are then fused by the feature fusion model. In this way, the features of the plurality of modal information can be fused more effectively.
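A minimal sketch of this mapping is given below, with one self-attention block per modality projecting into a shared feature space; the dimensions and module structure are illustrative assumptions rather than the disclosed implementation.

```python
# Map each modality's feature into a shared space before fusion (illustrative only).
import torch
import torch.nn as nn

class ModalityMapper(nn.Module):
    def __init__(self, in_dim, shared_dim=256, heads=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, shared_dim)
        self.attn = nn.MultiheadAttention(shared_dim, heads, batch_first=True)

    def forward(self, x):                    # x: (batch, tokens, in_dim)
        h = self.proj(x)
        out, _ = self.attn(h, h, h)          # self-attention within one modality
        return out.mean(dim=1)               # (batch, shared_dim)

# One mapper per modality; input dimensions are assumed example values.
mappers = nn.ModuleDict({
    "image": ModalityMapper(2048),
    "audio": ModalityMapper(128),
    "text": ModalityMapper(768),
})
```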
The main task prediction unit 850 is configured to: the multimodal fusion features obtained by the feature fusion unit 840 are input into a primary task model, from which primary task prediction data of the training video is obtained. Here, in case that the primary task is a video classification task, the primary task model may be a model for video classification, and the primary task prediction data may be a result obtained by classification by the model for video classification, according to an embodiment of the present disclosure. The main task is a task based on characteristics of various modal information. That is, in the case where the plurality of modality information is image information, audio information, and text information, the main task may be performed based on the features of the image information, the features of the audio information, and the features of the text information.
The auxiliary task prediction unit 860 is configured to: and inputting the characteristic of one mode information related to the auxiliary task in the characteristics of the multi-mode information of the training video extracted by the plurality of characteristic extraction models into an auxiliary task model, and obtaining auxiliary task prediction data of the training video by the auxiliary task model. Here, one type of modality information related to the auxiliary task is one type of modality information among a plurality of types of modality information required for executing the auxiliary task. According to the embodiment of the disclosure, in the case that the multi-modal information can be image information, audio information and text information, if the auxiliary task is a task based on the characteristics of the image information, the characteristics of the image information are input into an auxiliary task model, and auxiliary task prediction data of a training video is obtained by the auxiliary task model. Thus, the capability of the video characterization model to extract features of image information can be enhanced by the task (auxiliary task) based on the features of the image information. Similarly, in the case where the auxiliary task is a task based on the characteristics of the audio information, the capability of the video characterization model to extract the characteristics of the audio information can be enhanced.
According to an embodiment of the present disclosure, in a case where the auxiliary task is a video classification task based on features of image information, the auxiliary task model may be a video classification model that performs video classification based on the features of the image information, and the auxiliary task prediction data may be a result of the auxiliary task model performing video classification based on the features of the image information. In the case where the primary task is also a video classification task, the secondary task prediction data may be the same as or different from the primary task prediction data. As described above, the auxiliary task may be one of a video classification task based on characteristics of any one of a plurality of modality information, a video clustering recommendation task, a video search task, a task of classifying audio information in a video, and a task of inferring a context using text information. Thus, the auxiliary task model is a model for realizing corresponding tasks based on the characteristics of the corresponding modal information, and corresponding auxiliary task prediction data is obtained.
The training unit 870 is configured to: the parameters of the feature extraction model, the feature fusion model, the main task model and the auxiliary task model are adjusted based on the main task prediction data obtained by the main task prediction unit 850 and the labeling data about the main task of the training video obtained by the first acquisition unit 810, the auxiliary task prediction data obtained by the auxiliary task prediction unit 860 and the labeling data about the auxiliary task of the training video obtained by the first acquisition unit 810, and the video characterization model is trained.
According to an example embodiment of the present disclosure, training unit 870 may be configured to: and calculating the main task loss between the main task prediction data and the labeling data of the training video about the main task by using the main task loss function, and calculating the auxiliary task loss between the auxiliary task prediction data and the labeling data of the training video about the auxiliary task by using the auxiliary task loss function. And then, according to the main task loss and the auxiliary task loss, adjusting parameters of a plurality of feature extraction models, a feature fusion model, a main task model and an auxiliary task model according to a value obtained by summing the main task loss and the auxiliary task loss according to a preset weight ratio, and training the video representation model.
Here, the primary task loss function is a loss function for a primary task, and a loss function corresponding to a primary task may be used and may be different depending on the primary task. The auxiliary task loss function is a loss function for an auxiliary task, and a loss function corresponding to an auxiliary task may be used and may be different according to the auxiliary task.
After the primary and secondary task losses are calculated, the primary and secondary task losses are summed at a predetermined weight ratio (e.g., 1:0.5) to yield losses for the video characterization model. Here, the weight ratio may be adjusted as needed.
The parameters in each model are then updated by back-propagating gradients so as to minimize this loss, until the video characterization model converges.
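A minimal training step under these rules might look as follows, reusing the hypothetical model and total_loss from the earlier sketches and assuming an Adam optimizer, a hypothetical dataloader, and the 1:0.5 weight ratio mentioned above; none of these choices is mandated by the disclosure.

```python
# One illustrative training loop over the joint main/auxiliary objective.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for batch in dataloader:
    main_logits, aux_logits, _ = model(batch["inputs"])
    loss = total_loss(main_logits, batch["main_labels"],
                      [aux_logits], [batch["aux_labels"]],
                      main_weight=1.0, aux_weights=(0.5,))
    optimizer.zero_grad()
    loss.backward()       # back-propagate gradients through every sub-model
    optimizer.step()      # update extractors, fusion model, and both task heads
```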
According to the training device of the video characterization model, the model training is performed by performing proper main tasks and auxiliary tasks at the downstream of the video characterization model, so that a better video characterization model can be obtained. By utilizing the video characterization model, the video characteristics can be more completely and fully characterized, and when the modal information of the video is seriously lost, the video characteristics can be generated more reliably by more depending on the modal information related to the auxiliary task, so that the business capability of the downstream task can be effectively improved.
According to an exemplary embodiment of the present disclosure, the auxiliary tasks employed in the training apparatus of the video characterization model shown in fig. 8 may include n auxiliary tasks, where n satisfies 1 ≤ n ≤ the number of kinds of modality information among the plurality of modality information.
In the case where the auxiliary tasks include n auxiliary tasks, the first acquisition unit 810 is configured to: the method comprises the steps of obtaining a training video, annotation data of the training video about a main task and n annotation data of the training video about corresponding auxiliary tasks in the n auxiliary tasks respectively. Here, the n auxiliary tasks may be the same auxiliary task, for example, may be video classification tasks. But the modality information on which each auxiliary task is based is different. That is, the n auxiliary tasks are tasks based on features of different modality information among the plurality of modality information, respectively. For example, in the case where the multi-modality information is image information, audio information, text information, the n auxiliary tasks may be 3 auxiliary tasks, and the 3 auxiliary tasks may be a task based on a feature of the image information, a task based on a feature of the audio information, and a task based on a feature of the text information, respectively. In this case, the n annotation data of the training video regarding the corresponding auxiliary task of the n auxiliary tasks may be annotation data of the training video regarding the task based on the feature of the image information, annotation data of the training video regarding the task based on the feature of the audio information, and annotation data of the training video regarding the task based on the feature of the text information. It will be appreciated that in the case where the 3 auxiliary tasks are video classification tasks, the 3 annotation data for the 3 auxiliary tasks may be the same or different, respectively.
The auxiliary task prediction unit 860 is configured to: the method comprises the steps of inputting the features of n kinds of modal information respectively related to n kinds of auxiliary tasks in the features of a plurality of modal information of a training video extracted by a plurality of feature extraction models into corresponding auxiliary task models in the n auxiliary task models, and obtaining n kinds of auxiliary task prediction data of the training video by the n auxiliary task models. According to the embodiment of the disclosure, in the case that the multi-modal information is image information, audio information, text information, and n is 3, the features of the image information, the audio information, and the text information respectively related to the 3 auxiliary tasks may be respectively input into the respectively corresponding auxiliary task models. Specifically, the features of the image information may be input into a model of the task based on the features of the image information, the features of the audio information may be input into a model of the task based on the features of the audio information, and the features of the text information may be input into a model of the task based on the features of the text information. Thus, 3 auxiliary task prediction data are obtained through the three auxiliary task models.
The training unit 870 is configured to: the parameters of the feature extraction models, the feature fusion model, the main task model and the n auxiliary task models are adjusted based on the main task prediction data, the labeling data of the training video about the main task, the n auxiliary task prediction data obtained by the auxiliary task prediction unit 860, and the n labeling data of the training video about the corresponding auxiliary task in the n auxiliary tasks, respectively, obtained by the first obtaining unit 810, so as to train the video characterization model.
According to an example embodiment of the present disclosure, training unit 870 may be configured to: and calculating main task losses between main task prediction data and labeling data of the training video, which are related to main tasks, by using the main task loss function, and calculating auxiliary task losses between corresponding auxiliary task prediction data in the n auxiliary task prediction data and corresponding labeling data in the n labeling data of the training video, which are related to the corresponding auxiliary tasks in the n auxiliary tasks, respectively, by using the n auxiliary task loss functions, so as to obtain n auxiliary task losses. And then, according to the values obtained by summing the main task loss and the n auxiliary task losses according to the preset weight proportion, adjusting parameters of a plurality of feature extraction models, a feature fusion model, a main task model and n auxiliary task models, and training the video characterization model. According to an embodiment of the present disclosure, in the case where the multi-modality information is image information, audio information, text information, and 3 auxiliary tasks, there is one auxiliary task loss function for each auxiliary task, and auxiliary task losses between auxiliary task prediction data corresponding to the auxiliary task and one annotation data regarding the auxiliary task are calculated using the auxiliary task loss function. For example, auxiliary task loss between auxiliary task prediction data obtained by a model of a task based on features of image information and annotation data about the task based on the features of image information is calculated using a loss function of the task based on the features of image information. In addition, when the main task loss and the n auxiliary task losses are summed with a predetermined weight ratio, an appropriate weight ratio may be set for the main task loss and the n auxiliary task losses as needed to make the video characterization model better characterize the video. Here, the model to be adjusted also includes n auxiliary task models. Therefore, the video characterization model is trained by utilizing the auxiliary tasks and the main tasks at the same time, the extraction capability of the video characterization model for the characteristics of the mode information related to each auxiliary task can be enhanced, and the obtained video characterization model can more accurately characterize the video.
Fig. 9 is a block diagram illustrating a video classification device based on a video characterization model according to an exemplary embodiment of the present disclosure. The video classification apparatus is an apparatus for implementing the video classification method shown in fig. 5. As shown in fig. 9, the video classification apparatus 900 includes an acquisition unit 910, a video characterization unit 920, and a video classification unit 930. Wherein the acquisition unit 910 is configured to: and acquiring videos to be classified. Here, the video to be classified may be any video. As one example, the video to be classified may be a video containing multi-modality information. The video to be classified may also be video having only single-mode information.
The video characterization unit 920 is configured to: inputting the video to be classified into a video characterization model, and obtaining the multi-mode fusion characteristics of the video to be classified by the video characterization model. Here, the video characterization model is a model trained by the training method shown in fig. 2. According to embodiments of the present disclosure, the video characterization model may be a model trained from video classification tasks as the primary tasks. In this way, the video characterization model is a model trained by taking the video classification task as a main task, so that parameters in the video characterization model are adjusted for the video classification task, and the video classification result using the video characterization model can be more accurate.
The video classification unit 930 is configured to: the video to be classified is classified based on the multimodal fusion feature obtained by the video characterization unit 920. Here, the video to be classified may be classified by various methods. According to the embodiment of the disclosure, the multi-mode fusion characteristic can be input into the video classification model, and the classification result of the video to be classified is obtained by the video classification model. Here, the video classification model is a pre-trained model. Various video classification models may be used to classify videos to be classified. As one example, the video classification model may be a model that is used as a primary task model in training a video characterization model, such as a Softmax model.
According to the video classification device based on the video characterization model, the video characterization model is obtained through joint training with the downstream main task and auxiliary task. The video characterization model can therefore represent the video features more completely and fully, which allows more accurate video classification; even when the modal information of the video is seriously missing, the video features can still be represented reliably, so that accurate video classification can be performed.
Fig. 10 is a block diagram illustrating a video clustering recommendation device based on a video characterization model according to an exemplary embodiment of the present disclosure. The video cluster recommendation device is a device for realizing the video cluster recommendation method shown in fig. 6. As shown in fig. 10, the video cluster recommendation apparatus 1000 includes: an acquisition unit 1010, a video characterization unit 1020, and a recommendation unit 1030. Wherein the acquisition unit 1010 is configured to: and acquiring the video to be recommended. Here, the video to be recommended may be any video. As one example, the video to be recommended may be a video containing a plurality of modality information. The video to be recommended may also be a video having only single-mode information.
The video characterization unit 1020 is configured to: inputting the video to be recommended into a video characterization model, and obtaining the multi-mode fusion characteristics of the video to be recommended by the video characterization model. Here, the video characterization model is a model trained by the training method shown in fig. 2. According to embodiments of the present disclosure, the video characterization model may be a model trained from video cluster recommendation tasks as the primary tasks. In this way, the video characterization model is a model obtained by training with the video clustering recommendation task as a main task, so that parameters in the video characterization model are adjusted for the video clustering recommendation task, and the video clustering recommendation result using the video characterization model can be more accurate.
The recommending unit 1030 is configured to: clustering and recommending the video to be recommended based on the multi-mode fusion characteristics obtained by the video characterization unit 1020. Here, videos to be recommended may be clustered and recommended by various methods. According to the embodiment of the disclosure, the multi-mode fusion characteristic can be input into the video clustering recommendation model, and the clustering recommendation result of the video to be recommended is obtained by the video clustering recommendation model. Here, the video clustering recommendation model is a pre-trained model. Various video clustering recommendation models can be used for clustering and recommending videos to be recommended.
According to the video clustering recommendation device based on the video characterization model, the video characterization model is obtained through joint training with the downstream main task and auxiliary task. The video characterization model can therefore represent the video features more completely and fully, which allows more accurate video clustering recommendation; even when the modal information of the video is seriously missing, the video features can still be represented reliably, so that accurate video clustering recommendation can be performed.
Fig. 11 is a block diagram illustrating a video search device based on a video characterization model according to an exemplary embodiment of the present disclosure. The video search apparatus is an apparatus that implements the video search method shown in fig. 7. As shown in fig. 11, the video search apparatus 1100 includes: a first acquisition unit 1110, a second acquisition unit 1120, and a comparison determination unit 1130. Wherein the first acquisition unit 1110 is configured to: text information for video searching is acquired and features of the text information are calculated. Here, the text information used for the video search may be any text information input by the user. The text information for video search may be acquired through various methods. And the characteristics of the text information may be calculated by various methods. According to embodiments of the present disclosure, a model that extracts features of textual information may be utilized to derive the features of the textual information.
The second acquisition unit 1120 is configured to: and acquiring the multi-mode fusion characteristics of the video by using the video characterization model. Here, the video characterization model is a model trained by the training method shown in fig. 2. According to embodiments of the present disclosure, the video characterization model may be a model trained from video search tasks as the primary tasks. In this way, the video characterization model is a model trained by taking the video searching task as a main task, so that parameters in the video characterization model are adjusted for the video searching task, and the video searching result using the video characterization model can be more accurate. Further, the video from which the multimodal fusion feature is obtained may be a video in a database that is searched. The video may be a video containing a plurality of types of modal information, or may be a video having only single-mode information (text information).
The comparison determination unit 1130 is configured to: and comparing the multimodal fusion characteristics with characteristics of text information used for video searching, and determining videos with high similarity as search results. Here, the multi-modal fusion features of all videos in the searched database can be sequentially acquired through the video characterization model, and then sequentially compared with the features of the text information used for video searching, so that videos with higher similarity are obtained and used as search results.
According to the video searching device based on the video characterization model, the video characterization model is obtained through joint training with the downstream main task and auxiliary task, so the video features can be represented more completely and fully. In addition, the parameters in the video characterization model are obtained through optimization and can extract features more accurately, which makes the result of the video search based on the video characterization model more accurate.
It should be appreciated that the various apparatuses according to the exemplary embodiments of the present disclosure may perform the corresponding methods described above; to avoid repetition, those details are not described again here, and only the device-side description is retained.
Fig. 12 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure. The electronic device 1200 includes at least one memory 1210 and at least one processor 1220, the at least one memory 1210 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 1220, perform the training method of the video characterization model, the video classification method, the video cluster recommendation method, or the video search method according to the exemplary embodiments of the present disclosure.
By way of example, the electronic device 1200 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the electronic device 1200 is not necessarily a single electronic device, but may be any apparatus or a collection of circuits capable of executing the above-described instructions (or instruction set) individually or in combination. The electronic device 1200 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
In electronic device 1200, processor 1220 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 1220 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and so forth.
Processor 1220 may execute instructions or code stored in memory, where memory 1210 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 1210 may be integrated with the processor 1220, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, memory 1210 may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system. The memory 1210 and the processor 1220 may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., so that the processor 1220 can read files stored in the memory 1210.
In addition, the electronic device 1200 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein the instructions in the computer-readable storage medium, when executed by the at least one processor, cause the at least one processor to perform the training method of the video characterization model, the video classification method, the video cluster recommendation method, or the video search method of the exemplary embodiments of the present disclosure. Examples of the computer readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid state disks (SSD), card-type memories (such as multimedia cards, Secure Digital (SD) cards or eXtreme Digital (xD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, hard disks, solid state disks, and any other devices configured to store computer programs and any associated data, data files and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files and data structures to a processor or computer so that the processor or computer can execute the programs. The computer programs in the computer readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, server, etc.; further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product is provided, comprising computer instructions which, when executed by a processor, implement a training method or a video classification method or a video cluster recommendation method or a video search method of the video characterization model of the exemplary embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (27)

1. A training method of a video characterization model, wherein the video characterization model includes a plurality of feature extraction models and a feature fusion model respectively corresponding to a plurality of modal information, the training method comprising:
acquiring a training video, annotation data of the training video about a main task and annotation data of the training video about an auxiliary task, wherein the main task is a task based on characteristics of the multi-modal information, and the auxiliary task is a task based on characteristics of one modal information in the multi-modal information;
acquiring the multi-modal information contained in the training video from the training video;
respectively inputting the multiple modal information into corresponding feature extraction models, and extracting the features of the multiple modal information of the training video by the multiple feature extraction models;
inputting the characteristics of the multi-mode information into the characteristic fusion model, and obtaining multi-mode fusion characteristics of the training video by the characteristic fusion model;
Inputting the multi-mode fusion characteristics into a main task model, and obtaining main task prediction data of the training video by the main task model;
Inputting the characteristic of one mode information related to the auxiliary task in the characteristics of the multiple mode information of the training video extracted by the multiple characteristic extraction models into an auxiliary task model, and obtaining auxiliary task prediction data of the training video by the auxiliary task model;
Adjusting parameters of the plurality of feature extraction models, the feature fusion model, the primary task model and the auxiliary task model based on the primary task prediction data and the annotation data of the training video about the primary task, the auxiliary task prediction data and the annotation data of the training video about the auxiliary task, and training the video characterization model;
wherein the step of adjusting parameters of the plurality of feature extraction models, the feature fusion model, the primary task model and the auxiliary task model based on the primary task prediction data and the annotation data of the training video about the primary task, the auxiliary task prediction data and the annotation data of the training video about the auxiliary task, and training the video characterization model comprises:
calculating main task loss between the main task prediction data and the labeling data of the training video about main tasks by using a main task loss function, and calculating auxiliary task loss between the auxiliary task prediction data and the labeling data of the training video about auxiliary tasks by using an auxiliary task loss function;
Summing the primary task loss and the auxiliary task loss according to a preset weight proportion to obtain the loss of the video characterization model;
adjusting parameters of the feature extraction models, the feature fusion model, the main task model and the auxiliary task model according to the loss of the video characterization model, and training the video characterization model;
wherein the primary task and the auxiliary task are of the same type.
2. The training method of claim 1, wherein the plurality of modal information includes any two or more of image information, text information, audio information, geographical location information, and time information.
3. The training method of claim 2, wherein the primary task is one of a video classification task, a video cluster recommendation task, and a video search task.
4. The training method of claim 3, wherein the auxiliary task is one of a video classification task, a video cluster recommendation task, a video search task, a task of classifying audio information in video, and a task of using text information to infer a context.
5. The training method of claim 1, wherein the auxiliary tasks include n auxiliary tasks, wherein n satisfies 1 ≤ n ≤ the number of categories of modality information in the plurality of modality information.
6. The training method according to claim 1, wherein before inputting the features of the plurality of modal information into the feature fusion model, the features of the plurality of modal information are mapped to the same feature space, respectively, and then the mapped features of the plurality of modal information are input into the feature fusion model.
7. The training method of claim 4, wherein the plurality of modality information includes at least image information, and the auxiliary task is a task based on characteristics of the image information.
8. The training method of claim 4, wherein the plurality of modal information includes at least audio information, and the auxiliary task is a task based on characteristics of the audio information.
9. A video classification method based on a video characterization model, comprising:
Acquiring videos to be classified;
Inputting the video to be classified into a video characterization model trained by the training method according to any one of claims 1 to 8, and obtaining multi-modal fusion characteristics of the video to be classified from the video characterization model;
and classifying the videos to be classified based on the multi-mode fusion features.
10. The video classification method according to claim 9, wherein the step of classifying the video to be classified based on the multimodal fusion feature comprises:
And inputting the multi-mode fusion features into a video classification model to obtain a classification result of the video to be classified.
11. A video clustering recommendation method based on a video characterization model, comprising:
acquiring a video to be recommended;
inputting the video to be recommended into a video characterization model trained by the training method according to any one of claims 1 to 8, and obtaining multi-modal fusion characteristics of the video to be recommended from the video characterization model;
and clustering and recommending the video to be recommended based on the multi-mode fusion characteristics.
12. A video search method based on a video characterization model, comprising:
Acquiring text information for video searching, and calculating characteristics of the text information;
Acquiring multi-modal fusion features of a video by using a video characterization model trained by a training method according to any one of claims 1 to 8;
And comparing the multimodal fusion characteristics with the characteristics of the text information, and determining the video with high similarity as a search result.
13. A training device for a video characterization model, wherein the video characterization model comprises a plurality of feature extraction models respectively corresponding to a plurality of kinds of modal information, and a feature fusion model, the training device comprising:
a first acquisition unit configured to: acquire a training video, annotation data of the training video about a main task, and annotation data of the training video about an auxiliary task, wherein the main task is a task based on features of the plurality of kinds of modal information, and the auxiliary task is a task based on a feature of one kind of modal information among the plurality of kinds of modal information;
a second acquisition unit configured to: acquire, from the training video, the plurality of kinds of modal information contained in the training video;
a feature extraction unit configured to: input the plurality of kinds of modal information into the corresponding feature extraction models respectively, and extract the features of the plurality of kinds of modal information of the training video by the plurality of feature extraction models;
a feature fusion unit configured to: input the features of the plurality of kinds of modal information into the feature fusion model, and obtain multi-modal fusion features of the training video from the feature fusion model;
a main task prediction unit configured to: input the multi-modal fusion features into a main task model, and obtain main task prediction data of the training video from the main task model;
an auxiliary task prediction unit configured to: input, into an auxiliary task model, the feature of the modal information related to the auxiliary task among the features of the plurality of kinds of modal information extracted by the plurality of feature extraction models, and obtain auxiliary task prediction data of the training video from the auxiliary task model;
a training unit configured to: adjust parameters of the plurality of feature extraction models, the feature fusion model, the main task model, and the auxiliary task model based on the main task prediction data and the annotation data of the training video about the main task, and the auxiliary task prediction data and the annotation data of the training video about the auxiliary task, to train the video characterization model;
wherein the training unit is configured to:
calculate a main task loss between the main task prediction data and the annotation data of the training video about the main task by using a main task loss function, and calculate an auxiliary task loss between the auxiliary task prediction data and the annotation data of the training video about the auxiliary task by using an auxiliary task loss function;
sum the main task loss and the auxiliary task loss according to a preset weight proportion to obtain a loss of the video characterization model;
adjust the parameters of the plurality of feature extraction models, the feature fusion model, the main task model, and the auxiliary task model according to the loss of the video characterization model, to train the video characterization model;
wherein the main task and the auxiliary task are of the same type.
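The weighted combination of the main task loss and the auxiliary task loss described for the training unit can be sketched as follows. The specific loss functions and the 0.7/0.3 weight split are illustrative assumptions rather than values fixed by the claims.

```python
import torch
import torch.nn as nn

# assumed loss functions for a main classification task and an auxiliary classification task
main_loss_fn = nn.CrossEntropyLoss()
aux_loss_fn = nn.CrossEntropyLoss()

def video_characterization_loss(main_pred, main_label, aux_pred, aux_label,
                                main_weight: float = 0.7, aux_weight: float = 0.3):
    """Sum the main task loss and the auxiliary task loss with preset weights
    to obtain the loss of the video characterization model (weights assumed)."""
    main_loss = main_loss_fn(main_pred, main_label)
    aux_loss = aux_loss_fn(aux_pred, aux_label)
    return main_weight * main_loss + aux_weight * aux_loss

# loss = video_characterization_loss(main_pred, main_label, aux_pred, aux_label)
# loss.backward()  # gradients flow into the feature extraction, fusion, and task models
```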
14. The training device of claim 13, wherein the plurality of kinds of modal information comprises any two or more of image information, text information, audio information, geographic location information, and time information.
15. The training device of claim 14, wherein the main task is one of a video classification task, a video clustering recommendation task, and a video search task.
16. The training device of claim 15, wherein the auxiliary task is one of a video classification task, a video clustering recommendation task, a video search task, a task of classifying audio information in a video, and a task of inferring context from text information.
17. The training device of claim 13, wherein the auxiliary tasks comprise n auxiliary tasks, where 1 ≤ n ≤ the number of kinds of modal information in the plurality of kinds of modal information.
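As an illustration of claim 17, a training setup might attach one auxiliary head per selected modality, with the number of auxiliary tasks bounded by the number of modality types; the modality names, head construction, and dimensions below are assumptions made for illustration only.

```python
import torch.nn as nn

MODALITIES = ["image", "audio", "text"]  # assumed modality types present in the video

def build_auxiliary_heads(selected: list[str], feature_dim: int = 256, num_classes: int = 20):
    """Create one auxiliary task head per selected modality, enforcing
    1 <= n <= number of modality types (all values here are illustrative)."""
    assert 1 <= len(selected) <= len(MODALITIES), "n must satisfy 1 <= n <= number of modalities"
    return nn.ModuleDict({m: nn.Linear(feature_dim, num_classes) for m in selected})

# e.g. two auxiliary tasks, one on image features and one on audio features
aux_heads = build_auxiliary_heads(["image", "audio"])
```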
18. The training device of claim 13, wherein the feature fusion unit is configured to: before the features of the plurality of kinds of modal information are input into the feature fusion model, map the features of the plurality of kinds of modal information into the same feature space respectively, and then input the mapped features of the plurality of kinds of modal information into the feature fusion model.
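A minimal sketch of the mapping step in claim 18 is given below: each modality's feature is projected into a shared feature space by a per-modality linear layer before fusion. The shared dimension, the concatenation-based fusion, and the input sizes are assumptions, not part of the claimed device.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Map per-modality features of different sizes into one shared feature
    space before fusion (input sizes and shared dimension are assumed)."""

    def __init__(self, input_dims: dict[str, int], shared_dim: int = 256):
        super().__init__()
        self.projections = nn.ModuleDict(
            {name: nn.Linear(dim, shared_dim) for name, dim in input_dims.items()}
        )

    def forward(self, features: dict[str, torch.Tensor]) -> torch.Tensor:
        # project each modality's feature into the shared space
        projected = [self.projections[name](feat) for name, feat in features.items()]
        # a simple fusion: concatenate the aligned features along the channel axis
        return torch.cat(projected, dim=-1)

# e.g. image features of size 2048, audio 128, text 768, all mapped to 256 dimensions
projector = ModalityProjector({"image": 2048, "audio": 128, "text": 768})
```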
19. The training device of claim 16, wherein the plurality of kinds of modal information includes at least image information, and the auxiliary task is a task based on a feature of the image information.
20. The training device of claim 16, wherein the plurality of kinds of modal information includes at least audio information, and the auxiliary task is a task based on a feature of the audio information.
21. A video classification device based on a video characterization model, comprising:
an acquisition unit configured to: acquire a video to be classified;
a video characterization unit configured to: input the video to be classified into a video characterization model trained by the training method according to any one of claims 1 to 8, and obtain multi-modal fusion features of the video to be classified from the video characterization model;
a video classification unit configured to: classify the video to be classified based on the multi-modal fusion features.
22. The video classification device of claim 21, wherein the video classification unit is configured to:
input the multi-modal fusion features into a video classification model to obtain a classification result of the video to be classified.
23. A video clustering recommendation device based on a video characterization model, comprising:
an acquisition unit configured to: acquire a video to be recommended;
a video characterization unit configured to: input the video to be recommended into a video characterization model trained by the training method according to any one of claims 1 to 8, and obtain multi-modal fusion features of the video to be recommended from the video characterization model;
a recommendation unit configured to: cluster and recommend the video to be recommended based on the multi-modal fusion features.
24. A video search device based on a video characterization model, comprising:
a first acquisition unit configured to: acquire text information for video search, and calculate a feature of the text information;
a second acquisition unit configured to: acquire multi-modal fusion features of a video by using a video characterization model trained by the training method according to any one of claims 1 to 8;
a comparison determination unit configured to: compare the multi-modal fusion features with the feature of the text information, and determine a video with high similarity as a search result.
25. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the training method of the video characterization model of any one of claims 1 to 8, the video classification method of claim 9 or 10, the video clustering recommendation method of claim 11, or the video search method of claim 12.
26. A computer-readable storage medium, characterized in that instructions in the computer-readable storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the training method of the video characterization model of any one of claims 1 to 8, the video classification method of claim 9 or 10, the video clustering recommendation method of claim 11, or the video search method of claim 12.
27. A computer program product comprising computer instructions which, when executed by a processor, implement the training method of the video characterization model according to any one of claims 1 to 8, the video classification method according to claim 9 or 10, the video clustering recommendation method according to claim 11, or the video search method according to claim 12.
CN202110799842.8A 2021-07-15 Training method and training device for video characterization model Active CN113343936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110799842.8A CN113343936B (en) 2021-07-15 Training method and training device for video characterization model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110799842.8A CN113343936B (en) 2021-07-15 Training method and training device for video characterization model

Publications (2)

Publication Number Publication Date
CN113343936A CN113343936A (en) 2021-09-03
CN113343936B true CN113343936B (en) 2024-07-12

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664999A (en) * 2018-05-03 2018-10-16 北京图森未来科技有限公司 A kind of training method and its device, computer server of disaggregated model
CN113094549A (en) * 2021-06-10 2021-07-09 智者四海(北京)技术有限公司 Video classification method and device, electronic equipment and storage medium
CN113762322A (en) * 2021-04-22 2021-12-07 腾讯科技(北京)有限公司 Video classification method, device and equipment based on multi-modal representation and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant