CN116975615A - Task prediction method and device based on video multi-mode information - Google Patents

Task prediction method and device based on video multi-mode information

Info

Publication number
CN116975615A
Authority
CN
China
Prior art keywords
task
prediction
features
training
text
Prior art date
Legal status
Pending
Application number
CN202211422492.4A
Other languages
Chinese (zh)
Inventor
刘刚
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211422492.4A
Publication of CN116975615A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/30 Semantic analysis
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application relates to a task prediction method, apparatus, computer device, storage medium and computer program product based on video multimodal information. The method comprises: acquiring multimodal information of a video; inputting the multimodal information into the feature extraction networks of the respective dimensions of a target task model to obtain text features of at least one dimension and image features of at least one dimension; and outputting, through the task prediction network of the target task model, a prediction result of the target task according to the text features of at least one dimension and the image features of at least one dimension. The target task model is obtained by fine-tuning a pre-trained multimodal pre-training model; the multimodal pre-training model is obtained by pre-training a model on a plurality of prediction tasks using multimodal pre-training data, and the weak supervision signal of each prediction task is determined based on the multimodal information of the videos in the pre-training data. The method improves the efficiency of bringing the model online.

Description

Task prediction method and device based on video multi-mode information
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a task prediction method, apparatus, computer device, storage medium, and computer program product based on video multimodal information.
Background
With the rapid development of the internet, machine learning technology is continuously updated and evolving. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied throughout the various fields of artificial intelligence. An existing machine learning model iteration process typically includes the following steps: requirement determination -> data acquisition -> model training -> model testing and deployment.
Training a neural network model requires collecting a large amount of data and a large amount of training time. For many models in an information feed service, samples are very sparse within the service content, a large number of samples are difficult to collect and accumulate, and the collection and labeling costs are high. As a result, neural network model training is inefficient, which affects the efficiency of bringing models online.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a task prediction method, apparatus, computer device, computer readable storage medium, and computer program product based on video multimodal information that can improve training efficiency.
In a first aspect, the present application provides a task prediction method based on video multimodal information. The method comprises the following steps:
Acquiring multi-modal information of a video, wherein the multi-modal information comprises text modal information of at least one dimension and image modal information of at least one dimension;
respectively inputting the multi-mode information of the video into feature extraction networks of each dimension of a target task model to obtain text features of at least one dimension and image features of at least one dimension, and outputting a prediction result of a target task according to the text features of at least one dimension and the image features of at least one dimension through a task prediction network of the target task model;
the target task model is obtained by fine tuning a multi-mode pre-training model after pre-training, the multi-mode pre-training model is obtained by pre-training a plurality of prediction tasks on the pre-training model by utilizing multi-mode pre-training data, and weak supervision signals of the prediction tasks are determined based on multi-mode information of videos in the pre-training data.
In a second aspect, the application further provides a task prediction device based on the video multi-mode information. The device comprises:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring multi-modal information of a video, and the multi-modal information comprises text modal information of at least one dimension and image modal information of at least one dimension;
The prediction module is used for respectively inputting the multi-mode information of the video into the feature extraction network of each dimension of the target task model to obtain text features of at least one dimension and image features of at least one dimension, and outputting a prediction result of the target task according to the text features of at least one dimension and the image features of at least one dimension through the task prediction network of the target task model;
the target task model is obtained by fine tuning a multi-mode pre-training model after pre-training, the multi-mode pre-training model is obtained by pre-training a plurality of prediction tasks on the pre-training model by utilizing multi-mode pre-training data, and weak supervision signals of the prediction tasks are determined based on multi-mode information of videos in the pre-training data.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring multi-modal information of a video, wherein the multi-modal information comprises text modal information of at least one dimension and image modal information of at least one dimension;
Respectively inputting the multi-mode information of the video into feature extraction networks of each dimension of a target task model to obtain text features of at least one dimension and image features of at least one dimension, and outputting a prediction result of a target task according to the text features of at least one dimension and the image features of at least one dimension through a task prediction network of the target task model;
the target task model is obtained by fine tuning a multi-mode pre-training model after pre-training, the multi-mode pre-training model is obtained by pre-training a plurality of prediction tasks on the pre-training model by utilizing multi-mode pre-training data, and weak supervision signals of the prediction tasks are determined based on multi-mode information of videos in the pre-training data.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring multi-modal information of a video, wherein the multi-modal information comprises text modal information of at least one dimension and image modal information of at least one dimension;
Respectively inputting the multi-mode information of the video into feature extraction networks of each dimension of a target task model to obtain text features of at least one dimension and image features of at least one dimension, and outputting a prediction result of a target task according to the text features of at least one dimension and the image features of at least one dimension through a task prediction network of the target task model;
the target task model is obtained by fine tuning a multi-mode pre-training model after pre-training, the multi-mode pre-training model is obtained by pre-training a plurality of prediction tasks on the pre-training model by utilizing multi-mode pre-training data, and weak supervision signals of the prediction tasks are determined based on multi-mode information of videos in the pre-training data.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring multi-modal information of a video, wherein the multi-modal information comprises text modal information of at least one dimension and image modal information of at least one dimension;
respectively inputting the multi-mode information of the video into feature extraction networks of each dimension of a target task model to obtain text features of at least one dimension and image features of at least one dimension, and outputting a prediction result of a target task according to the text features of at least one dimension and the image features of at least one dimension through a task prediction network of the target task model;
The target task model is obtained by fine tuning a multi-mode pre-training model after pre-training, the multi-mode pre-training model is obtained by pre-training a plurality of prediction tasks on the pre-training model by utilizing multi-mode pre-training data, and weak supervision signals of the prediction tasks are determined based on multi-mode information of videos in the pre-training data.
According to the task prediction method, apparatus, computer device, storage medium and computer program product based on video multimodal information, the multimodal information of a video is input into the target task model, and the prediction result of the target task is obtained through the target task model. Because the target task model is obtained by fine-tuning a pre-trained multimodal pre-training model, feature learning for the target task can be achieved by fine-tuning the pre-trained model using only the target training data required by the target task, thereby obtaining the target task model. Therefore, during training of the target task model, the sample size and labeling cost required for business model iteration are effectively reduced, the training efficiency of the target task model is improved, and the efficiency of bringing the model online is further improved. In the pre-training process, the multimodal model is pre-trained on a plurality of prediction tasks, and the weak supervision signal of each prediction task is determined based on the multimodal information of the videos in the pre-training data, so the pre-training is self-supervised: a large number of supervised samples do not need to be labeled manually, which reduces the learning cost and improves the learning efficiency.
Drawings
FIG. 1 is an application environment diagram of a task prediction method based on video multimodal information in one embodiment;
FIG. 2 is a flow chart of a task prediction method based on video multimodal information in one embodiment;
FIG. 3 is a schematic diagram of a structure of a target task model in one embodiment;
FIG. 4 is a flow diagram of a process for obtaining a target task model in one embodiment;
FIG. 5 is a flow chart of steps for determining weak supervisory signals for each predictive task and pre-training a pre-training model for a plurality of predictive tasks based on multi-modal information of a video in one embodiment;
FIG. 6 is a flowchart of a task prediction method based on video multimodal information in another embodiment;
FIG. 7 is an explanatory diagram of a task prediction method based on video multimodal information in one embodiment;
FIG. 8 is a schematic diagram of a video multi-modality information based task prediction system based on information flow in one embodiment;
FIG. 9 is a block diagram of a task prediction device based on video multimodal information in one embodiment;
FIG. 10 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Before describing the embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application are explained as follows.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
Deep learning: the concept of deep learning originates from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, in order to discover distributed feature representations of data.
Video: the videos recommended to users by the platform, including vertical-format mini videos and horizontal-format short videos, provided in the form of an information feed.
MCN (Multi-Channel Network): a product form of multi-channel networks that combines PGC content and, with strong capital backing, guarantees continuous output of content, thereby ultimately achieving stable monetization of the business.
PGC (Professionally Generated Content): an internet term referring to professionally produced content (on video websites) or expert-produced content (on microblogs). It is used broadly for content that is personalized, diverse in viewpoint and virtualized in social relationships. Also known as PPC (Professionally-Produced Content).
Faiss: an open-source retrieval library from the Facebook AI team aimed at clustering and similarity search. It provides efficient similarity search and clustering for dense vectors, supports billion-scale vector retrieval, and is currently the most mature approximate nearest neighbor search library.
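The following is a minimal sketch of how dense-vector similarity retrieval with Faiss can look; the embedding dimension and the random vectors are illustrative placeholders, not data from this application.

```python
import numpy as np
import faiss

d = 128                                                           # embedding dimension (assumed)
video_embeddings = np.random.rand(100000, d).astype("float32")    # indexed video vectors
query = np.random.rand(1, d).astype("float32")                    # query vector

index = faiss.IndexFlatL2(d)      # exact L2 index; IVF/HNSW variants scale to billions of vectors
index.add(video_embeddings)
distances, neighbor_ids = index.search(query, 10)                 # top-10 nearest neighbours
```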
Information flow (feeds): also called a message source, a stream of content that is continuously updated and presented to the user. Feeds are content aggregators that combine several message sources to which a user actively subscribes, typically news websites and blogs, helping the user continuously obtain the latest feed content. Feeds can be displayed in various forms; the main forms are timeline and rank. Timeline is a display mode in which feeds are shown to users in the chronological order in which they were updated; rank computes a weight for the content according to certain factors, which determines the order in which content is presented.
Short video: a way of distributing internet content, typically video with a duration of under 5 minutes that is distributed on new internet media.
With the rapid development of the internet, the threshold for content production has been lowered, and the amount of published content of all kinds is growing at an exponential rate. These contents come from various content production institutions, such as PGC and UGC content from self-media and organizations. Video content, presented in information feeds and distributed by algorithms, has developed rapidly. Content sources on the internet are very broad and numerous, all kinds of video content are available, and the quality of these source channels is uneven. Faced with such massive content, in particular large volumes of video of uneven quality, the content can only be distributed after review.
Before content is distributed, it is manually reviewed and labeled, and video content with quality problems is filtered out directly. However, review efficiency is low when the volume of content is very large, so machine learning algorithms can be used to assist the manual identification of content.
The existing iterative process for information-feed content processing algorithm models generally includes the following steps: requirement determination -> data acquisition -> model training -> model testing and deployment. The model training step generally involves two elements: model structure design and a corresponding pre-training model. Because a pre-training model requires a large amount of data and a large amount of training time, a new pre-training model is usually not trained; instead, an existing pre-training model is used directly (for example, a model pre-trained on picture data such as ImageNet/COCO, with the video quality problem converted into a picture problem on video frames) and then fine-tuned for the corresponding task (Fine-tuning). This is also the typical modeling approach of many current information-feed video content quality processing algorithm models. In addition, supervised learning modeling involves sample construction. For many models in an information feed service, samples are very sparse within the service content, a large number of samples are difficult to collect and accumulate, and the collection and labeling costs are high. Therefore, the main ways of obtaining samples for algorithm modeling at present are to collect content reported or negatively fed back by users and then manually re-check it, and to collect corresponding samples by actively and manually inspecting the state of online content. The overall efficiency of these approaches is very low, and such after-the-fact collection always lags behind the appearance of the problem; in particular, the efficiency of responding to changes and variations in video quality sample problems is very low.
The task prediction method based on video multimodal information provided by the embodiments of the present application can be applied to the application environment shown in fig. 1. The content production terminal 102 and the content consumption terminal 106 communicate with the server 104 via a network. The content production terminal 102 provides content to the server, and the server 104 can distribute content to the content consumption terminal 106. In order to improve the efficiency and quality of content delivery, the server 104 may use neural network models for processing, for example using a content recommendation model to determine the content to be recommended to the content consumption terminal. The neural network models of the server can be obtained by fine-tuning a pre-trained multimodal pre-training model. The server 104 may implement the training method of the neural network model. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, a task prediction method based on video multi-mode information is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step 202, obtaining multi-modal information of a video, wherein the multi-modal information comprises text modal information of at least one dimension and image modal information of at least one dimension.
The text modal information of at least one dimension includes, for example, the topic labels, titles, authors, text recognition results, speech recognition results and user comments of the video.
The image modal information of at least one dimension includes, for example, the cover image and/or key frame images. The cover image may be selected by the author at the time of publication; if it is not selected, it is typically the first frame of the video content.
Step 204, inputting the multi-modal information of the video into feature extraction networks of each dimension of the target task model respectively to obtain text features of at least one dimension and image features of at least one dimension, and outputting a prediction result of the target task according to the text features of at least one dimension and the image features of at least one dimension through a task prediction network of the target task model.
The target task model is obtained by fine tuning a multi-mode pre-training model which is subjected to pre-training, the multi-mode pre-training model is obtained by pre-training a plurality of prediction tasks on the multi-mode pre-training model by utilizing pre-training data, and weak supervision signals of all the prediction tasks are determined based on multi-mode information of videos in the pre-training data.
The network structure of the target task model of one embodiment is shown in fig. 3, and includes an image feature extraction network corresponding to image mode information of each dimension, a text feature extraction network corresponding to text mode information of each dimension, and a task prediction network.
The inputs to the task prediction network may include the text features extracted by the text feature extraction networks, the image features extracted by the image feature extraction networks, and fusion features of the text features and the image features. The text features and image features required for the fusion features can be specified according to the different prediction tasks, for example specifying fusion of the cover image features with the tag features, or fusion of the key frame image features with the tag features.
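As a hedged illustration of the structure described above, the target task model can be sketched as follows; the encoder containers, feature dimension and the simple concatenation-based prediction head are assumptions made for illustration rather than the implementation defined in this application.

```python
import torch
import torch.nn as nn

class TargetTaskModel(nn.Module):
    """Sketch: one feature extraction network per modality dimension plus a task prediction network."""
    def __init__(self, text_encoders: nn.ModuleDict, image_encoders: nn.ModuleDict,
                 feat_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.text_encoders = text_encoders      # e.g. {"title": ..., "ocr": ..., "comments": ...}
        self.image_encoders = image_encoders    # e.g. {"cover": ..., "key_frames": ...}
        n_branches = len(text_encoders) + len(image_encoders)
        self.task_head = nn.Sequential(         # task prediction network (simplified)
            nn.Linear(n_branches * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, text_inputs: dict, image_inputs: dict) -> torch.Tensor:
        feats = [enc(text_inputs[name]) for name, enc in self.text_encoders.items()]
        feats += [enc(image_inputs[name]) for name, enc in self.image_encoders.items()]
        fused = torch.cat(feats, dim=-1)        # fuse text features and image features
        return self.task_head(fused)            # prediction result of the target task
```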
The target task model structure is obtained by fine-tuning on the basis of a pre-trained multimodal pre-training model. After pre-training is completed and the multimodal pre-training model is obtained, the model is used as the basis, and only the target training data required by the target task is needed to fine-tune it so as to achieve the feature learning corresponding to the target task. Therefore, for the training process of the target task model, the sample size and labeling cost for business model iteration can be effectively reduced, and model development time is shortened.
As an application example of the application applied to the image-text retrieval scene, after the pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is finely tuned according to the target training data of the image-text retrieval, and the task model of the image-text retrieval is obtained. When the method is applied, the multi-mode information of the video is respectively input into the feature extraction network of each dimension of the task model of the image-text retrieval to obtain the text feature of at least one dimension and the image feature of at least one dimension, and the task prediction network of the task model of the image-text retrieval outputs the prediction result of the image-text retrieval according to the text feature of at least one dimension and the image feature of at least one dimension.
As an application example of the method applied to content classification, after a pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is subjected to fine adjustment according to target training data of the content classification, and a task model of the content classification is obtained. When the method is applied, the multi-mode information of the video is respectively input into the feature extraction network of each dimension of the task model of the content classification, the text features of at least one dimension and the image features of at least one dimension are obtained, and the task prediction network of the task model of the content classification outputs the prediction result of the content classification according to the text features of at least one dimension and the image features of at least one dimension.
As an application example of the method applied to content quality detection, after a pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is subjected to fine adjustment according to target training data of content quality, and a task model of content quality detection is obtained. When the method is applied, the multi-mode information of the video is respectively input into the feature extraction network of each dimension of the task model for content quality detection to obtain text features of at least one dimension and image features of at least one dimension, and the task prediction network of the task model for content quality detection outputs a content quality prediction result according to the text features of at least one dimension and the image features of at least one dimension.
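For any of the downstream applications above, the fine-tuning stage itself can be sketched roughly as follows; the optimizer, learning rate, epoch count and the option of freezing the pre-trained encoders are illustrative assumptions, not parameters disclosed in this application.

```python
import torch
import torch.nn as nn

def finetune(model: nn.Module, train_loader, num_epochs: int = 3, lr: float = 2e-5,
             freeze_encoders: bool = False):
    """Sketch: fine-tune a pre-trained multimodal model on the small target task dataset."""
    if freeze_encoders:
        for name, param in model.named_parameters():
            if "task_head" not in name:          # keep only the new prediction head trainable
                param.requires_grad = False
    optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(num_epochs):
        for batch in train_loader:               # batches of target training data
            logits = model(batch["text_inputs"], batch["image_inputs"])
            loss = criterion(logits, batch["labels"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```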
The multi-mode pre-training model is obtained by pre-training a plurality of prediction tasks on the multi-mode model by utilizing pre-training data, and weak supervision signals of all the prediction tasks are determined based on multi-mode information of videos in the pre-training data, so that the pre-training mode is self-supervision training, a large number of supervision samples do not need to be marked manually, the learning cost is reduced, and the learning efficiency is improved. Compared with the traditional supervised learning mode, the self-supervised learning mode can utilize Internet massive content data and real massive content data uploaded by users of all information platforms.
The plurality of prediction tasks may include one or more single-mode similarity training tasks, or one or more cross-mode self-supervision training tasks with text features as weak supervision signals, or cross-mode matching prediction tasks of images and texts.
According to the task prediction method based on video multimodal information, the multimodal information of a video is input into the target task model, and the prediction result of the target task can be obtained through the target task model. Since the target task model is obtained by fine-tuning a pre-trained multimodal pre-training model, feature learning corresponding to the target task can be achieved by fine-tuning the pre-trained model using only the target training data required by the target task, thereby obtaining the target task model. In the pre-training process, the multimodal model is pre-trained on a plurality of prediction tasks, and the weak supervision signal of each prediction task is determined based on the multimodal information of the videos in the pre-training data, so the pre-training is self-supervised: a large number of supervised samples do not need to be labeled manually, which reduces the learning cost and improves the learning efficiency.
In another embodiment, as shown in fig. 4, the obtaining manner of the target task model includes:
step 402, pre-training data is obtained, the pre-training data comprising multi-modal information of video samples.
The pre-training data is training data required by the multi-mode pre-training model for pre-training. Wherein the pre-training data may be obtained from the service. Taking the service as a video information flow service as an example, the multi-mode information of the video can be obtained from the video information flow as pre-training data.
The multimodal information includes text modality information of at least one dimension, such as a topic label, title, author, text recognition result, speech recognition result, and user comment of the video.
Video content topic tags (hashtags) determine how accurately traffic is directed to the video. On a feed content platform, publishers of short videos typically describe the content of the short video in some way. In the description, many users add "#" plus text; such descriptive language is called a topic label and can be an excellent source of model training data. Topic labels are keywords chosen by the video content creators and express the key words or core semantic information of the content well.
The topic label in the information flow business has the following characteristics:
(1) Inexhaustible supply. For example, the number of topic labels accumulated on a platform over the past year exceeds 30 million; they have wide coverage and are representative.
(2) The content is relatively balanced with respect to distribution, basically reflects the distribution condition of the content, and various types of video content have corresponding topic labels.
(3) High accuracy. The relevance between a topic label and the work comes from manual input by the original author, and it can be further improved after cleaning with author-grade rules and lexicon enhancement rules (such as filtering out stop words and filler words and removing overlaps with various titles).
A topic tag can be regarded as a keyword in SEO (Search Engine Optimization), and typical information feed platforms all have such keywords. When a user searches for a topic tag, such as #new year blessing#, if a piece of video content happens to carry this tag, that content will appear in the latest tag search results. In order to make the content corresponding to these topic labels more representative and generalizable, the topic labels are also used as queries for a search engine, and part of the corresponding video content and its structured information, such as title and author information, are retrieved from the internet.
In addition to the cover image (usually chosen by the author at publication; if not chosen, it usually defaults to the first frame of the video), the video content itself contains a large amount of textual description information. This text is usually a semantic description of the video content and corresponds to it, and a large amount of video text extracted, processed and cleaned with the help of auxiliary models can serve as a source of pre-training data. Specifically: the video text modality information may include the title, which is the publisher's subjective description of what the video expresses and typically covers its high-level semantics. In practice, many videos have no title, or the title conveys too little information, so the text modal information of a video may also be a text recognition result, obtained by recognizing the text in the video with OCR technology. The text modal information of a video may also be a speech recognition result, obtained by converting speech to text with ASR. OCR data has some problems, for example: recognition is inaccurate while pictures are switching, OCR at fixed positions needs to be deduplicated, spoken-caption OCR needs to be retained, and scrolling news OCR needs to be deleted. Therefore, denoising is performed on the OCR recognition results (the handling of ASR speech-to-text or video caption description information is similar), including filtering out single-character/pure-number/pure-letter OCR results, filtering OCR results whose bounding boxes (bbox) have only a small offset between two adjacent frames and a high character repetition rate, and filtering OCR results whose bbox lies at the bottom of the screen with a small height. The denoised extracted text is concatenated with the title text as one text-modality input.
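A rough sketch of these denoising rules follows; the thresholds, field names and bounding-box format are assumptions made for illustration.

```python
import re

def clean_ocr(ocr_items, frame_height, min_len=2, bottom_ratio=0.85, min_height_ratio=0.03):
    """Sketch of the OCR denoising rules: drop single characters, pure digits/letters,
    near-duplicate boxes from adjacent frames, and small boxes at the bottom of the screen."""
    kept, prev = [], None
    for item in ocr_items:                        # item: {"text": str, "bbox": (x, y, w, h)}
        text, (x, y, w, h) = item["text"].strip(), item["bbox"]
        if len(text) < min_len:                   # single-character OCR
            continue
        if re.fullmatch(r"[0-9]+|[A-Za-z]+", text):    # pure numbers or pure letters
            continue
        if y > frame_height * bottom_ratio and h < frame_height * min_height_ratio:
            continue                              # small box at the bottom of the screen
        if prev and prev["text"] == text and abs(prev["bbox"][1] - y) < 5:
            continue                              # near-duplicate of the previous frame's box
        kept.append(item)
        prev = item
    return " ".join(i["text"] for i in kept)

# The denoised OCR text is then concatenated with the title as one text-modality input:
# text_input = title + " " + clean_ocr(ocr_results, frame_height=720)
```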
The text modal information of a video can also be user comments, which as a whole are longer than topic labels. The comment section is a gathering place where users share their thoughts and views on a video work; each work can have many comments, and the average number of comments on high-quality works can exceed a thousand. Unlike topic labels, which are generated by authors, comments are produced directly by users and truly reflect the users' understanding of the work. At the same time, comments describe the content and subject of the work from multiple angles. Before comments are used, low-quality comments are filtered out with a corresponding quality model, and then meaningless content such as short strings of pure letters and numbers is filtered out.
The multimodal information also includes image modality information of at least one dimension, such as a cover image and/or a key frame image. The cover image may be selected by the author at the time of distribution and if not selected, is typically the first frame of the video content.
After the multimodal information of the video samples collected from multiple channels is preprocessed, data augmentation and transformation are applied to it, and the transformed versions can be used for self-supervised learning. For example, the albumentations package is an API written specifically for data augmentation; it contains a large number of augmentation operations on pictures, such as rotation, cropping, Gaussian noise, masking, color conversion and filtering, and can be used directly for picture preprocessing.
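A minimal sketch of such picture augmentation with the albumentations package follows; the specific transforms, probabilities and the placeholder frame are illustrative choices.

```python
import albumentations as A
import numpy as np

augment = A.Compose([
    A.Rotate(limit=15, p=0.5),                    # rotation
    A.RandomCrop(height=200, width=200, p=0.5),   # cropping
    A.GaussNoise(p=0.3),                          # Gaussian noise
    A.CoarseDropout(p=0.3),                       # masking
    A.ToGray(p=0.2),                              # color conversion
])

frame = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)   # placeholder video frame
augmented = augment(image=frame)["image"]   # changed version usable for self-supervised learning
```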
Step 404, determining weak supervisory signals of each prediction task based on multi-modal information of the video sample, and performing pre-training of a plurality of prediction tasks on the pre-training model.
In this embodiment, the multimodal model is pre-trained by multiple prediction tasks to learn the commonality characteristics of the video samples. In the pre-training process, the weak supervision signals of all the prediction tasks are determined based on the multi-mode information of the video samples, so that the pre-training mode is self-supervision training, a large number of supervision samples are not required to be marked manually, the learning cost is reduced, and the learning efficiency is improved. Compared with the traditional supervised learning mode, the self-supervised learning mode can utilize Internet massive content data and real massive content data uploaded by users of all information platforms.
As an example of determining weak supervision signals of each prediction task for pre-training based on multi-modal information, disturbance processing may be performed on the text information of a video sample, and the similarity between the text information before disturbance and the text information after disturbance may be used as the weak supervision signal for contrastive learning.
As an example of determining weak supervision signals of each prediction task for pre-training based on multi-modal information, disturbance processing may be performed on the image information, and the similarity between the image information before disturbance and the image information after disturbance may be used as the weak supervision signal for contrastive learning.
As an example of determining weak supervision signals of each prediction task for pre-training based on multi-modal information, disturbance processing may be performed on the fusion features, and the similarity between the fusion features before disturbance and the fusion features after disturbance may be used as the weak supervision signal for contrastive learning.
As an example of determining weak supervision signals of each prediction task for pre-training based on multi-modal information, cross-modal contrastive learning may be performed using the degree of matching between image information and text information as the weak supervision signal.
As an example of determining weak supervisory signals of each prediction task for pre-training based on multi-modal information, the classification task learning of the labels may also be performed using the labels as weak supervisory signals.
The pre-training tasks can be executed in parallel, so that the multi-mode model is pre-trained through a plurality of prediction tasks to learn the common characteristics of the video samples.
And step 406, fine tuning the multi-mode pre-training model after pre-training based on the target training data of the target task to obtain a target task model.
After pre-training is completed and the multimodal pre-training model is obtained, business model modeling is carried out on this basis: the pre-trained multimodal pre-training model is fine-tuned with the target training data of the target task. This effectively reduces the sample size and labeling cost required for business model iteration, accelerates the development of video multimodal business algorithm models, improves the modeling effect, and shortens model development time.
As an application example of the application applied to the image-text retrieval scene, after the pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is finely tuned according to the image-text retrieval target training data, and the image-text retrieval target task model is obtained.
As an application example of the method applied to content classification, after a pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is subjected to fine adjustment according to target training data of the content classification, and a target task model of the content classification is obtained.
As an application example of the method applied to content recommendation, after a pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is subjected to fine adjustment according to target training data of content recommendation, and a target task model of content recommendation is obtained.
As an application example of the method applied to content quality detection, after a pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is subjected to fine adjustment according to target training data of content quality, and a target task model of content quality detection is obtained.
In the above manner of obtaining the target task model, on the one hand, the multimodal pre-training model is pre-trained on a plurality of prediction tasks based on the multimodal information of videos, and business model modeling and fine-tuning are carried out on this basis, which effectively reduces the sample size and labeling cost required for business model iteration, accelerates the development of video multimodal business algorithm models, and shortens model development time. On the other hand, the weak supervision signals of the prediction tasks are determined based on the multimodal information of the video samples, so the pre-training of the multimodal pre-training model on the plurality of prediction tasks is self-supervised: a large number of supervised samples do not need to be labeled, which reduces the learning cost and improves the learning efficiency.
In another embodiment, the multi-mode pre-training model has a structure as shown in fig. 3, and includes an image feature extraction network corresponding to image mode information of each dimension, a text feature extraction network corresponding to text mode information of each dimension, and a task prediction network corresponding to each prediction task.
The image feature extraction network can use a Transformer structure, which captures the relationship between local and global features better than a CNN and keeps the visual and text modality models isomorphic, facilitating subsequent multimodal fusion. The image modality information of each dimension has a corresponding image feature extraction network. For example, if the images include a cover image and key frame images, the image feature extraction networks include an image feature extraction network corresponding to the cover image and one corresponding to the key frame images.
The text feature extraction network also uses a Transformer structure, which keeps the visual and text modality models isomorphic and facilitates subsequent multimodal fusion. The text feature extraction networks include networks corresponding to the title, the topic label, the author information, the user comments, the text recognition result and the speech recognition result.
The task prediction networks may likewise use a Transformer structure. It will be appreciated that different prediction tasks have corresponding task prediction networks.
As shown in fig. 3 and 5, determining weak supervisory signals of each prediction task based on multi-modal information of a video, and performing pre-training of a plurality of prediction tasks on a pre-training model, includes:
step 502, inputting the multi-mode information of the video sample into the feature extraction network of each dimension of the multi-mode pre-training model respectively to obtain text features of at least one dimension and image features of at least one dimension.
For example, the topic label, the author information, the user comment, the text recognition result and the speech recognition result are respectively input into a corresponding text feature extraction network, and the text feature extraction network is in a Bert (Transformer Encoder) model structure to obtain text features of each dimension.
For example, the cover image and the key frame images are respectively input into image feature extraction networks, such as a Vision Transformer structure, to obtain the image features of each dimension.
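With off-the-shelf encoders, the per-dimension feature extraction could look like the sketch below; the particular pre-trained checkpoints are assumptions, since none are named in this application.

```python
import torch
from transformers import BertTokenizer, BertModel, ViTImageProcessor, ViTModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")            # assumed checkpoint
text_encoder = BertModel.from_pretrained("bert-base-chinese")             # Transformer Encoder (BERT)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")   # Vision Transformer

def encode_text(text: str) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    return text_encoder(**tokens).last_hidden_state[:, 0]     # [CLS] vector as the text feature

def encode_image(pil_image) -> torch.Tensor:
    pixels = image_processor(images=pil_image, return_tensors="pt")
    return image_encoder(**pixels).last_hidden_state[:, 0]    # [CLS] vector as the image feature
```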
And step 504, fusing text features of specified dimensions corresponding to the prediction tasks and image features of the specified dimensions to obtain at least one fusion feature required by the prediction tasks.
The fusion features may fuse, across modalities, the text features and image features specified by the prediction task according to its requirements. Experiments show that, considering computation cost and effect, preferred fusion modes are fusing the cover image with the text recognition result, fusing the key frame images with the text recognition result, fusing the key frame images with the author information, and fusing the key frame images with the user comments.
Specifically, the manner of fusion may include the following two:
first kind: and fusing the image features of the cover and the text modal features of the non-topic labels to obtain fused features.
Second kind: fusing the key frame image features and the text modal features of the non-topic labels to obtain fused features; the text modality features of the non-topic tag include video titles, author information of the video, user comments of the video, text recognition results of the video, and voice recognition results of the video.
In terms of modality fusion, traditional methods generally perform simple concatenation and matrix operations. Here, modality fusion is carried out with a cross-modal multi-head attention mechanism, and the image and text features obtained at different scales interact starting from the bottom layers of the model, so that the low-level multimodal features participate more fully in the model's representation.
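One possible arrangement of such cross-modal multi-head attention fusion is sketched below, with the text tokens attending to the image patches; this is an illustrative assumption rather than the exact fusion defined in this application.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch: fuse text token features and image patch features with multi-head cross-attention."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, n_text, dim); image_patches: (batch, n_patches, dim)
        attended, _ = self.cross_attn(query=text_tokens, key=image_patches, value=image_patches)
        fused = self.norm(text_tokens + attended)    # residual connection
        return fused.mean(dim=1)                     # pooled fusion feature

fusion = CrossModalFusion()
fused_feature = fusion(torch.randn(2, 32, 768), torch.randn(2, 197, 768))
```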
And step 506, determining weak supervision signals of the prediction tasks according to at least one of text features, image features and fusion features designated by the prediction tasks.
In this embodiment, compared with the traditional supervised learning method, weak supervision signals of the prediction task are respectively determined according to at least one of text features, image features and fusion features specified by the prediction task, so that the pre-training method is self-supervision training, a large number of supervision samples do not need to be manually marked, learning cost is reduced, and learning efficiency is improved.
Taking the prediction task being text single-modality contrastive learning as an example, the similarity between the text features before disturbance and the text features after disturbance can be used as the weak supervision signal of text single-modality contrastive learning.
Taking the prediction task being image single-modality contrastive learning as an example, the similarity between the image features before disturbance and the image features after disturbance can be used as the weak supervision signal of image single-modality contrastive learning.
Taking the prediction task as an image-text matching task as an example, the matching degree of the image features and the text features can be used as a weak supervision signal of the image-text matching prediction task.
Taking the prediction task as a label classification task as an example, the label can be used as a weak supervision signal of the label classification task.
Step 508, inputting the input features required by each prediction task into a task prediction network corresponding to the prediction task, respectively performing the prediction tasks, and calculating the loss of each prediction task according to the weak supervision signals; the number of the prediction tasks is at least two; the input feature is at least one of a text feature, an image feature, and a fusion feature.
Specifically, input features required by each prediction task are input to a task prediction network corresponding to the prediction task, the prediction tasks are respectively carried out, prediction results of each prediction task are obtained, and loss of the prediction task, such as similarity loss, cross entropy loss and the like, is calculated according to the prediction results and weak supervision signals.
Step 510, back-propagating according to the loss of each prediction task, and updating the parameters of the pre-training model.
Each training task has a corresponding loss function. The loss is calculated based on each prediction task's loss function, back propagation is performed according to the loss, and the parameters of the multimodal model are updated; specifically, the parameters of the text feature extraction networks, the image feature extraction networks and the task prediction networks are updated.
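The joint update over all pre-training tasks could be sketched as follows; the equal weighting of the task losses and the callable interface are assumptions made for illustration.

```python
import torch

def pretraining_step(model, batch, optimizer, task_losses: dict):
    """Sketch: run every prediction task, sum the losses, and update all network parameters.

    `task_losses` maps a task name to a callable loss_fn(model, batch) built from that task's
    weak supervision signal; the names and interface are illustrative."""
    losses = {name: loss_fn(model, batch) for name, loss_fn in task_losses.items()}
    total_loss = torch.stack(list(losses.values())).sum()    # equal weighting assumed
    optimizer.zero_grad()
    total_loss.backward()                # back-propagate through every prediction task
    optimizer.step()                     # update feature extraction and task prediction networks
    return {name: loss.item() for name, loss in losses.items()}
```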
When the multi-modal pre-training model meets the training stop condition, a trained multi-modal pre-training model is obtained, and the multi-modal pre-training model is utilized to accurately extract image features and text features.
In this embodiment, the weak supervisory signals of the prediction tasks are respectively determined according to at least one of the text features, the image features and the fusion features specified by the prediction tasks, so that the multi-mode pre-training model can be pre-trained by using a plurality of weak-supervision training tasks, a large number of supervision samples do not need to be marked, the learning cost is reduced, and the learning efficiency is improved.
In another embodiment, determining the weak supervision signal of each prediction task according to at least one of the text features, image features and fusion features specified by the prediction task includes: if the prediction task is a contrastive learning task within the same modality, constructing a positive sample pair from at least two features of the same modality specified by the contrastive learning task, and using the similarity of the positive sample pair as the weak supervision signal.
Specifically, the same-modality contrastive learning task may be a contrastive learning task of the text modality or a contrastive learning task of the image modality.
If the prediction task is a same-modality contrastive learning task, a positive sample pair can be constructed from at least two features of the same modality specified by the contrastive learning task, with the similarity of the positive sample pair as the weak supervision signal; a negative sample pair can also be constructed from at least two features of the same modality specified by the contrastive learning task, with the similarity of the positive sample pair and the dissimilarity of the negative sample pair as the weak supervision signals.
The core idea of contrastive learning is to construct diversity from the original samples through data augmentation. The loss function is designed to pull the positive samples closer to the anchor samples and push the negative samples farther away; in this process, the network more easily learns the features that the source sample and its augmented versions have in common. Specifically, the features before and after data augmentation in a certain dimension of a modality can serve as a positive sample pair, features of different dimensions of the modality can serve as negative sample pairs, and the similarity of the positive pair and the dissimilarity of the negative pairs can be used as the weak supervision signals. Weak supervision signals for contrastive learning constructed in this way do not require labeling a large number of supervised samples, which reduces the learning cost and improves the learning efficiency.
Specifically, if the prediction task is a same-modality contrastive learning task, constructing a positive sample pair from at least two features of the same modality specified by the contrastive learning task and using the similarity of the positive sample pair as the weak supervision signal includes: if the prediction task is a same-modality contrastive learning task, performing data augmentation on the original modality features of at least one dimension of the modality and constructing a positive sample pair from the original modality features and the augmented features, or taking two features output by different feature extraction layers of the feature extraction network of at least one dimension of the modality as a positive sample pair, and using the similarity of the positive sample pair as the weak supervision signal. Furthermore, modality features of different dimensions can serve as negative sample pairs.
The original mode features are features before disturbance of the mode.
Correspondingly, inputting the input features required by each prediction task into the task prediction network corresponding to the prediction task, performing the prediction tasks respectively, and calculating the loss of each prediction task according to the weak supervision signal includes: inputting the positive sample pair into the task prediction network corresponding to the contrastive learning task to calculate the similarity distance of the positive sample pair, and obtaining the similarity loss according to the similarity distance of the positive sample pair.
Taking same-modality contrastive learning of the image modality as an example, augmentation can be performed on at least one of the images to obtain an augmented image. The image and the augmented image are respectively input into the image feature extraction network to obtain a first image feature and a second image feature, and a positive sample pair is constructed from them; images of two different dimensions are respectively input into the image feature extraction network to obtain a third image feature and a fourth image feature, and a negative sample pair is constructed from them. The image may be a cover image or a key frame image.
If the cover image is augmented, an augmented cover image is obtained, and the cover image and the augmented cover image are respectively input into the image feature extraction network to obtain the first image feature and the second image feature. The cover image and a non-key-frame image are input into the image feature extraction network to obtain the third image feature and the fourth image feature.
The augmentation may be rotation, cropping, adding Gaussian noise, masking, grayscale processing, adding filters, and so on, to obtain the augmented image. The augmented image can be used for self-supervised learning; for example, the original image and the corresponding grayscale image can be used for self-supervised learning. As an example, the image may be processed using an application programming interface (API) dedicated to data augmentation, such as the albumentations package, to obtain the augmented image.
The corresponding task prediction network is a fully connected layer. The single-modality contrastive learning task considers the image and the augmented image to be similar, so the constructed loss function is a similarity loss function. The parameters of the image feature extraction network and the task prediction network are updated by back-propagation according to the similarity value of the first image feature and the second image feature. The single-modality contrastive learning task performs contrastive learning by reducing the distance of positive sample pairs and increasing the distance of negative sample pairs.
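An InfoNCE-style in-batch loss is one common way to pull positive pairs together and push negative pairs apart; whether this application uses exactly this formulation is not stated, so the following is only an illustrative sketch.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07):
    """Sketch: each anchor's positive is the augmented view at the same batch index;
    every other sample in the batch serves as a negative."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

# e.g. image single-modality contrastive learning on cover images and their augmented versions:
# loss = info_nce_loss(encode(cover_images), encode(augmented_cover_images))
```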
Taking contrast learning of the same mode as contrast learning of a text mode as an example, at least one text is input into a corresponding text feature extraction network, and a first text feature and a second text feature are output through different feature extraction layers of the text feature extraction network.
The text can be a topic label, author information, a user comment, a text recognition result, or a voice recognition result.
The text is input into the text feature extraction network, and the same text is output through different feature extraction layers of that feature extraction network; because the features pass through different feature extraction layers, different first and second text features are obtained.
The corresponding task prediction network is a fully connected layer, and the single-mode contrast learning task considers that the first text feature and the second text feature output for the same text by different feature extraction layers of the text feature extraction network are similar, so the constructed loss function is a similarity loss function. Parameters of the text feature extraction network and the task prediction network are adjusted by back propagation according to the similarity value of the first text feature and the second text feature.
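A minimal sketch, under the assumption of a Hugging Face BERT-style text encoder (the choice of layers is illustrative, not mandated by this application), of taking the two positive-pair features from different feature extraction layers:

```python
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

def text_layer_contrastive_loss(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    hidden_states = encoder(**inputs).hidden_states          # one tensor per encoder layer
    # Positive pair: the same text represented by two different feature extraction layers.
    first_text_feature = hidden_states[-1][:, 0]             # [CLS] vector of the last layer
    second_text_feature = hidden_states[-2][:, 0]            # [CLS] vector of the layer below it
    sim = F.cosine_similarity(first_text_feature, second_text_feature, dim=-1)
    return (1.0 - sim).mean()                                # similarity loss for the positive pair
```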
In this embodiment, the parameters of the text feature extraction network or the image feature extraction network can be trained by the single-mode contrast learning task without manually labeling a large number of supervision samples, so that the learning cost is reduced and the learning efficiency is improved.
In another embodiment, determining the weak supervisory signal of each prediction task according to at least one of the text feature, the image feature and the fusion feature specified by the prediction task, respectively, includes: if the prediction task is a cross-modal training task, determining a weak supervision signal according to text features specified by the cross-modal training task.
The cross-modal training task refers to a task of training by using information of at least two different modes, such as a text matching task according to text information and image information, or a multi-label classification task according to the text information and fusion information, and the like. In the cross-modal training task, corresponding text features can be designated as weak supervision signals according to the prediction task, so that the prediction task does not need to be marked with a large number of supervision samples, the learning cost is reduced, and the learning efficiency is improved.
In another embodiment, when the cross-modal training task is a multi-label classification task, the text feature designated by the multi-label classification task as the weak supervision signal is the topic label.
Specifically, a video content topic label (Hashtag) largely determines how accurately traffic is matched to the video. On the short-video side of an information flow content platform, publishers typically describe the content of the short video in some way. In the description, many users add "#" followed by text; such descriptive text is called a topic label, and it is an excellent source of model training data. Topic labels are the keywords that the video content creator considers important, and can well express the key or core semantic information of the content. Topic labels in the information flow business have the following characteristics:
(1) They are abundant and practically inexhaustible. For example, the number of topic labels accumulated on a platform in the past year exceeds 30 million, with wide coverage and good representativeness.
(2) Their distribution over content is relatively balanced and basically reflects the distribution of the content itself; all types of video content have corresponding topic labels.
(3) They have high accuracy. The topic label is manually entered by the original author and is therefore relevant to the work, and the relevance can be further improved by cleaning with author-grade rules and word-stock enhancement rules (such as filtering out stop words and filler words, and merging duplicated titles).
The topic label can be regarded as a keyword in SEO (Search Engine Optimization), and typical information flow platforms all have such keywords; when a user searches for a topic label such as #New Year blessing#, any video content that happens to carry this label appears in the latest results for that label. In order to make the content corresponding to these topic labels more representative and general, the topic labels are also used as queries to a search engine, and part of the corresponding video content and its structured information, such as title and author information, are retrieved from the Internet.
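Purely as an illustration of how such topic labels could be pulled out of a short-video description (the pattern and cleaning step are assumptions, not part of the disclosure):

```python
import re

def extract_topic_labels(description: str) -> list:
    # Topic labels are written by the author as "#" plus text, e.g. "#New Year blessing#".
    tags = [t.strip() for t in re.findall(r"#([^#]+)#", description)]
    # A real pipeline would further clean the tags (stop words, filler words, author grade).
    return list(dict.fromkeys(tags))                         # de-duplicate, keep order

print(extract_topic_labels("Happy holidays! #New Year blessing# #family#"))
# ['New Year blessing', 'family']
```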
When the cross-modal training task is a multi-label classification task, the topic labels are used as weak supervision signals, fusion features are input into a task prediction network corresponding to the multi-label classification task, predicted topic labels are obtained, cross entropy loss is calculated according to the topic labels and the predicted topic labels, and then parameters of a multi-modal pre-training model are adjusted according to the cross entropy loss.
The fusion feature can be obtained by fusing the image feature of the cover and the text mode feature of the non-topic label, and also can be obtained by fusing the image feature of the key frame and the text mode feature of the non-topic label. The text modality features of the non-topic tag include video titles, author information of the video, user comments of the video, text recognition results of the video, and voice recognition results of the video.
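A hedged sketch of the multi-label classification task described above, assuming a PyTorch-style fully connected head; the names and dimensions are illustrative, not the recorded implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiLabelHead(nn.Module):
    """Task prediction network for the multi-label (topic label) classification task."""
    def __init__(self, fusion_dim: int, num_topic_labels: int):
        super().__init__()
        self.fc = nn.Linear(fusion_dim, num_topic_labels)    # a single fully connected layer

    def forward(self, fusion_feature):
        return self.fc(fusion_feature)                        # logits over the topic-label vocabulary

def multilabel_classification_loss(head, fusion_feature, topic_label_targets):
    # topic_label_targets is a multi-hot vector built from the video's topic labels,
    # which acts as the weak supervision signal.
    logits = head(fusion_feature)
    return F.binary_cross_entropy_with_logits(logits, topic_label_targets)
```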
In this embodiment, since the topic labels are manually input by the original author and are used as weak supervision signals for training the multi-label classification task, the pre-training model can be helped to learn the semantic information of the video content, and the semantic features of the video content can be accurately extracted.
In another embodiment, the cross-modal prediction task is a cross-modal first matching prediction task of the image and the text, that is, predicting whether the image and the text match. The model assumes that the image and the text are matched, and the set matching degree can represent a preset matching range, such as 80%-100%; in this case, the set matching degree of the text feature and the image feature required by the first matching prediction task is used as the weak supervision signal.
Input features required by each prediction task are input to a task prediction network corresponding to the prediction task, the prediction tasks are respectively carried out, and the loss of each prediction task is calculated according to a weak supervision signal, and the method comprises the following steps: respectively inputting text features and image features required by a first matching prediction task into a corresponding task prediction network to predict a first prediction matching degree of the text features and the image features; and calculating the cross entropy loss according to the set matching degree and the first prediction matching degree.
Specifically, the cover image and the text recognition result can be considered to be matched, the corresponding task prediction network is a full-connection layer, the corresponding full-connection layer is called to predict the first prediction matching degree of text features and image features, cross entropy loss is calculated according to the set matching degree and the first prediction matching degree, and parameters of the full-connection layer, the text feature extraction network and the image feature extraction network in the multi-mode pre-training model are adjusted according to cross entropy loss back propagation.
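For illustration, and under the assumption of a single fully connected layer as the task prediction network, the first matching prediction task and its loss could look roughly like this (all names are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextMatchHead(nn.Module):
    """Fully connected task prediction network for the first matching prediction task."""
    def __init__(self, text_dim: int, image_dim: int):
        super().__init__()
        self.fc = nn.Linear(text_dim + image_dim, 1)

    def forward(self, text_feature, image_feature):
        pair = torch.cat([text_feature, image_feature], dim=-1)
        return torch.sigmoid(self.fc(pair))                   # first prediction matching degree in [0, 1]

def matching_loss(head, text_feature, image_feature, set_matching_degree: float = 0.9):
    # The set matching degree (e.g. a value within the 80%-100% range) is the weak supervision signal.
    predicted = head(text_feature, image_feature)
    target = torch.full_like(predicted, set_matching_degree)
    return F.binary_cross_entropy(predicted, target)
```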
In this embodiment, the cross-modal matching prediction task of the image and the text can train the feature extraction network of the image and the text, and assist the self-supervision pre-training model to extract the semantic features of the video content.
In another embodiment, when the cross-modal training task is a cross-modal second matching prediction task of the topic label and the fusion feature, the set matching degree of the fusion feature and the topic label required by the second matching prediction task is used as a weak supervision signal.
Specifically, the topic label is a keyword that the video content creator considers important and can well express the key or core semantic information. The cross-modal second matching prediction task of the label and the fusion feature predicts whether the label and the fusion feature match; the model assumes that the label and the fusion feature are matched, and the set matching degree can represent a preset matching range, such as 80%-100%.
Input features required by each prediction task are input to a task prediction network corresponding to the prediction task, the prediction tasks are respectively carried out, and the loss of each prediction task is calculated according to a weak supervision signal, and the method comprises the following steps: respectively inputting fusion features and topic labels required by a second matched prediction task into a corresponding task prediction network, and predicting second prediction matching degrees of topic label features and fusion features; and calculating the cross entropy loss according to the set matching degree and the second prediction matching degree. Specifically, the labels and the fusion features can be considered to be matched, the corresponding task prediction network is a full-connection layer, the second prediction matching degree of the corresponding full-connection layer prediction label features and the fusion features is called, cross entropy loss is calculated according to the set matching degree and the second prediction matching degree, and parameters of the full-connection layer, the text feature extraction network and the image feature extraction network in the multi-mode pre-training model are adjusted according to cross entropy loss back propagation.
In this embodiment, the cross-modal matching prediction task of the tag text and the fusion feature can train the feature extraction network of the image and the text, and assist the self-supervision pre-training model to extract the semantic features of the video content.
In another embodiment, when the cross-modal predictive task is a cross-modal mask language predictive task, the mask language predictive task specifies a weak supervisory signal as a pre-mask text feature.
Specifically, when the cross-modal prediction task is a cross-modal mask language prediction task, the method further includes: shielding the characters in the image to obtain a shielded image; and performing word recognition processing on the image after shielding to obtain first words, and performing word recognition processing on the image before shielding to obtain second words.
Inputting the input features required by each prediction task into the task prediction network corresponding to that prediction task, performing the prediction tasks respectively, and calculating the loss of each prediction task according to the weak supervision signal includes: inputting the features of the first characters into the prediction task network corresponding to the masking language prediction task to obtain a prediction result of the masked characters, and calculating the cross entropy loss according to the prediction result and the second characters. The masking language prediction task may be an MLM (Masked Language Modeling) task that predicts the masked words in a text sequence mainly based on the image content and the context information of the text sequence. For example, for an image containing the text "english learning requires a certain language environment", pre-training is performed by masking some of the words and having the prediction network predict the probability of occurrence of those words. The corresponding task prediction network is a text generation network: the first character features are input into the text generation network to obtain the prediction result of the masked characters, the cross entropy loss is calculated according to the prediction result and the second characters, and parameters of the text generation network, the text feature extraction network and the image feature extraction network in the multi-mode pre-training model are adjusted by back propagation according to the cross entropy loss.
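A simplified sketch of the masking step, assuming a BERT-style masked-language head; the image features are omitted for brevity, so this only illustrates how the second characters act as the weak supervision signal for the cross entropy loss:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-chinese")

def masked_language_loss(second_text: str, mask_ratio: float = 0.15):
    # second_text: the text recognized from the image before masking (the weak supervision signal).
    enc = tokenizer(second_text, return_tensors="pt", truncation=True, max_length=64)
    labels = enc["input_ids"].clone()
    # Randomly mask part of the tokens, imitating the words hidden in the masked image.
    mask = torch.rand(labels.shape) < mask_ratio
    mask &= labels != tokenizer.cls_token_id
    mask &= labels != tokenizer.sep_token_id
    enc["input_ids"][mask] = tokenizer.mask_token_id
    labels[~mask] = -100                                     # only the masked positions are scored
    # BertForMaskedLM returns the cross entropy between the predicted and the original words.
    return mlm_model(**enc, labels=labels).loss
```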
In the embodiment, through the cross-modal shielding language prediction task, text features are utilized as weak supervision signals, a large number of supervision samples do not need to be marked, the learning cost is reduced, and the learning efficiency is improved.
As shown in fig. 6 and 7, the task prediction method based on the video multimodal information includes three stages:
the first stage is the pre-training stage of the multi-mode pre-training model, the second stage is fine tuning the pre-trained multi-mode pre-training model to obtain the target task model, and the third stage is task prediction by the target task model.
The first phase comprises:
step 600, pre-training data is obtained, the pre-training data comprising multi-modal information of video samples.
The multi-modal information of the video comprises text modal information of at least one dimension, such as topic labels, titles, authors, text recognition results, voice recognition results and user comments of the video, and image modal information of at least one dimension, such as cover images and/or key frame images.
After the multi-mode information of the video is acquired, the multi-mode information is preprocessed.
The preprocessing may include: the video content cover image is processed with the Albumentations package for data enhancement (Albumentations is a package written specifically for data enhancement that contains a large number of enhancement operations, such as rotation, cropping, Gaussian noise, masking, color conversion and filters, and can be used directly for image preprocessing), and the enhanced version is used for self-supervised learning. During data preprocessing, because the cover image of part of the video content is missing, a blank cover filled with 0 is used and a mask is recorded; such blank covers must be masked out in the cover-text matching loss. The preprocessing also includes word segmentation of the text modal information: as shown in fig. 7, the Bert tokenizer (maximum length 64) segments the title and the topic label separately, and they are then separately input into the model.
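The preprocessing just described can be sketched roughly as follows (an assumed helper, not the original code), using the Albumentations package for cover enhancement and a Bert tokenizer with maximum length 64; the blank-cover size is an arbitrary placeholder:

```python
import numpy as np
import albumentations as A
from transformers import BertTokenizer

augment = A.Compose([A.Rotate(limit=30), A.GaussNoise(), A.ToGray(p=0.2)])
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def preprocess(cover_image, title: str, topic_tag: str):
    if cover_image is None:
        # Missing covers are replaced by an all-zero blank image, and the mask records that
        # this sample must be ignored in the cover-text matching loss.
        cover_image, cover_mask = np.zeros((224, 224, 3), dtype=np.uint8), 0
    else:
        cover_image, cover_mask = augment(image=cover_image)["image"], 1
    title_tokens = tokenizer(title, max_length=64, padding="max_length",
                             truncation=True, return_tensors="pt")
    tag_tokens = tokenizer(topic_tag, max_length=64, padding="max_length",
                           truncation=True, return_tensors="pt")
    return cover_image, cover_mask, title_tokens, tag_tokens
```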
As shown in fig. 7, the video-level multi-modal training model is constructed using the above data. The bottommost layer is the multi-modal input, including visual signals (the cover image and key frame information extracted from the video) and text signals (title, hashtag, comments, author information, OCR and ASR results extracted from the video content, etc.).
Step 602, inputting the cover image into a corresponding first image feature extraction network to obtain the cover image feature, and inputting the key frame image into a second image feature extraction network to obtain the key frame image feature.
Step 604, inputting the topic label into a corresponding first text feature extraction network to obtain topic label features, inputting user comment information into a corresponding second text feature extraction network to obtain comment text features, inputting author information into a corresponding third text feature extraction network to obtain author text features, inputting text recognition results in video into a fourth text feature extraction network to obtain text recognition result features, and inputting voice recognition results in video into a fifth text feature extraction network to obtain voice recognition result features.
Any of the texts can be input into its corresponding text feature extraction network, and a first text feature and a second text feature are output through different feature extraction layers of that text feature extraction network.
Step 606, fusing the specified text features and the specified image features to obtain fused features.
For example, a fusion feature can be obtained by fusing the cover image feature with the text recognition result feature, by fusing the key frame image feature with the text recognition result feature, by fusing the key frame image feature with the author text feature, or by fusing the key frame image feature with the comment text feature. One possible form of such fusion is sketched below.
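This sketch is an assumption for illustration; the application itself only specifies cross-modal fusion via a multi-head attention mechanism, not this exact module:

```python
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses a text feature sequence with an image feature sequence via multi-head attention."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_features, image_features):
        # text_features:  (batch, text_len, dim),  e.g. text recognition result features
        # image_features: (batch, image_len, dim), e.g. cover or key frame features
        fused, _ = self.attn(query=text_features, key=image_features, value=image_features)
        return fused.mean(dim=1)                     # pooled fusion feature
```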
And 608, performing a comparison learning task of the same mode, specifically calling a task prediction network corresponding to the comparison learning task to calculate the similarity distance of the positive sample pair, and obtaining the similarity loss according to the similarity distance of the positive sample pair.
The contrast learning task of the same mode comprises a contrast learning task of a text mode and a contrast learning task of an image mode.
Step 610, performing the multi-label classification task, specifically calling the task prediction network corresponding to the multi-label classification task to obtain predicted topic labels, and calculating the cross entropy loss according to the topic labels and the predicted topic labels.
Step 612, performing a cross-modal first matching prediction task of the image and the text, specifically, invoking a corresponding task prediction network to predict text features and a first prediction matching degree of the image features; and calculating the cross entropy loss according to the set matching degree and the first prediction matching degree.
Step 614, a cross-modal second matching prediction task of the tag and fusion features is performed. Specifically, a second prediction matching degree of the corresponding task prediction network prediction tag feature and the fusion feature is called, and cross entropy loss is calculated according to the set matching degree and the second prediction matching degree.
Step 616, cross-modal mask language prediction tasks are performed. Specifically, firstly, shielding processing is carried out on characters in an image to obtain a shielded image, character recognition processing is carried out on the shielded image to obtain a first character, and character recognition processing is carried out on the image before shielding processing to obtain a second character. And calling a prediction result of the corresponding task prediction network prediction shielding text, and calculating the cross entropy loss according to the prediction result and the second text.
The above-described plurality of prediction tasks may be performed in parallel.
And 618, back-propagating according to the loss of each prediction task, and updating the parameters of the multi-mode pre-training model.
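Conceptually, step 618 amounts to summing the losses of the parallel prediction tasks and back-propagating once; a hypothetical training-step helper (names assumed) is:

```python
import torch

def pretraining_step(task_losses, optimizer):
    # task_losses: the losses of all parallel prediction tasks for the current batch, e.g.
    # [contrast_loss, multilabel_loss, image_text_match_loss, tag_fusion_match_loss, mlm_loss].
    total_loss = torch.stack(task_losses).sum()        # the tasks share a single backward pass
    optimizer.zero_grad()
    total_loss.backward()                              # back propagation updates the model parameters
    optimizer.step()
    return total_loss.item()
```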
As shown in fig. 7, in order to achieve better performance of the pre-training model, the whole model adopts multiple optimization schemes and multiple pre-training tasks, including a visual Transformer (combining the cover image and multiple video frames), a text Transformer structure (covering the title, hashtag, author information, user comments, etc.), and feature-layer cross-modal fusion. Specifically: for modal training, a Transformer structure is used for vision; compared with a CNN it can capture the relationship between local and global features, and it keeps the visual model isomorphic with the text modal model, which facilitates subsequent multi-modal fusion. For modal fusion, the traditional approach is simple concatenation and matrix operations, whereas here fusion is performed cross-modally with a multi-head attention mechanism, and image-text features obtained at different scales are fused starting from the bottom layers of the model, so that low-level multi-modal features participate more fully in the model's representation. The topmost multi-task pre-training stage, like other pre-training models, also uses self-supervised learning as a model training task. For self-supervision of the image and text modes, joint pre-training tasks across different modes are designed to fully learn the relationships between the modes. Among supervised tasks, the model mainly focuses on the multi-label classification task.
As shown in fig. 7, the pre-training tasks include the cross-modal matching prediction task of images and texts, the similarity loss before and after multi-modal self-supervised transformation, contrast learning of fused modalities, masked language modeling with MLM (Masked Language Model), and prediction of HashTag categories. With a single model acting on multiple modes and combinations of modes, multiple multi-modal pre-training tasks and multiple loss constraints are designed, so that the final model learns the various semantic information in the corpus fully and from multiple angles.
The backbone network for extracting content features adopts a Transformer structure, which discards the traditional CNN and RNN; the whole network structure is composed entirely of attention mechanisms. The Transformer consists only of self-attention and feed-forward neural networks. Through the above multiple self-supervised pre-training tasks, a framework for extracting multi-modal features of large-scale video content is constructed, so that video content can be mapped into a space whose semantic information represents the video content, providing a very good base model for subsequently constructing various models for video understanding services and video content embedding.
After the pre-training is performed to obtain a pre-trained multi-modal pre-training model, in a second stage, step 620, fine tuning is performed on the pre-trained multi-modal pre-training model based on the target training data of the target task to obtain a target task model.
As an application example of the application applied to the image-text retrieval scene, after the pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is finely tuned according to the target training data of the image-text retrieval, and the task model of the image-text retrieval is obtained.
As an application example of the method applied to content classification, after a pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is subjected to fine adjustment according to target training data of the content classification, and a task model of the content classification is obtained.
As an application example of the method applied to content recommendation, after a pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is subjected to fine adjustment according to target training data of content recommendation, and a task model of content recommendation is obtained.
As an application example of the method applied to content quality detection, after a pre-trained multi-mode pre-training model is obtained, the multi-mode pre-training model is subjected to fine adjustment according to target training data of content quality, and a task model of content quality detection is obtained.
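The second-stage fine tuning can be sketched as follows; the backbone interface, data loader and hyper-parameters are assumptions for illustration, not the recorded implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune(pretrained_backbone: nn.Module, fusion_dim: int, num_target_classes: int,
             target_loader, epochs: int = 3, lr: float = 2e-5):
    # A new task prediction network replaces the pre-training heads, e.g. for content
    # classification or content quality detection.
    task_head = nn.Linear(fusion_dim, num_target_classes)
    params = list(pretrained_backbone.parameters()) + list(task_head.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for text_inputs, image_inputs, labels in target_loader:
            fused = pretrained_backbone(text_inputs, image_inputs)   # fusion feature
            loss = F.cross_entropy(task_head(fused), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return pretrained_backbone, task_head
```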
After the target task model is obtained, the third stage comprises the following steps:
step 622, obtaining multi-modal information of the video to be predicted, wherein the multi-modal information includes text modal information of at least one dimension and image modal information of at least one dimension.
Specifically, multi-modal information of the video to be predicted is obtained, wherein the multi-modal information comprises text modal information of at least one dimension, such as topic labels, titles, authors, text recognition results, voice recognition results and user comments of the video. The multimodal information also includes image modality information of at least one dimension, such as a cover image and/or a key frame image. The cover image may be selected by the author at the time of distribution and if not selected, is typically the first frame of the video content.
Step 624, inputting the multimodal information of the video into the feature extraction network of each dimension of the target task model, obtaining text features of at least one dimension and image features of at least one dimension, and outputting a prediction result of the target task according to the text features of at least one dimension and the image features of at least one dimension through the task prediction network of the target task model.
As an application example of the method applied to content classification, multimodal information of a video is respectively input into feature extraction networks of each dimension of a task model of the content classification to obtain text features of at least one dimension and image features of at least one dimension, and a task prediction network of the task model of the content classification outputs a prediction result of the content classification according to the text features of at least one dimension and the image features of at least one dimension.
As an application example of the method applied to content quality detection, multimodal information of a video is respectively input into feature extraction networks of each dimension of a task model of content quality detection to obtain text features of at least one dimension and image features of at least one dimension, and a task prediction network of the task model of content quality detection outputs a prediction result of content quality according to the text features of at least one dimension and the image features of at least one dimension.
According to the task prediction method based on the video multi-modal information, the multi-modal information of the video is input into the target task network, the prediction result of the target task can be obtained through the target task network, the target task model is obtained by fine adjustment of the multi-modal pre-training model which is achieved through pre-training, the feature learning corresponding to the target task can be achieved only by fine adjustment of the multi-modal pre-training model which is achieved through the target training data which is needed by the target task, the target task model is obtained, the sample quantity and the labeling cost of iteration of a service model can be effectively reduced in the training process of the target task model, and the training efficiency of the target task model is improved. In the pre-training process, the multi-modal model is pre-trained to obtain a plurality of prediction tasks, and the weak supervision signals of the prediction tasks are determined based on the multi-modal information of the video in the pre-training data, so that the pre-training mode is self-supervision training, a large number of supervision samples do not need to be marked manually, the learning cost is reduced, and the learning efficiency is improved.
In summary, task prediction based on video multi-mode information provided by the embodiment of the application has the following beneficial effects:
(1) Providing a large-scale video content pre-training basic model, and performing business model modeling fine tuning on the basis of the basic model to effectively reduce the sample size and the labeling cost required by business model iteration, accelerate the development progress of a video multi-mode business algorithm model, promote the modeling effect and shorten the model development time;
(2) The high-quality multi-modal video content Embedding obtained by pre-training provides new possibilities for recommendation recall, for example using the Embedding to accurately retrieve, from the user's consumption history, the works most relevant to the work currently being estimated, thereby improving the semantic matching degree, estimating from the result how interested the user is in the work to be estimated, and further improving core indexes such as the average video watching duration through the combination of long-period interests.
As shown in fig. 8, the functions of each service module in the task prediction system based on video multi-mode information according to the embodiment of the present application will be described below.
First, content production and consumption end
(1) PGC (professionally generated content) or UGC (user generated content) and MCN content producers provide image-text or video content through the mobile terminal or the back-end interface API system; these are the main content sources for recommended distribution;
(2) The producer communicates with the uplink and downlink content interface service; video content is usually published from a shooting end, and during shooting the local video content can be matched with music, filter templates, video beautifying functions and the like;
(3) As a consumer, the terminal communicates with the uplink and downlink content interface server to obtain, through recommendation push, the index information of the content to be accessed, and then communicates with the content storage server to obtain the corresponding content, including recommended content and subscribed content; the content storage server stores content entities such as the video source file and the cover image source file, while meta information of the content such as title, author, cover image, classification and tag information is stored in the content database;
(4) Meanwhile, user playback behavior data generated during uploading and downloading, such as stalling, loading time and play clicks, is reported to the back end for statistical analysis;
(5) The consumption end generally consumes the content data in a Feeds stream mode, and then continuously refreshes the content data in a sliding mode, so that more recommended results are obtained from the server, and the content data is similar to a waterfall stream mode;
Second, uplink and downlink content interface server
(1) Communicates directly with the content production end, and stores the content submitted from the front end, typically the title, publisher, abstract, cover image and release time of the content, into the content database;
(2) Writing meta information of video content, such as file size, code rate, resolution, title, release time, author, etc. into a content database;
(3) Synchronizing the issued submitted content to a dispatching center server for subsequent content processing and circulation;
Third, content database
(1) The core database of content, in which the meta information of the content released by all producers is stored, focuses on the meta information of the content itself, such as file size, cover image link, code rate, file format, title, release time, author, video file size, video format, and whether the content is marked as original or as a first publication; it further includes the classification of the content produced in the manual auditing process (including first, second and third level classifications and label information; for example, for a video about a mobile phone of a certain brand, the first level classification is science and technology, the second level classification is smart phone, the third level classification is domestic mobile phone, and the label information is the phone model of that brand);
(2) The information in the content database is read in the manual auditing process, and meanwhile, the result and the state of the manual auditing are returned to the content database;
(3) The scheduling center's processing of content mainly includes machine processing and manual auditing. The core machine processing performs various quality judgments such as low-quality filtering, content labeling such as classification and tag information, and content de-duplication; the results are written into the content database, so that the same content does not undergo repeated secondary manual processing;
Fourth, scheduling center & manual auditing system
(1) The whole dispatching process of the content circulation is responsible for receiving the content in storage through the uplink and downlink content interface servers, and then acquiring meta-information of the content from a content database;
(2) Schedules the manual auditing system and the machine processing system, and controls the scheduling order and priority;
(3) Content is enabled through a manual auditing system, and then is provided to content consumers of the terminal through a content outlet distribution service (usually a recommendation engine or a search engine or operation) directly on a display page, namely content index information obtained by a consumer terminal;
(4) The manual auditing system is the carrier of manual service capability; it is mainly used for auditing and filtering content that machines cannot determine with certainty, such as sensitive or legally disallowed content, and for the secondary confirmation and labeling of the classification labels of video content;
Fifth, content storage service
(1) Content entity information other than meta information of the content, such as a video source file and a cover map source file of the content;
(2) The terminal directly downloads the corresponding file through the content storage server to play and display through the access address of the content;
Sixth, content processing sample library
(1) The atomic models corresponding to various video content processing services in the video understanding service are as follows: video sharpness, video black and white edges, video discomfort, and the like. The modeling of the video content business problems all needs to mark samples, and a small amount of business samples needed in the final business modeling stage are saved to be used for fine tuning the final business model on the basis of matching with the large-scale video pre-training model;
Seventh, video content processing business model and service
(1) Models corresponding to various atomic capacities exist in the information flow video content processing process, an atomic feature library for understanding various video contents is generated, and the corresponding atomic models are as follows: video sharpness, video black and white edges, video discomfort, and the like.
(2) Meanwhile, service models obtained based on a large-scale pre-training model and fine tuning learning are served and communicated with a dispatching center service to finish generation of atomic characteristics of video content;
(3) Receiving the dispatching of the dispatching center server, completing various processes of video content service application, and storing the processing results in a content database;
Eighth, video multi-mode pre-training model
(1) Collecting large-scale video pre-training data and auxiliary information according to the method, and carrying out necessary cleaning, filtering and enhancing treatment on the data;
(2) Then, based on the collected data, completing the construction of a large-scale multi-mode video content pre-training model according to the construction mode of the pre-training model structure and the pre-training task by a method combining self-supervision and weak supervision;
(3) The constructed video content business atomic service model is based on the multi-mode video content pre-training model to carry out Finetune, and meanwhile, a large model can be distilled to obtain a final model, and in the fine tuning process, a small amount of supervision sample data of a business scene can be used for fine tuning, so that the development progress of the model is accelerated;
Ninth, video pre-training content library
Storing a corresponding video pre-training data corpus crawled from the Internet, mainly retrieving the disclosed video data through a search engine by Query words, and also comprising the pre-training video data and auxiliary text information in the service field collected by various ways;
Tenth, crawling and data preprocessing system
Corresponding video data is crawled from the Internet through Query terms constructed from the cleaned Hashtags of the information flow video content, according to the method described above.
According to the method, topic labels of information flow video content release are utilized, video content comment information, video content description text semantic extraction information (such as title) and the like of a user are used as weak supervision signals, meanwhile, based on different modes, different pre-training tasks are designed based on comparison learning and weak supervision learning, a video-level pre-training model based on large-scale video content is built to serve as a basic feature extraction model of video understanding service modeling, and the video-level pre-training model is used as a basic model of service field modeling to extract features of videos.
The method can provide a large-scale video content pre-training base model, and business model fine tuning is carried out on the basis of this base model, so that the sample size and the labeling cost required by business model iteration are effectively reduced. The multidimensional video data specific to the service field can be fully exploited for learning in advance, which accelerates the development of video multi-modal business algorithm models, improves the modeling effect and shortens model development time; for example, with the features extracted by the multi-modal video pre-training base model, the traffic that a newly distributed video can obtain can be estimated quickly by fine tuning, so that high-quality videos obtain better traffic as early as possible and the cold start of video content is accelerated. In addition, the high-quality multi-modal video pre-training Embedding of video content provides new possibilities for recommendation recall, for example using the Embedding to accurately retrieve, from the user's consumption history, the works most relevant to the work currently being estimated, improving the semantic matching degree, estimating from the result how interested the user is in the work to be estimated, and further improving core indexes such as the average video watching duration through the combination of long-period interests.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a task prediction device based on the video multi-mode information, which is used for realizing the task prediction method based on the video multi-mode information. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the task prediction device based on the video multimodal information provided below may be referred to as the limitation of the task prediction method based on the video multimodal information, and will not be repeated herein.
In one embodiment, as shown in fig. 9, there is provided a task prediction apparatus based on video multimodal information, including:
the acquiring module 902 is configured to acquire multimodal information of a video, where the multimodal information includes text modality information of at least one dimension and image modality information of at least one dimension.
The prediction module 904 is configured to input multimodal information of a video into feature extraction networks of each dimension of a target task model, obtain text features of at least one dimension and image features of at least one dimension, and output a prediction result of a target task according to the text features of at least one dimension and the image features of at least one dimension through a task prediction network of the target task model;
The target task model is obtained by fine tuning a multi-mode pre-training model which is subjected to pre-training, the multi-mode pre-training model is obtained by pre-training a plurality of prediction tasks on the pre-training model by utilizing multi-mode pre-training data, and weak supervision signals of all the prediction tasks are determined based on multi-mode information of videos in the pre-training data.
According to the task prediction device based on the video multi-modal information, the multi-modal information of the video is input into the target task network, the prediction result of the target task can be obtained through the target task network, the target task model is obtained by fine adjustment of the multi-modal pre-training model which is achieved through pre-training, so that feature learning corresponding to the target task can be achieved only by fine adjustment of the multi-modal pre-training model which is achieved through the target training data which is needed by the target task, the target task model is obtained, the sample quantity and the labeling cost for iteration of a service model can be effectively reduced in the training process of the target task model, the training efficiency of the target task model is improved, and the model online efficiency is further improved. In the pre-training process, the multi-modal model is pre-trained to obtain a plurality of prediction tasks, and the weak supervision signals of the prediction tasks are determined based on the multi-modal information of the video in the pre-training data, so that the pre-training mode is self-supervision training, a large number of supervision samples do not need to be marked manually, the learning cost is reduced, and the learning efficiency is improved.
In one embodiment, a task prediction apparatus based on video multimodal information includes:
the data acquisition module is used for acquiring pre-training data, wherein the pre-training data comprises multi-mode information of a video sample;
the pre-training module is used for determining weak supervision signals of all prediction tasks based on multi-mode information of the video sample and pre-training a plurality of prediction tasks on the pre-training model;
and the fine tuning module is used for fine tuning the multi-mode pre-training model after pre-training based on the target training data of the target task to obtain a target task model.
Wherein, pretraining module includes:
the feature extraction module is used for respectively inputting the multi-modal information of the video sample into the feature extraction network of each dimension of the pre-training model to obtain text features of at least one dimension and image features of at least one dimension.
And the fusion module is used for fusing the text features of the specified dimension corresponding to the prediction task and the image features of the specified dimension to obtain at least one fusion feature required by the prediction task.
And the weak supervision signal determining module is used for respectively determining weak supervision signals of all the prediction tasks according to at least one of text features, image features and fusion features designated by the prediction tasks.
The multi-task prediction module is used for inputting input features required by each prediction task into a task prediction network corresponding to the prediction task, respectively carrying out the prediction tasks, and calculating the loss of each prediction task according to the weak supervision signals; the number of the prediction tasks is at least two; the input feature is at least one of a text feature, an image feature, and a fusion feature.
And the parameter adjusting module is used for carrying out back propagation according to the loss of each prediction task and updating the parameters of the pre-training model.
In another embodiment, the weak supervisory signal determining module is configured to construct a positive sample pair according to at least two features of the same modality specified by the contrast learning task if the prediction task is the contrast learning task of the same modality, and use the similarity of the positive sample pair as the weak supervisory signal.
In another embodiment, if the prediction task is a contrast learning task of the same mode, the weak supervisory signal determining module is configured to perform data enhancement processing on an original mode feature of at least one dimension of the mode if the prediction task is a contrast learning task of the same mode, and construct a positive sample pair according to the original mode feature and the feature after data enhancement, or output two features as a positive sample pair by different feature extraction layers of a feature extraction network of at least one dimension of the mode, and use a similarity of the positive sample pair as a weak supervisory signal.
The multi-task prediction module is used for inputting the positive sample pair into a task prediction network corresponding to the comparison learning task to calculate the similarity distance of the positive sample pair; and obtaining the similarity loss according to the similarity distance of the positive sample pair.
In another embodiment, the weak supervisory signal determining module is configured to determine the weak supervisory signal according to text features specified by the cross-modal training task if the predicted task is the cross-modal training task.
In another embodiment, when the cross-modal training task is a multi-label classification task, the multi-label classification task designates the text feature as a weak supervisory signal as a topic label.
The multi-task prediction module is used for inputting the fusion characteristics into a task prediction network corresponding to the multi-label classification task to obtain a predicted topic label; from the topic label and the predicted topic label, a cross entropy loss is calculated.
In another embodiment, when the cross-modal training task is a cross-modal first matching prediction task of the image and the text, the set matching degree of the text feature and the image feature required by the first matching prediction task is used as a weak supervision signal.
The multi-task prediction module is used for respectively inputting text features and image features required by the first matching prediction task into the corresponding task prediction network to predict the first prediction matching degree of the text features and the image features; and calculating the cross entropy loss according to the set matching degree and the first prediction matching degree.
In another embodiment, when the cross-modal training task is a cross-modal second matching prediction task of the topic label and the fusion feature, the set matching degree of the fusion feature and the topic label required by the second matching prediction task is used as a weak supervision signal.
The multi-task prediction module is used for respectively inputting fusion features and topic labels required by a second matched prediction task into a corresponding task prediction network and predicting second predicted matching degrees of topic label features and fusion features; and calculating the cross entropy loss according to the set matching degree and the second prediction matching degree.
When the cross-modal prediction task is a cross-modal mask language prediction task, the weak supervisory signal designated by the mask language prediction task is a text feature before mask processing.
The task prediction apparatus based on the video multimodal information further includes:
and the shielding processing module is used for shielding the characters in the image to obtain a shielded image.
The recognition module is used for carrying out word recognition processing on the image after shielding to obtain first words, and carrying out word recognition processing on the image before shielding to obtain second words.
The multi-task prediction module is used for inputting the characteristics of the first characters into a prediction task network corresponding to the masking language prediction task to obtain a prediction result of the masking characters; and calculating the cross entropy loss according to the prediction result and the second text.
In another embodiment, the fusion module is configured to fuse the image feature of the cover and the text modal feature of the non-topic label to obtain a fused feature, or fuse the image feature of the key frame and the text modal feature of the non-topic label to obtain a fused feature; the text modality features of the non-topic tag include video titles, author information of the video, user comments of the video, text recognition results of the video, and voice recognition results of the video.
The above-described modules in the video multimodal information based task prediction apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store training data and multimodal pre-training models. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a task prediction method based on video multimodal information.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement the task prediction method based on the video multimodal information of the above embodiments.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the video multimodal information based task prediction method of the above embodiments.
In one embodiment, a computer program product is provided, comprising a computer program that when executed by a processor implements the video multimodal information based task prediction method of the embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (12)

1. A task prediction method based on video multi-modal information, the method comprising:
acquiring multi-modal information of a video, wherein the multi-modal information comprises text modal information of at least one dimension and image modal information of at least one dimension;
respectively inputting the multi-modal information of the video into feature extraction networks of each dimension of a target task model to obtain text features of at least one dimension and image features of at least one dimension, and outputting a prediction result of a target task according to the text features of at least one dimension and the image features of at least one dimension through a task prediction network of the target task model;
wherein the target task model is obtained by fine-tuning a pre-trained multi-modal pre-training model, the multi-modal pre-training model is obtained by pre-training a pre-training model on a plurality of prediction tasks by utilizing multi-modal pre-training data, and the weak supervision signals of the prediction tasks are determined based on multi-modal information of videos in the pre-training data.
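By way of illustration and not limitation, a minimal sketch of the prediction flow of claim 1 follows, assuming a PyTorch-style implementation with one text dimension (a title) and one image dimension (a cover image); the sub-network choices, feature sizes and names are hypothetical:

    import torch
    import torch.nn as nn

    class TargetTaskModel(nn.Module):
        """Per-dimension feature extraction plus a task prediction network (illustrative only)."""
        def __init__(self, vocab_size=30000, num_classes=10, dim=128):
            super().__init__()
            # Feature extraction network for one text dimension (video title).
            self.title_encoder = nn.EmbeddingBag(vocab_size, dim)
            # Feature extraction network for one image dimension (cover image).
            self.cover_encoder = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim),
            )
            # Task prediction network over the concatenated text and image features.
            self.task_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

        def forward(self, title_token_ids, cover_image):
            text_feat = self.title_encoder(title_token_ids)   # text features of one dimension
            image_feat = self.cover_encoder(cover_image)      # image features of one dimension
            fused = torch.cat([text_feat, image_feat], dim=-1)
            return self.task_head(fused)                      # prediction result of the target task

    # Hypothetical usage: a batch of 2 videos, each with a tokenized title and a cover image.
    model = TargetTaskModel()
    logits = model(torch.randint(0, 30000, (2, 16)), torch.rand(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 10])

In an actual deployment the per-dimension encoders would typically be pre-trained text and image backbones rather than the toy networks above; the sketch only illustrates the data flow of the claim.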
2. The method according to claim 1, wherein obtaining the target task model comprises:
acquiring pre-training data, wherein the pre-training data comprises multi-modal information of a video sample;
determining weak supervision signals of each prediction task based on the multi-modal information of the video sample, and pre-training a pre-training model on a plurality of prediction tasks;
and fine-tuning the pre-trained multi-modal pre-training model based on target training data of the target task to obtain the target task model.
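A minimal sketch of the fine-tuning step of claim 2 is shown below; the optimizer, loss function and single prediction head are assumptions for illustration only:

    import torch

    def finetune(pretrained_model, target_head, target_loader, epochs=3, lr=1e-4):
        """Fine-tune the pre-trained multi-modal model on target training data (sketch)."""
        params = list(pretrained_model.parameters()) + list(target_head.parameters())
        optimizer = torch.optim.AdamW(params, lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for batch, labels in target_loader:        # labelled data of the target task
                features = pretrained_model(batch)     # reuse the pre-trained feature extraction networks
                loss = loss_fn(target_head(features), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return pretrained_model, target_head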
3. The method of claim 2, wherein determining the weak supervision signals of each prediction task based on the multi-modal information of the video and pre-training the pre-training model on a plurality of the prediction tasks comprises:
respectively inputting the multi-modal information of the video sample into feature extraction networks of each dimension of the pre-training model to obtain text features of at least one dimension and image features of at least one dimension;
fusing text features of specified dimensions corresponding to the prediction tasks and image features of the specified dimensions to obtain at least one fusion feature required by the prediction tasks;
respectively determining weak supervision signals of the prediction tasks according to at least one of the text features, the image features and the fusion features specified by the prediction tasks;
inputting the input features required by each prediction task into a task prediction network corresponding to the prediction task, respectively performing the prediction tasks, and calculating the loss of each prediction task according to the weak supervision signals, wherein the number of the prediction tasks is at least two, and the input features are at least one of the text features, the image features and the fusion features;
and performing back propagation according to the loss of each prediction task to update the parameters of the pre-training model.
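For illustration, the pre-training loop of claim 3 could be organised as in the following sketch, in which the task set, the construction of the weak supervision signals and the loss functions are all assumed callables:

    import torch

    def pretrain_step(model, optimizer, batch, task_heads, weak_signal_fns, loss_fns):
        """One multi-task pre-training step (sketch; all components are illustrative assumptions).

        model           : returns per-dimension text/image features and fusion features for the batch
        task_heads      : dict name -> task prediction network
        weak_signal_fns : dict name -> callable building the weak supervision signal from the features
        loss_fns        : dict name -> callable(prediction, weak_signal) -> scalar loss
        """
        features = model(batch)                      # text/image features per dimension + fusion features
        total_loss = 0.0
        for name, head in task_heads.items():        # at least two prediction tasks
            weak_signal = weak_signal_fns[name](features)   # derived from the video's own multi-modal information
            prediction = head(features)                      # task prediction network for this task
            total_loss = total_loss + loss_fns[name](prediction, weak_signal)
        optimizer.zero_grad()
        total_loss.backward()                        # back propagation over the summed task losses
        optimizer.step()                             # update the parameters of the pre-training model
        return float(total_loss)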
4. The method according to claim 3, wherein the respectively determining the weak supervision signals of each prediction task according to at least one of the text features, the image features and the fusion features specified by the prediction task comprises:
if the prediction task is a contrastive learning task of the same modality, constructing a positive sample pair according to at least two features of the same modality specified by the contrastive learning task, and taking the similarity of the positive sample pair as a weak supervision signal.
5. The method of claim 4, wherein, if the prediction task is a contrastive learning task of the same modality, constructing a positive sample pair according to at least two features of the same modality specified by the contrastive learning task and taking the similarity of the positive sample pair as a weak supervision signal comprises:
if the prediction task is a contrastive learning task of the same modality, performing data augmentation on original modality features of at least one dimension of the modality and constructing a positive sample pair from the original modality features and the augmented features, or taking as a positive sample pair two features output by different feature extraction layers of a feature extraction network of at least one dimension of the modality, and taking the similarity of the positive sample pair as a weak supervision signal;
and the inputting of the input features required by each prediction task into the task prediction network corresponding to the prediction task, respectively performing the prediction tasks, and calculating the loss of each prediction task according to the weak supervision signals comprises:
inputting the positive sample pair into a task prediction network corresponding to the contrastive learning task to calculate the similarity distance of the positive sample pair;
and obtaining the similarity loss according to the similarity distance of the positive sample pair.
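For illustration, a minimal sketch of the same-modality contrastive task of claims 4 and 5 follows, assuming Gaussian-noise data augmentation and cosine similarity as the similarity measure:

    import torch
    import torch.nn.functional as F

    def same_modality_similarity_loss(original_feat, noise_std=0.1):
        # Build a positive pair from the original modality feature and a data-augmented copy
        # (Gaussian noise is an assumed augmentation; the claims also allow pairing the outputs
        # of two different feature extraction layers instead).
        augmented_feat = original_feat + noise_std * torch.randn_like(original_feat)
        # The similarity of the positive pair acts as the weak supervision signal;
        # the loss is its similarity distance (1 - cosine similarity), to be minimised.
        cos_sim = F.cosine_similarity(original_feat, augmented_feat, dim=-1)
        return (1.0 - cos_sim).mean()

    # Hypothetical usage: a batch of 4 text features of dimension 128.
    print(float(same_modality_similarity_loss(torch.randn(4, 128))))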
6. The method according to claim 3, wherein the respectively determining the weak supervision signals of each prediction task according to at least one of the text features, the image features and the fusion features specified by the prediction task comprises:
if the prediction task is a cross-modal training task, determining a weak supervision signal according to text features specified by the cross-modal training task.
7. The method of claim 6, wherein, when the cross-modal training task is a multi-label classification task, the text features that the multi-label classification task designates as weak supervision signals are topic labels;
and the inputting of the input features required by each prediction task into the task prediction network corresponding to the prediction task, respectively performing the prediction tasks, and calculating the loss of each prediction task according to the weak supervision signals comprises:
inputting the fusion features into a task prediction network corresponding to the multi-label classification task to obtain predicted topic labels;
and calculating a cross-entropy loss according to the topic labels and the predicted topic labels.
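A minimal sketch of the multi-label classification task of claim 7 follows, assuming a linear task prediction network and binary cross-entropy as the multi-label form of the cross-entropy loss; sizes and names are hypothetical:

    import torch
    import torch.nn as nn

    fusion_dim, num_topics, batch = 256, 50, 4
    topic_head = nn.Linear(fusion_dim, num_topics)                    # task prediction network
    fusion_features = torch.randn(batch, fusion_dim)                  # fused text + image features
    topic_labels = torch.randint(0, 2, (batch, num_topics)).float()   # weak supervision signal (topic labels)

    predicted_logits = topic_head(fusion_features)                    # predicted topic labels (logits)
    loss = nn.BCEWithLogitsLoss()(predicted_logits, topic_labels)
    print(float(loss))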
8. The method according to claim 6, wherein, when the cross-modal training task is a cross-modal first matching prediction task between images and texts, set matching degrees of the text features and the image features required by the first matching prediction task serve as weak supervision signals;
and the inputting of the input features required by each prediction task into the task prediction network corresponding to the prediction task, respectively performing the prediction tasks, and calculating the loss of each prediction task according to the weak supervision signals comprises:
respectively inputting the text features and the image features required by the first matching prediction task into a corresponding task prediction network, and predicting a first prediction matching degree between the text features and the image features;
and calculating a cross-entropy loss according to the set matching degree and the first prediction matching degree.
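For illustration, the first matching prediction task of claim 8 could be sketched as follows, with the set matching degree (1 for a true text-image pair from the same video, 0 for a shuffled pair) assumed as the weak supervision signal; the scoring network and sizes are hypothetical:

    import torch
    import torch.nn as nn

    class MatchingHead(nn.Module):
        # Scores the matching degree of a text feature and an image feature.
        def __init__(self, dim=128):
            super().__init__()
            self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

        def forward(self, text_feat, image_feat):
            return self.scorer(torch.cat([text_feat, image_feat], dim=-1)).squeeze(-1)

    head = MatchingHead()
    text_feat, image_feat = torch.randn(4, 128), torch.randn(4, 128)
    set_matching_degree = torch.tensor([1.0, 1.0, 0.0, 0.0])      # weak supervision signal
    predicted_degree = head(text_feat, image_feat)                 # first prediction matching degree (logits)
    loss = nn.BCEWithLogitsLoss()(predicted_degree, set_matching_degree)
    print(float(loss))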
9. The method according to claim 6, wherein, when the cross-modal training task is a cross-modal second matching prediction task between topic labels and fusion features, set matching degrees of the fusion features required by the second matching prediction task and the topic labels serve as weak supervision signals;
and the inputting of the input features required by each prediction task into the task prediction network corresponding to the prediction task, respectively performing the prediction tasks, and calculating the loss of each prediction task according to the weak supervision signals comprises:
respectively inputting the fusion features required by the second matching prediction task and the topic labels into a corresponding task prediction network, and predicting a second prediction matching degree between the topic label features and the fusion features;
and calculating a cross-entropy loss according to the set matching degree and the second prediction matching degree.
10. The method of claim 6, wherein, when the cross-modal training task is a cross-modal masked language prediction task, the weak supervision signal specified by the masked language prediction task is the text feature before masking;
the method further comprises:
masking the characters in the image to obtain a masked image;
performing text recognition on the masked image to obtain first text, and performing text recognition on the image before masking to obtain second text;
and the inputting of the input features required by each prediction task into the task prediction network corresponding to the prediction task, respectively performing the prediction tasks, and calculating the loss of each prediction task according to the weak supervision signals comprises:
inputting the features of the first text into a task prediction network corresponding to the masked language prediction task to obtain a prediction result for the masked characters;
and calculating a cross-entropy loss according to the prediction result and the second text.
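A minimal sketch of the masked language prediction task of claim 10 follows; the text recognition step is represented by a hypothetical stand-in function, and the masking region, vocabulary size and prediction head are assumptions for illustration:

    import torch
    import torch.nn as nn

    def recognize_text(image):
        """Hypothetical OCR stand-in; a real system would use an actual text recognition model."""
        # Returns token ids of the text visible in the image (dummy values for the sketch).
        return torch.randint(0, 1000, (image.shape[0], 8))

    vocab_size, dim = 1000, 64
    images = torch.rand(2, 3, 224, 224)
    masked_images = images.clone()
    masked_images[:, :, 100:140, 60:200] = 0.0        # mask a text region in the image

    first_text = recognize_text(masked_images)        # text recognition on the masked image
    second_text = recognize_text(images)              # text recognition on the image before masking

    embed = nn.Embedding(vocab_size, dim)
    mlm_head = nn.Linear(dim, vocab_size)             # task prediction network
    logits = mlm_head(embed(first_text))              # predict the masked characters
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), second_text.reshape(-1))
    print(float(loss))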
11. The method according to claim 3, wherein the fusing of the text features of specified dimensions and the image features of specified dimensions corresponding to each prediction task to obtain at least one fusion feature required by the prediction task comprises at least one of the following manners:
a first manner: fusing cover image features and text modal features other than topic labels to obtain fusion features;
a second manner: fusing key-frame image features and text modal features other than topic labels to obtain fusion features; the text modal features other than topic labels comprise the video title, author information of the video, user comments of the video, text recognition results of the video and speech recognition results of the video.
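For illustration, the two fusion manners of claim 11 could be sketched as simple concatenations of assumed per-dimension features; the dimensions and the concatenation-only fusion are minimal assumptions:

    import torch

    dim = 128
    cover_feat = torch.randn(1, dim)        # cover image features (first manner)
    keyframe_feat = torch.randn(1, dim)     # key-frame image features (second manner)
    text_feats = {
        "title": torch.randn(1, dim),
        "author_info": torch.randn(1, dim),
        "user_comments": torch.randn(1, dim),
        "ocr_text": torch.randn(1, dim),
        "asr_text": torch.randn(1, dim),
    }
    non_topic_text = torch.cat(list(text_feats.values()), dim=-1)   # text modal features other than topic labels

    fusion_first = torch.cat([cover_feat, non_topic_text], dim=-1)      # first manner
    fusion_second = torch.cat([keyframe_feat, non_topic_text], dim=-1)  # second manner
    print(fusion_first.shape, fusion_second.shape)   # torch.Size([1, 768]) each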
12. A task prediction apparatus based on video multi-modal information, the apparatus comprising:
an acquisition module, configured to acquire multi-modal information of a video, wherein the multi-modal information comprises text modal information of at least one dimension and image modal information of at least one dimension;
a prediction module, configured to respectively input the multi-modal information of the video into feature extraction networks of each dimension of a target task model to obtain text features of at least one dimension and image features of at least one dimension, and to output a prediction result of a target task according to the text features of at least one dimension and the image features of at least one dimension through a task prediction network of the target task model;
wherein the target task model is obtained by fine-tuning a pre-trained multi-modal pre-training model, the multi-modal pre-training model is obtained by pre-training a pre-training model on a plurality of prediction tasks by utilizing multi-modal pre-training data, and the weak supervision signals of the prediction tasks are determined based on multi-modal information of videos in the pre-training data.
CN202211422492.4A 2022-11-14 2022-11-14 Task prediction method and device based on video multi-mode information Pending CN116975615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211422492.4A CN116975615A (en) 2022-11-14 2022-11-14 Task prediction method and device based on video multi-mode information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211422492.4A CN116975615A (en) 2022-11-14 2022-11-14 Task prediction method and device based on video multi-mode information

Publications (1)

Publication Number Publication Date
CN116975615A 2023-10-31

Family

ID=88480303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211422492.4A Pending CN116975615A (en) 2022-11-14 2022-11-14 Task prediction method and device based on video multi-mode information

Country Status (1)

Country Link
CN (1) CN116975615A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028235A (en) * 2021-10-26 2023-04-28 腾讯科技(深圳)有限公司 Self-media information processing method and device, electronic equipment and storage medium
CN117573925A (en) * 2024-01-15 2024-02-20 腾讯科技(深圳)有限公司 Method and device for determining predicted playing time, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication