CN113762322A - Video classification method, device and equipment based on multi-modal representation and storage medium - Google Patents

Video classification method, device and equipment based on multi-modal representation and storage medium

Info

Publication number
CN113762322A
Authority
CN
China
Prior art keywords: video, mode, model, feature, vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110436918.0A
Other languages
Chinese (zh)
Inventor
李伟康
陈小帅
孙星海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd filed Critical Tencent Technology Beijing Co Ltd
Priority to CN202110436918.0A
Publication of CN113762322A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application discloses a video classification method, device and equipment based on multi-modal representation and a storage medium, relates to the technical field of artificial intelligence, and is used for reducing the difficulty of model learning and improving the efficiency of model training. The method comprises the following steps: inputting data information of each modality of a target video into a trained target multi-modal video representation model; and obtaining the video service category of the target video output by the target multi-modal video representation model in a target service scene. The target multi-modal video representation model is obtained by performing video-domain adaptive pre-training on the basis of base video data sample sets corresponding to the respective modalities and then retraining on the basis of video service data sample sets corresponding to the respective modalities in the target service scene, where each base video data sample set comprises base video data samples of one and the same modality, and each video service data sample set comprises video service data samples of one and the same modality.

Description

Video classification method, device and equipment based on multi-modal representation and storage medium
Technical Field
The application relates to the technical field of computers, in particular to the technical field of Artificial Intelligence (AI), and provides a video classification method, device and equipment based on multi-modal representation and a storage medium.
Background
With the development of AI technology, AI has become involved in the various video service scenes of video platforms, such as video recommendation classification, video operation classification, and title-party (clickbait) identification service scenes. Taking video operation classification as an example, the operation category of a video can be identified intelligently through AI technology, which then assists operators in video distribution.
Generally, a video involves multi-modal content such as the text, pictures and audio in the video, and videos are therefore usually represented on the basis of multi-modal data, that is, a multi-modal representation model is constructed based on the multi-modal data of the videos.
Disclosure of Invention
The embodiment of the application provides a video classification method, a video classification device, video classification equipment and a storage medium based on multi-modal representation, which are used for reducing the difficulty of model learning and improving the efficiency of model training.
In one aspect, a method for video classification based on multi-modal representation is provided, the method comprising:
acquiring data information of a target video corresponding to each modality, and inputting the data information of each modality into a trained target multi-modal video representation model;
obtaining the video service category of the target video output by the target multi-modal video representation model in a target service scene;
the target multi-modal video representation model is obtained by performing video-domain adaptive pre-training on the basis of base video data sample sets corresponding to the respective modalities and then retraining on the basis of video service data sample sets corresponding to the respective modalities in the target service scene, each base video data sample set comprising base video data samples of one and the same modality, and each video service data sample set comprising video service data samples of one and the same modality.
In one aspect, an apparatus for video classification based on multi-modal representation is provided, the apparatus comprising:
the data acquisition unit is used for acquiring data information of a target video corresponding to each modality and inputting the data information of each modality into a trained target multi-modal video representation model;
the video classification unit is used for obtaining the video service category, in the target service scene, of the target video output by the target multi-modal video representation model;
the target multi-modal video representation model is obtained by performing video-domain adaptive pre-training on the basis of base video data sample sets corresponding to the respective modalities and then retraining on the basis of video service data sample sets corresponding to the respective modalities in the target service scene, each base video data sample set comprising base video data samples of one and the same modality, and each video service data sample set comprising video service data samples of one and the same modality.
Optionally, the target multi-modal video representation model includes a plurality of single-mode coding sub-models and a feature fusion sub-model for performing feature fusion on the single-mode feature vectors output by the plurality of single-mode coding sub-models, where each single-mode coding sub-model corresponds to one modality of the video;
the apparatus further comprises a model training unit for:
adopting the base video data sample set corresponding to each modality to perform the adaptive pre-training on the pre-training coding sub-model corresponding to that modality, so as to obtain the single-mode coding sub-model corresponding to each modality, wherein each pre-training coding sub-model is obtained by performing initial pre-training on the basis of a general data set of the corresponding modality;
performing iterative training on each obtained single-mode coding sub-model and the feature fusion sub-model by adopting the video service data sample set corresponding to each modality until a set convergence condition is met;
and outputting the trained target multi-modal video representation model when the set convergence condition is met.
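By way of illustration only, the staged procedure described above (modality-wise adaptive pre-training followed by joint training with the feature fusion sub-model) can be sketched as follows. This is a minimal sketch under assumed interfaces; the function names, optimizer settings and the simple step-count stand-in for the convergence condition are assumptions and not taken from the application.

```python
import torch
import torch.nn.functional as F

def train_target_model(encoders, fusion_model, base_sets, service_loader,
                       adapt_epochs=5, max_steps=10000):
    """Two-stage training sketch: names and interfaces are illustrative only."""
    # Stage 1: video-domain adaptive pre-training, one encoder per modality.
    for modality, encoder in encoders.items():
        opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
        for _ in range(adapt_epochs):
            for samples, base_labels in base_sets[modality]:
                logits = encoder.classify(encoder(samples))   # assumed per-modality head
                loss = F.cross_entropy(logits, base_labels)
                opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: retraining on service data, jointly with the feature fusion sub-model.
    params = list(fusion_model.parameters())
    for enc in encoders.values():
        params += list(enc.parameters())
    opt = torch.optim.Adam(params, lr=1e-5)
    for step, (batches, service_labels) in enumerate(service_loader):
        single_modal_vecs = [encoders[m](x) for m, x in batches.items()]
        video_vec = fusion_model(single_modal_vecs)
        loss = F.cross_entropy(fusion_model.classify(video_vec), service_labels)
        opt.zero_grad(); loss.backward(); opt.step()
        if step >= max_steps:          # stand-in for the set convergence condition
            break
    return encoders, fusion_model
```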
Optionally, the model training unit is specifically configured to:
for each video, the following operations are respectively executed:
for one video, respectively adopting each single-mode coding sub-model to perform feature coding on the video service data sample of the corresponding modality of the video, to obtain a plurality of single-mode feature vectors, wherein each single-mode feature vector corresponds to one modality;
performing feature fusion on the plurality of single-mode feature vectors by using the feature fusion sub-model to obtain a video feature vector of the video;
and respectively carrying out parameter adjustment on each single-mode coding sub-model and the feature fusion sub-model based on the obtained video feature vector and the plurality of single-mode feature vectors corresponding to each video.
Optionally, each single-mode coding sub-model includes at least one attention weight matrix, and the model training unit is specifically configured to:
extracting basic features of the video service data samples of each modality to obtain a basic feature vector corresponding to each modality;
respectively obtaining a weight feature vector set corresponding to each basic feature vector according to each obtained basic feature vector and the at least one attention weight matrix included in the corresponding single-mode coding sub-model; each weight feature vector included in each weight feature vector set corresponds to one attention weight matrix;
and obtaining the plurality of single-mode feature vectors according to the obtained weight feature vector sets.
Optionally, the at least one attention weight matrix includes a query vector weight matrix, a key vector weight matrix, and a value vector weight matrix; the model training unit is specifically configured to:
for each basic feature vector, respectively performing the following operations: for one basic feature vector, respectively obtaining a corresponding query vector, key vector and value vector according to the basic feature vector and the query vector weight matrix, the key vector weight matrix and the value vector weight matrix;
for each basic feature vector, respectively performing the following operations:
respectively obtaining the attention weight value corresponding to each basic feature vector according to the query vector of the one basic feature vector and the key vector of each basic feature vector, where the attention weight value is used for representing the degree of association between the video data of the video in the modality corresponding to each basic feature vector and the video data of the video in the modality corresponding to the one basic feature vector;
and obtaining the single-mode feature vector of the video in the modality corresponding to the one basic feature vector according to the value vector and the attention weight value corresponding to each basic feature vector.
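A minimal sketch of this per-modality attention computation is given below, assuming one basic feature vector per modality. For brevity a single shared set of query/key/value weight matrices is shown, whereas each single-mode coding sub-model may hold its own matrices; the scaling factor and the dimensions are likewise assumptions for illustration.

```python
import torch

def single_modal_feature_vectors(base_vectors, Wq, Wk, Wv):
    """base_vectors: (num_modalities, d) basic feature vectors, one row per modality.
    Wq/Wk/Wv: (d, d) query/key/value weight matrices (illustrative shapes)."""
    Q = base_vectors @ Wq                       # query vector of each modality
    K = base_vectors @ Wk                       # key vector of each modality
    V = base_vectors @ Wv                       # value vector of each modality
    # attention weights: degree of association between the modalities
    scores = Q @ K.t() / K.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    # each modality's single-mode feature vector is a weighted sum of the value vectors
    return weights @ V

d = 256
vectors = single_modal_feature_vectors(torch.randn(3, d),
                                       torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
```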
Optionally, the model training unit is specifically configured to obtain the video feature vector of the video in any one of the following manners:
performing vector splicing on the plurality of single-mode feature vectors in a set manner by using a vector splicing layer included in the feature fusion sub-model, to obtain the video feature vector of the video;
pooling the plurality of single-mode feature vectors by using a pooling layer included in the feature fusion sub-model, to obtain the video feature vector of the video;
performing a convolution operation, with a set step length, on a feature matrix composed of the plurality of single-mode feature vectors by using a convolution layer included in the feature fusion sub-model, to obtain the video feature vector of the video;
or mapping the plurality of single-mode feature vectors by using a fully connected layer included in the feature fusion sub-model, to obtain the video feature vector of the video.
Optionally, the model training unit is specifically configured to:
for each single-mode coding sub-model, respectively performing the following operations: determining the single-mode coding loss of the single-mode coding sub-model based on the single-mode feature vectors, output by the single-mode coding sub-model, that correspond to the respective videos;
determining the video representation loss of the multi-modal video representation model based on the video feature vectors corresponding to the respective videos;
obtaining the total model loss of the multi-modal video representation model based on the obtained single-mode coding loss corresponding to each single-mode coding sub-model and the video representation loss;
and adjusting the parameters of each single-mode coding sub-model and the feature fusion sub-model based on the total model loss.
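Purely as an illustration of how such a total loss could be assembled (the equal weighting of the terms below is an assumption, not something specified by the application):

```python
def total_model_loss(single_modal_losses, video_representation_loss, weights=None):
    """Combine the per-modality coding losses with the video representation loss."""
    if weights is None:
        weights = [1.0] * len(single_modal_losses)          # assumed equal weighting
    return sum(w * l for w, l in zip(weights, single_modal_losses)) + video_representation_loss

# total = total_model_loss([loss_text, loss_image, loss_audio], loss_video)
# total.backward()   # gradients reach every single-mode encoder and the fusion sub-model
```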
Optionally, the model training unit is specifically configured to:
for each video, the following operations are respectively executed:
determining a predicted video service category of one video in the target video service scene based on the video feature vector of the one video;
determining the video representation loss of the one video based on the obtained predicted video service category and the annotated video service category included in the service data sample of the one video;
and obtaining the video representation loss of the multi-modal video representation model based on the obtained video representation losses of the respective videos.
Optionally, one base video data sample set is any one of the following data sample sets:
a text data sample set comprising video text data samples for respective videos;
a set of image data samples comprising video image data samples of respective videos;
a set of audio data samples comprising video audio data samples of respective videos.
Optionally, the data obtaining unit is specifically configured to:
for each video, the following operations are respectively executed:
for one video, performing text extraction on the video by using a text extraction method to obtain the text in the video;
splicing the text in the video with the title text and the video introduction text of the video to obtain a spliced text corresponding to the video;
and performing numerical processing on the spliced text corresponding to each video to obtain the text indexes corresponding to the spliced text, wherein each text index corresponds to only one spliced text.
In one aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the above methods when executing the computer program.
In one aspect, a computer storage medium is provided having computer program instructions stored thereon that, when executed by a processor, implement the steps of any of the above-described methods.
In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of any of the methods described above.
In the embodiment of the application, when the target multi-modal video representation model is trained, adaptive pre-training in the video domain is first performed on the basis of the base video data samples, and after this pre-training is completed the model is retrained on the basis of the video service data samples of the target service scene. In addition, this staged training process makes the learning of the multi-modal representation model more hierarchical, so that its multi-modal representation capability is richer, the model adapts more easily to different types of downstream video classification tasks, and the accuracy of those downstream tasks is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for video classification based on multi-modal representation according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a target multi-modal video representation model provided by an embodiment of the present application;
FIG. 4 is a schematic data processing flow diagram of a target multi-modal video representation model according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a staged training of a target multi-modal video representation model according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a composition of a basic video data sample set according to an embodiment of the present application;
fig. 7 is a schematic diagram illustrating a general pre-training flow of a video encoder of an image modality according to an embodiment of the present application;
FIG. 8 is a general pre-training flow diagram of a text encoder for a text modality according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of adaptive pre-training by multi-modal collaborative learning according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a data processing flow in an adaptive pre-training process according to an embodiment of the present application;
FIG. 11 is a flowchart illustrating a basic feature vector process according to an embodiment of the present application;
fig. 12 is a schematic flowchart illustrating a process of obtaining a single-mode feature vector of an image mode according to an embodiment of the present application;
fig. 13 is a schematic flowchart of a process of performing target service scenario training at stage S3 according to the embodiment of the present application;
fig. 14 is a schematic structural diagram of a video classification apparatus based on multi-modal representation according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
For the convenience of understanding the technical solutions provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained first:
modality (modality): in the embodiment of the present application, each form of information may be referred to as a modality, and for video, the media of video information includes audio, video (referred to as video images) and text, and each form of media may be referred to as a modality of video, namely, an audio modality (referred to as a), a video modality (referred to as image modality) (referred to as v) and a text modality (referred to as t).
Multi-modal video representation model: refers to a multi-modal representation model for video. The multi-modal representation model is obtained through a multi-modal machine learning (MMML) process. Multi-modal machine learning aims at processing and understanding information from multiple source modalities through machine learning methods, for example multi-modal learning among video, audio and semantics (text); complementarity among the modalities is exploited and redundancy among them is eliminated so as to learn better feature representations. Multi-modal representation learning mainly involves two directions: joint representations and coordinated (collaborative) representations. A joint representation maps the information of the multiple modalities together into a unified multi-modal vector space, while a coordinated representation maps each modality into its own representation space, with the mapped vectors satisfying certain correlation constraints. The staged training process of the present application involves both kinds of multi-modal representation learning.
General data set: a data set that is not specific to any domain, or can be understood as a data set covering data of various domains. The general data set adopted in the embodiment of the present application may be an open-source data set; for example, for text the general data set may be the Wikipedia data set, for images it may be the ImageNet data set, and for audio it may be the YouTube-8M data set.
Set of base video data samples: a base video data sample set is specific to the video domain and contains base video data samples that all belong to the same modality. The base video data sample set of each modality includes a plurality of base video data samples, and each base video data sample includes the base data of one modality of one video together with a base type. The base type refers to the type of the video itself and does not relate to the category of video classification in a specific video service scene; for example, the base type may be a TV series, a drama, a movie, a suspense series, a music video (MV), and the like. For example, for the image modality, the base video data sample set may be made up of the video frame image samples of a plurality of videos, where each sample may contain a plurality of video image frames of one video and a base type tag for that video.
Video service data sample set: the video service data sample set is for a specific video service scene, and the video service data sample sets corresponding to different video service scenes may be different. For each modality, there may be a corresponding set of video service data samples, each set of video service data samples including a plurality of video service data samples in the modality, and each video service data sample may include service data of one modality of one video and a video service category. For example, for an image modality, a video service data sample set may be composed of a plurality of video frame image samples of a video, and each video frame image sample may contain a plurality of video image frames of a video and related service data of the video, for example, may include a video service category of the video in the video service scene, and may also include the video.
Video service category: refers to the service classification of a video in the target video service scene. The video service categories may differ for different video service scenes, and the specific division of video service categories can be set based on the requirements of the specific service scene.
For example, when the target video service scene is a video recommendation service scene, classification may be performed according to the recommendation degree, that is, the category represents how strongly a video is recommended for a certain user, such as strongly recommended, relatively recommended, generally recommended, and not recommended. When the target video service scene is a video operation classification scene, the video service category may be the video operation category, so as to facilitate video distribution by operators. When the target video service scene is a title-party (clickbait) identification scene, the video service category may characterize how well the video title conforms to the video content, for example conforming or not conforming.
Attention (attention) mechanism: the attention mechanism is modeled on the human visual attention mechanism. When perceiving things, people generally do not look at a scene from beginning to end; they look at a specific part according to need, and when they find that something they want to observe often appears in a certain part of a scene, they learn to pay attention to that part when similar scenes reappear. The attention mechanism is therefore essentially a means of screening high-value information out of a large amount of information in which different pieces of information have different importance to the result; this importance can be reflected by assigning weights of different sizes. In other words, the attention mechanism can be understood as a rule for assigning weights when synthesizing multiple sources. Specifically, the attention mechanism realizes a mapping from a query and a series of key-value pairs to an output result, where the query, the keys and the values are all vectors; the output result is calculated as a weighted sum over the values, and the weight corresponding to each value is calculated from the query and the corresponding key through a compatibility function.
Namely:

Attention(Query, Source) = Σ_{i=1}^{L_x} Similarity(Query, Key_i) · Value_i

where Attention(Query, Source) denotes the output result obtained based on the attention mechanism, Similarity may be a weight calculated with a Softmax normalization function whose inputs are the query vector and the keys, Source is the set of key-value pairs, and L_x is the length of Source, that is, the number of key-value pairs in Source. The above describes a single attention mechanism, which uses one group of queries; a multi-head attention mechanism includes multiple groups of queries, computes a weighted-sum result for each group, and finally concatenates the results of the groups to obtain the final result.
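As a concrete illustration of the weighted-sum computation just described, the following sketch shows a single attention step and the use of a multi-head attention module; the dimensions and the choice of torch.nn.MultiheadAttention are assumptions made for illustration, not the implementation of this application.

```python
import torch

def attention(query, keys, values):
    """Single attention step: query (d,), keys/values (L_x, d)."""
    scores = keys @ query / query.shape[-1] ** 0.5   # similarity between the query and each key
    weights = torch.softmax(scores, dim=0)           # one weight per key-value pair
    return weights @ values                          # weighted sum of the values

# Multi-head attention: several query/key/value projection groups whose results
# are concatenated, as described above.
mha = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(1, 16, 256)                          # a toy sequence of 16 vectors
out, attn_weights = mha(x, x, x)                     # self-attention over the sequence
```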
Embodiments of the present application relate to artificial intelligence and Machine Learning (ML) technologies, and are designed based mainly on Machine Learning in artificial intelligence.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision technology (CV) is a science for researching how to make a machine "see", and further refers to that a camera and a Computer are used to replace human eyes to perform machine Vision such as identification, tracking and measurement on a target, and further image processing is performed, so that the Computer processing becomes an image more suitable for human eyes to observe or transmit the image to an instrument to detect. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, synchronous positioning, map construction, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
Machine learning is a multi-field cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
An Artificial Neural Network (ANN) abstracts the neuron network of the human brain from an information-processing perspective, establishes a simple model, and forms different networks according to different connection modes. A neural network is a computational model formed by a large number of interconnected nodes (neurons). Each node represents a specific output function called an activation function; the connection between every two nodes carries a weight for the signal passing through that connection, which is equivalent to the memory of the artificial neural network. The output of the network differs according to the connection mode, the weight values and the activation functions. The network itself is usually an approximation of some algorithm or function, and may also be the expression of a logical strategy.
When the characteristics of the video are expressed, the artificial neural network model based on deep learning is adopted. The video feature representation in the embodiments of the present application may be divided into two parts, including a training part and an application part. In the training part, an artificial neural network model (namely a multi-mode video representation model mentioned later) is trained by the machine learning technology, so that the artificial neural network model is trained in stages based on multi-mode data of each video given in the embodiment of the application, and model parameters are continuously adjusted by an optimization algorithm until the model converges; the application part is used for performing feature representation on the video by using the coding part in the artificial neural network model obtained by training of the training part, performing classification prediction on the video based on the obtained video feature vector, and the like. In addition, it should be further noted that, in the embodiment of the present application, the artificial neural network model may be trained online or offline, and is not limited herein. This is exemplified herein by off-line training.
The following briefly introduces the design concept of the embodiments of the present application.
Generally, the multi-modal representation model has too many parameters, so that the model training difficulty is high, the model is difficult to converge, and the time consumption is long.
Considering that, when a multi-modal representation model is trained, the model is often trained directly on top of open-source pre-trained models while adaptive transfer learning of the different modalities to the current service scene is neglected, the model is difficult to train and a good effect is hard to obtain. Therefore, adaptive training can be added so that the model is better adapted to the current service scene.
In view of this, an embodiment of the present application provides a video classification method based on multi-modal representation, in the method, when a target multi-modal video representation model is trained, firstly, adaptive pre-training in a video domain is performed on the model based on a basic video data sample, and after the pre-training is completed, re-training is performed based on a video service data sample of a target service scene, so that the model has a certain preliminary expression capability by performing the pre-training in the video domain on the model, and further, the model training difficulty of a downstream video related task is lower, the model is easier to converge, the time consumed by the model training is correspondingly reduced, and the model training efficiency is higher. In addition, the multi-modal representation model is more hierarchical in learning through the staged training process, so that the multi-modal representation capability is richer, the model is convenient to adapt to different types of downstream video classification tasks, and the accuracy of the downstream video classification task is improved.
In addition, with direct end-to-end training, the large number of parameters makes the model difficult to converge. Multi-task learning can therefore be performed on the different modalities of the multi-modal representation model, which enriches the representation capability of each single modality and improves generality across different downstream tasks. Accordingly, in the embodiment of the application, the pre-training stage uses the sample data of each modality to perform feature learning for that modality, which reduces the difficulty of model convergence.
After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In a specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The scheme provided by the embodiment of the application can be suitable for most video service classification scenes, such as video recommendation classification, video operation classification and video title party identification service scenes. As shown in fig. 1, an application scenario provided in the embodiment of the present application is schematically illustrated, and in this scenario, a terminal device 101 and a server 102 may be included.
The terminal device 101 may be, for example, a mobile phone, a tablet computer (PAD), a laptop computer, a desktop computer, a smart television, a smart wearable device, and the like. The terminal device 101 may be installed with an application capable of performing a video service, such as a browser or a video client, and a user may perform a corresponding video service on the application. The application related to the embodiment of the application can be a software client, and can also be a client such as a webpage and an applet, and the background server is a background server corresponding to the software or the webpage and the applet, and the specific type of the client is not limited.
The server 102 may be a background server corresponding to a client installed on the terminal device 101, for example, an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The server 102 may include one or more processors 1021, memory 1022, and an I/O interface 1023 to interact with the terminal, among other things. In addition, the server 102 may further configure a database 1024, and the database 1024 may be used to store data information of each modality of each video, trained model parameters, and the like. The memory 1022 of the server 102 may further store program instructions of the video classification method based on multi-modal representation provided in the embodiment of the present application, and when executed by the processor 1021, the program instructions can be used to implement the steps of the video classification method based on multi-modal representation provided in the embodiment of the present application, so as to obtain the video service category of the video in the target service scene.
Terminal device 101 and server 102 may be communicatively coupled directly or indirectly through one or more networks 103. The network 103 may be a wired network or a Wireless network, for example, the Wireless network may be a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, or may also be other possible networks, which is not limited in this embodiment of the present invention.
In the embodiment of the present application, the number of the terminal apparatuses 101 may be one, or may be multiple, and similarly, the number of the servers 102 may also be one, or may be multiple, that is, the number of the terminal apparatuses 101 or the servers 102 is not limited.
In a possible implementation manner, the video classification method according to the embodiment of the present application may be applied to a video recommendation classification scenario, so that an application installed on the terminal device 101 may be a video playing client provided for a user, and for each video, the recommendation degree for each user may be output by using the video classification method according to the embodiment of the present application, so that video recommendation is performed for each user based on the recommendation degree, videos recommended to the user may be presented on the video playing client, and the user may view the videos based on the video playing client.
In a possible implementation manner, the video classification method according to the embodiment of the present application may be applied to an operation classification scenario, and then an application installed on the terminal device 101 may be a video management client provided for a video operator, and for each video, the operation type of the video may be output by using the video classification method according to the embodiment of the present application, so as to assist the operator in video distribution, and the operator does not need to consider to determine the operation type of the video.
In a possible application scenario, the account data (such as a historical operation object sequence and a corresponding operation object attribute sequence), the similar account relationship, and the model parameter in the application may be stored by using a cloud storage technology. The distributed cloud storage system refers to a storage system which integrates a large number of storage devices (or called storage nodes) of different types in a network through application software or application interfaces to cooperatively work together through functions of cluster application, grid technology, distributed storage file system and the like, and provides data storage and service access functions to the outside.
In one possible application scenario, the servers 102 may be deployed in different regions to reduce communication delay, or different servers 102 may respectively serve the regions corresponding to different terminal devices 101 for load balancing. The plurality of servers 102 share data through a blockchain, and the plurality of servers 102 correspond to a data sharing system formed by these servers 102. For example, a terminal device 101 located at site a is communicatively connected to one server 102, and a terminal device 101 located at site b is communicatively connected to another server 102.
Each server 102 in the data sharing system has a node identifier corresponding to that server 102, and each server 102 in the data sharing system may store the node identifiers of the other servers 102 in the data sharing system, so that a generated block can be broadcast to the other servers 102 in the data sharing system according to their node identifiers. Each server 102 may maintain a node identifier list, in which the server 102 name and the node identifier are stored. The node identifier may be an Internet Protocol (IP) address or any other information that can be used to identify the node.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
Referring to fig. 2, which is a schematic flowchart of a video classification method based on multi-modal representation according to an embodiment of the present application, the method may be executed by the server 102 or the terminal device 101 in fig. 1, or jointly by the server 102 and the terminal device 101. The following description takes execution by the server 102 as an example, and the flow of the method is described as follows.
Step 201: acquiring data information of the target video corresponding to each modality, and inputting the data information of each modality into the trained target multi-modal video representation model.
In the embodiment of the present application, the data information of the target video corresponding to each modality may include one or more of the following information:
(1) text data information in the target video.
The text data information may include all possible text content related to the target video. For example, in addition to the title and the video introduction of the video, it may also include text presented on the video pictures of the target video, such as lyrics, subtitles, product text, and the video speech text. After these text contents are obtained, they may be spliced to obtain a spliced text corresponding to the target video, and the spliced text may then be processed numerically, for example by using a dictionary to digitize the spliced text.
The product text refers to text on a product presented in a video picture (such as a product name or manufacturer). Lyrics, subtitles, product text, and the like can be extracted from the video pictures by using Optical Character Recognition (OCR) technology. The video speech text refers to text converted from the speech in the target video (such as character dialogue in a TV series or the lyrics in an MV), and this text content can be extracted by using Automatic Speech Recognition (ASR) technology.
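A minimal sketch of this text preparation step is shown below, assuming the OCR and ASR results are already available as lists of strings and that a simple character-level dictionary is used for the numerical processing; the tokenization scheme, the special tokens and the fixed length are assumptions for illustration.

```python
def build_text_input(title, intro, ocr_texts, asr_texts, vocab, max_len=256):
    """Splice the video-related texts and map them to token indexes using a dictionary."""
    spliced = " ".join([title, intro, *ocr_texts, *asr_texts])
    tokens = list(spliced)                                  # character-level split (assumed)
    unk_id = vocab.get("[UNK]", 0)
    ids = [vocab.get(tok, unk_id) for tok in tokens][:max_len]
    ids += [vocab.get("[PAD]", 0)] * (max_len - len(ids))   # pad to a fixed length
    return ids
```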
(2) Image data information in target video
The image data information refers to video frames of the target video. The video images may include all the images of the target video or only some of them; when only some images are included, part of the video frames may be extracted from the target video. These video frames may be, for example, key frames of the target video, or a number of video frames that best reflect the video content, selected after image recognition is performed on each video frame with a certain image recognition method. For example, in practical applications, FFMPEG (a multimedia processing tool) can be used to extract a specific number of video frames (e.g., 30 frames) from the target video.
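As an illustration, frame extraction with the ffmpeg command-line tool might look as follows; the one-frame-per-second sampling rate and the output naming are assumptions, since the application only states that a fixed number of frames (e.g., 30) is extracted.

```python
import subprocess

def extract_frames(video_path, out_dir, num_frames=30):
    """Extract at most num_frames frames from the video with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vf", "fps=1",                        # sample one frame per second (assumed)
         "-frames:v", str(num_frames),          # keep at most num_frames frames
         f"{out_dir}/frame_%03d.jpg"],
        check=True)
```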
(3) Audio data information in target video
The audio data information may include the complete audio of the target video or only part of it; when only part of the audio is included, partial audio clips may be extracted from the complete audio of the target video. The partial audio clips may be a continuous segment of the complete audio, or may be obtained by extracting multiple audio clips and splicing them. For example, FFMPEG can be used to extract the audio, which is then converted into Mel-frequency cepstral features.
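By way of illustration, the audio track can be extracted with ffmpeg and converted into Mel-frequency cepstral features; the use of librosa, the sampling rate and the number of coefficients below are assumptions, as the application only mentions FFMPEG and Mel-frequency cepstral features.

```python
import subprocess
import librosa

def extract_audio_features(video_path, wav_path="audio.wav", n_mfcc=40):
    """Extract the audio track and compute Mel-frequency cepstral coefficients."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",   # drop the video stream
                    "-ac", "1", "-ar", "16000", wav_path], check=True)
    waveform, sample_rate = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
```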
Step 202: obtaining the video service category of the target video output by the target multi-modal video representation model in the target service scene; the target multi-modal video representation model is obtained by performing video-domain adaptive pre-training on the basis of the base video data sample sets corresponding to the respective modalities and then retraining on the basis of the video service data sample sets corresponding to the respective modalities in the target service scene, where each base video data sample set comprises base video data samples of one and the same modality, and each video service data sample set comprises video service data samples of one and the same modality.
In the embodiment of the application, the obtained data information of the multiple modes of the target video is input into the target multi-mode video representation model, and after the data information is processed by the target multi-mode video representation model, the video service category of the target video output by the target multi-mode video representation model in the target service scene can be obtained, so that the related video service in the target service scene can be developed according to the video service category of the target video.
The target multi-modal video representation model is obtained by performing adaptive pre-training of a video domain based on a base video data sample set corresponding to each modality and retraining based on a video service data sample set corresponding to each modality in a target service scene, and the multi-modal representation model is more hierarchical in learning through a staged training process, so that the multi-modal representation capability is richer, the model can conveniently adapt to different types of downstream video classification tasks, and the accuracy of the downstream video classification task is further improved.
Referring to fig. 3, a schematic diagram of a structure of a target multi-modal video representation model is shown. The target multi-modal video representation model includes a plurality of single-modal encoding submodels and a feature fusion submodel, one single-modal encoding submodel corresponds to one modality and is used for performing feature encoding on data information of the one modality, as shown in fig. 3, the video encoder, the text encoder and the audio encoder are all single-modal encoding submodels.
(1) Video encoder
Since the video encoder corresponds to the image modality and performs feature encoding of the image data information of the target video, any encoder structure capable of encoding video images may be adopted; for example, structures such as Inception-ResNet-v1, EfficientNet, Inception, ResNet, or Inception-v4 may be adopted, which are not exhaustively listed here.
(2) Text encoder
The text encoder corresponds to the text modality and performs feature encoding of the text data information of the target video. In general, any encoder structure capable of text encoding can be adopted, for example an ALBERT, BERT or ELECTRA structure, which are not exhaustively listed here.
(3) Audio encoder
The audio encoder corresponds to the audio modality and performs feature encoding of the audio data information of the target video. In general, any encoder structure capable of audio encoding can be adopted, for example a VGGish structure, which is not exhaustively listed here.
Based on the target multi-modal video representation model of FIG. 3, the above step 202 can include the following steps 2021-2024. Referring to fig. 4, a schematic diagram of a data processing flow for a model representation of a target multimodal video is shown.
Step 2021: and carrying out feature coding on the data information of the corresponding mode by utilizing each single-mode coding sub-model to respectively obtain the basic feature vector of each mode.
As shown in fig. 3, for an image modality, a video encoder is used to perform feature encoding on image data information of a target video to obtain an image basis feature vector, and similarly, for a text modality, a text encoder is used to obtain a text basis feature vector, and for an audio modality, an audio encoder is used to obtain an audio basis feature vector.
Step 2022: and obtaining the single-mode feature vector of each mode based on the obtained basic feature vectors.
Since the data information of each modality comes from the same video, the modalities are related to a certain extent. Therefore, as shown in fig. 3, after feature coding is performed by each encoder to obtain the basic feature vectors, in order to make the feature vector representation of each modality more accurate, information from the other modalities can be blended into the feature vector of each modality based on a multi-modal collaborative representation, so that the finally obtained single-mode feature vectors satisfy certain correlation constraints.
In a possible implementation manner, the features of the modalities may be fused by using a Transformer, or the encoder portion of an encoder-decoder structure, to obtain the final representation of each modality, that is, the single-mode feature vector of each modality; the specific process, which is implemented with an attention mechanism, is described in detail in the subsequent training process and is not repeated here. Of course, other possible ways of multi-modal collaborative representation may be adopted, such as L2 regularization, which is not limited by the embodiments of the present application.
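A minimal sketch of this kind of cross-modal fusion with a Transformer encoder is shown below: the three basic feature vectors are treated as a length-3 token sequence so that each modality's output vector reflects the other modalities. The layer sizes and the number of layers are assumptions for illustration.

```python
import torch

encoder_layer = torch.nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
cross_modal_encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=2)

base_text = torch.randn(1, 1, 256)     # basic feature vector of the text modality
base_image = torch.randn(1, 1, 256)    # basic feature vector of the image modality
base_audio = torch.randn(1, 1, 256)    # basic feature vector of the audio modality
tokens = torch.cat([base_text, base_image, base_audio], dim=1)   # (batch, 3, 256)

single_modal_vectors = cross_modal_encoder(tokens)   # each row now blends all modalities
text_vec, image_vec, audio_vec = single_modal_vectors.unbind(dim=1)
```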
Step 2023: and performing feature fusion by using the feature fusion sub-model based on the obtained single-mode feature vectors to obtain the video feature vectors.
In the embodiment of the present application, the feature fusion refers to integrating information included in each single-mode feature vector, and the feature fusion may be implemented in any one of the following manners:
(1) vector stitching
When feature fusion is performed in a vector splicing manner, the feature fusion sub-model may include a vector splicing layer, so that vector splicing is performed on a plurality of single-mode feature vectors according to a set manner through the vector splicing layer, and a video feature vector of the target video is obtained. For example, the individual single-mode feature vectors may be concatenated after the previous single-mode feature vector, such as concatenating the single-mode feature vector of the text modality after the single-mode feature vector of the image modality, and concatenating the single-mode feature vector of the audio modality after the single-mode feature vector of the text modality.
After feature fusion is performed by vector splicing, the resulting spliced feature vector has a higher dimensionality, and a certain feature reduction method can be applied to the spliced feature vector to obtain the video feature vector.
(2) Pool characterization (pooling)
When feature fusion is performed in a feature pooling manner, the feature fusion sub-model may include a pooling layer, so as to perform pooling processing on a plurality of single-mode feature vectors through the pooling layer, and obtain video feature vectors of the target video.
Specifically, the pooling may be performed by using a pooling manner such as max-pooling or mean-pooling, which is not limited in the embodiment of the present application.
(3) Convolution (Convolution) processing
When feature fusion is performed through convolution processing, the feature fusion sub-model may include a convolution layer, so that a feature matrix composed of a plurality of single-mode feature vectors is subjected to convolution operation through the convolution layer by using a set step length to obtain a video feature vector of the target video.
Specifically, the convolutional layer may include at least one weight matrix, and parameters in the weight matrix may be obtained through training, and the feature matrix is convolved by the weight matrix, so as to obtain a video feature vector of the target video.
(4) Fully connected processing
When feature fusion is performed through fully connected processing, the feature fusion sub-model may include a fully connected (FC) layer that maps the plurality of single-mode feature vectors to obtain the video feature vector of the target video.
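The following sketch illustrates two of the fusion options above (vector splicing followed by a fully connected mapping, and mean-pooling); the dimensions, the module layout and the subset of options shown are assumptions for illustration.

```python
import torch

class FeatureFusion(torch.nn.Module):
    """Sketch of two of the fusion options described above."""
    def __init__(self, dim=256, num_modalities=3, mode="concat"):
        super().__init__()
        self.mode = mode
        self.fc = torch.nn.Linear(dim * num_modalities, dim)   # used by the "concat" mode

    def forward(self, single_modal_vectors):        # (batch, num_modalities, dim)
        if self.mode == "concat":                   # vector splicing + fully connected mapping
            return self.fc(single_modal_vectors.flatten(start_dim=1))
        return single_modal_vectors.mean(dim=1)     # mean-pooling over the modalities

fusion = FeatureFusion(mode="concat")
video_vector = fusion(torch.randn(2, 3, 256))       # (batch, dim) video feature vectors
```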
Step 2024: and performing type prediction based on the video feature vector to obtain the video service class of the target video.
In the embodiment of the present application, the type prediction may be implemented by using any possible classifier, for example, FC or softmax methods may be used.
In the following, a specific description will be given of the training process of the target multi-modal video representation model. Referring to fig. 5, a schematic diagram of a staged training process of a target multi-modal video representation model is shown, where the training process specifically includes 3 stages, namely, a general pre-training stage, a video domain adaptive pre-training stage, and a target business scenario training stage.
S1: a general pre-training phase.
The general pre-training stage mainly learns from general data of each modality to obtain preliminary pre-training coding sub-models; as shown in fig. 5, in this stage the initially constructed multi-modal video representation model is trained with general data sets.
In the embodiment of the application, for the video encoder of the image modality, a general data set of the image modality, for example the open-source ImageNet data set, may be used for image classification task learning, and the obtained video encoder is denoted g(v); for the audio encoder of the audio modality, a general data set of the audio modality, for example the open-source YouTube-8M data set, may be used for audio classification task learning, and the obtained audio encoder is denoted g(a); for the text encoder of the text modality, a general data set of the text modality, for example open-source Wikipedia data, may be used for masked language model (MLM) task learning, and the obtained text encoder is denoted g(t).
It should be noted that the general pre-training stage is optional; in a specific application, existing pre-trained models of each modality can be used directly, that is, their parameters are borrowed, so that the general pre-training stage can be skipped and model training efficiency further improved.
S2: a video domain adaptive pre-training phase.
After each single-mode coding sub-model has been pre-trained in stage S1, it can already produce preliminary representations for its modality, and the second stage, video domain adaptive pre-training, can begin. This stage mainly learns from basic video data of each modality so as to obtain single-mode coding sub-models capable of encoding video. In stage S2, the pre-training coding sub-model of each modality (obtained in stage S1) is adaptively pre-trained with the basic video data sample set of the corresponding modality to obtain the single-mode coding sub-model of that modality; as shown in fig. 5, in the video domain adaptive pre-training stage, each pre-training coding sub-model obtained in stage S1 is trained again using the basic video data sample sets.
Specifically, each basic video data sample set includes the basic video data samples of the same modality for the respective videos, which may come from one or more video platforms. Referring to fig. 6, a schematic diagram of the composition of the basic video data sample sets: if N videos are collected from a certain video platform, the following basic video data sample sets can be formed from the data information of each modality of the N videos:
(1) Set of image data samples V
The image data sample set V comprises N image data samples, see v1-vn in fig. 6. Each image data sample corresponds to one video and includes image data from that video, and each image data sample is labeled with a video classification for subsequent supervised learning of the video encoder; that is, each image data sample may further include a video basic type label (the video label in fig. 6), such as TV series, movie, swordplay drama, or sitcom.
The image data may be an original video frame extracted from a video, or may be image data obtained by performing a certain image processing, for example, pixel value data obtained by extracting pixel values from the extracted original video frame.
(2) Set of audio data samples A
The audio data sample set A includes N audio data samples, see a1-an in fig. 6. Each audio data sample corresponds to one video and includes audio data from that video; similarly, each audio data sample is labeled with a video classification, that is, it may further include a video basic type label. The video classification labeling only needs to be performed once per video: after a video has been labeled, its audio data sample and image data sample can both be associated with the same video basic type label.
The audio data may be an original audio segment extracted from the video, or audio data obtained through certain audio processing, for example, audio features obtained by performing signal processing on the extracted audio segment, such as converting it into Mel-frequency cepstrum (MFC) features.
(3) Set of text data samples T
The text data sample set T includes N text data samples, see t1-tn in fig. 6, each corresponding to one video and including text data from that video. The training of the text encoder may be performed in an unsupervised manner, for example, with masked language model task learning.
The text data samples are taken from the various sources of text associated with the respective videos. Specifically, when the text data sample of video A is obtained, a text extraction method may first be used to obtain the in-video text, for example, OCR may extract text from the video frames and ASR may transcribe the speech in the video; then the title text, the video introduction text, and the in-video text of video A may be spliced to obtain the spliced text corresponding to video A.
The spliced texts of the respective videos are then numerically processed to obtain the text indices corresponding to the spliced texts. Each text index represents the text content of its video and uniquely corresponds to one spliced text, and the text indices can participate in the subsequent training process. A sketch of preparing these samples is given below, after the next paragraph.
In practical applications, a base video data sample set may be any one of the above data sample sets.
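The following sketch illustrates, under assumed tooling (OpenCV for frame extraction, librosa for Mel features, a generic whitespace tokenizer for the text index), how the three kinds of basic video data samples described above could be prepared; none of the helper names, sampling rates, or parameters come from the patent itself.

```python
import cv2                    # frame extraction (assumption: OpenCV is available)
import librosa                # Mel-frequency features (assumption)

def extract_frames(video_path: str, every_n: int = 30):
    """Image data sample: keep one frame out of every `every_n` frames."""
    cap, frames, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)          # raw pixel values of the sampled frame
        idx += 1
    cap.release()
    return frames

def extract_mel_features(audio_path: str, n_mels: int = 64):
    """Audio data sample: convert the audio track into log-Mel features."""
    y, sr = librosa.load(audio_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)

def build_spliced_text(title: str, intro: str, ocr_text: str, asr_text: str) -> str:
    """Text data sample: splice title, introduction, and in-video text."""
    return " ".join([title, intro, ocr_text, asr_text])

def text_to_index(spliced_text: str, vocab: dict) -> list:
    """Numerical processing: map the spliced text to an index sequence."""
    return [vocab.get(tok, vocab.get("[UNK]", 0)) for tok in spliced_text.split()]
```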
In the embodiment of the present application, for the video encoder of the image modality (after stage S1 training is completed), classification task learning may be performed with the image data sample set V, and the resulting video encoder is denoted d(v); for the audio encoder of the audio modality, classification task learning may be performed with the audio data sample set A, and the resulting audio encoder is denoted d(a); for the text encoder of the text modality, masked language model task learning may be performed with the text data sample set T, and the resulting text encoder is denoted d(t).
It should be noted that in stage S2 the data sets of the respective modalities are associated: the image data sample, the audio data sample, and the text data sample may all come from the same video. Therefore, a multi-modal collaborative learning manner can be adopted in stage S2, so that the trained single-mode coding sub-models satisfy a certain degree of cross-modal association. Of course, multi-modal collaborative learning is only an optional implementation and can be chosen according to actual requirements.
S3: and a target business scenario training stage.
After the adaptive pre-training of stage S2, the single-mode coding sub-models of each modality can represent the video domain, and the third stage, training for the target business scenario, can begin. This stage mainly learns from the video service data of each modality. As shown in fig. 5, in the target service scenario training stage, each single-mode coding sub-model obtained in stage S2, together with the other parts of the target multi-modal video representation model such as the feature fusion sub-model, is iteratively trained with the video service data sample set of the corresponding modality; that is, in this stage the complete target multi-modal video representation model can be trained end to end.
Specifically, each video service data sample set includes the video service data samples of the same modality, and the videos may come from a specific business scenario of the video platform. Similar to the composition shown in fig. 6, the video service data sample sets may include an image data sample set V, an audio data sample set A, and a text data sample set T; in addition, a video service category is labeled for each video.
It should be noted that, unlike the video basic type label of stage S2, the video service category labeled in stage S3 refers to a type within the target business scenario, such as the recommendation level, whether the video is clickbait (i.e., the video content does not match its title), or an operation category. Of course, the video service data samples in each set may also carry a video basic type label similar to that of stage S2, and the stage S3 basic type label may be the same as or different from the stage S2 one.
For example, in stage S2 the image data sample of a video may include the image data of the video and the basic type to which the video belongs (e.g., TV drama, movie, or MV); in stage S3 the image data sample of a video may also include the image data and a basic type (e.g., swordsman drama or urban sitcom), that is, the video basic type label in stage S3 may differ from that in stage S2. Meanwhile, for the same business scenario (e.g., clickbait or not), the video service category labels of the different modalities of one video are the same.
In the embodiment of the present application, for the video encoder of the image modality (after stage S2 training is completed), classification task learning may be performed with the stage S3 image data sample set V, and the resulting video encoder is denoted t(v); for the audio encoder of the audio modality, classification task learning may be performed with the stage S3 audio data sample set A, and the resulting audio encoder is denoted t(a); for the text encoder of the text modality, masked language model task learning may be performed with the stage S3 text data sample set T, and the resulting text encoder is denoted t(t). Accordingly, the classifiers c(v), c(a), and c(t), each with a classification loss function for its modality, may be constructed.
Further, the features encoded by t(v), t(a), and t(t) are fused, joint feature learning is performed through a multi-layer feed-forward network to obtain the final representation f of the target multi-modal video representation model, and a classifier c(f) with a classification loss function is constructed on this representation.
Finally, the losses of the single-mode classifiers c(v), c(a), c(t) and of the fused classifier c(f) are summed for joint learning, yielding the final target multi-modal video representation model.
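A compressed sketch of this joint stage-S3 setup; the encoder placeholders, dimensions, and class count are assumptions, and only the structure (per-modality heads c(v), c(a), c(t), a fused head c(f), and a summed loss) follows the description above.

```python
import torch
import torch.nn as nn

class JointMultiModalModel(nn.Module):
    """Sketch of stage S3: per-modality encoders t(v), t(a), t(t), a fusion
    sub-model producing f, and classifiers c(v), c(a), c(t), c(f)."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        # placeholders standing in for the pre-trained encoders from stage S2
        self.enc_v, self.enc_a, self.enc_t = (nn.Linear(dim, dim) for _ in range(3))
        self.fusion = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))   # feed-forward fusion -> f
        self.c_v, self.c_a, self.c_t, self.c_f = (nn.Linear(dim, num_classes)
                                                  for _ in range(4))

    def forward(self, v, a, t):
        hv, ha, ht = self.enc_v(v), self.enc_a(a), self.enc_t(t)
        f = self.fusion(torch.cat([hv, ha, ht], dim=-1))
        return self.c_v(hv), self.c_a(ha), self.c_t(ht), self.c_f(f)

def joint_loss(outputs, label):
    # sum of the single-mode classification losses and the fused-representation loss
    ce = nn.CrossEntropyLoss()
    return sum(ce(o, label) for o in outputs)
```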
Next, a general pre-training procedure for each modality at stage S1 will be described.
Since the training processes of the single-mode coding sub-models of the image modality and the audio modality are similar, the image modality is taken as an example here; refer to fig. 7, a general pre-training flowchart of the video encoder of the image modality.
Step 701: perform feature coding on each image data sample in the general data set with the video encoder to obtain image feature vectors.
Step 702: obtain the prediction type of each image based on its image feature vector.
Step 703: calculate the coding loss of the video encoder based on the prediction types and the annotation types.
In the embodiment of the present application, the coding loss may be calculated with a cross-entropy loss function; of course, other loss functions, such as the 0-1 (zero-one) loss, may also be used, which is not limited in this embodiment of the present application.
Step 704: and judging whether the video encoder reaches a convergence condition or not.
The convergence condition may include any one of the following conditions:
(1) the coding loss of the video encoder is not greater than the set loss threshold.
(2) The iteration number of the video encoder is larger than a set number threshold.
Step 705: if the determination in step 704 is negative, adjust the parameters of the video encoder based on the coding loss and enter the next iteration with the adjusted video encoder, i.e., jump back to step 701.
Step 706: if the determination result in step 704 is yes, the training is ended.
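Steps 701-706 above correspond to a standard supervised training loop. The sketch below (PyTorch, with assumed model, dataloader, optimizer, and threshold values) shows one possible form, using the cross-entropy loss and the two convergence conditions mentioned.

```python
import torch
import torch.nn as nn

def pretrain_encoder(encoder, classifier, dataloader,
                     max_iters=10000, loss_threshold=0.05):
    """Sketch of the general pre-training loop of fig. 7 (steps 701-706)."""
    params = list(encoder.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    criterion = nn.CrossEntropyLoss()         # step 703: coding loss
    it = 0
    while True:
        for images, labels in dataloader:
            features = encoder(images)        # step 701: feature coding
            logits = classifier(features)     # step 702: prediction type
            loss = criterion(logits, labels)  # step 703
            # step 704: convergence check (loss threshold or iteration count)
            if loss.item() <= loss_threshold or it >= max_iters:
                return encoder                # step 706: end training
            optimizer.zero_grad()
            loss.backward()                   # step 705: adjust parameters
            optimizer.step()
            it += 1
```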
Referring to fig. 8, a general pre-training flow diagram of a text encoder for a text modality is shown.
Step 801: for each text data sample, randomly masking a portion of the text content of the text data sample.
Step 802: perform feature coding on the unmasked text content of each text data sample with the text encoder to obtain text feature vectors.
Step 803: based on the respective text feature vectors, the masked text content in the respective text data sample is predicted.
Step 804: the coding loss of the text encoder is calculated based on the predicted masked text content and the actual masked text content.
In the embodiment of the present application, the coding loss of the text encoder may also be calculated by using a cross entropy loss function, and of course, other possible loss functions may also be used, such as a 0-1 loss function, which is not limited in this embodiment of the present application.
Step 805: it is determined whether the text encoder has reached a convergence condition.
The convergence condition may include any one of the following conditions:
(1) the encoding loss of the text encoder is not more than the set loss threshold.
(2) The iteration number of the text encoder is larger than a set number threshold.
Step 806: if the determination in step 805 is negative, adjust the parameters of the text encoder based on the coding loss and enter the next iteration with the adjusted text encoder, i.e., jump back to step 801.
Step 807: if the determination result in step 805 is yes, the training is ended.
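A minimal sketch of the masked-language-model objective in steps 801-804, assuming token IDs as input, a generic text encoder, and an output head over the vocabulary; the masking ratio and the mask token ID are illustrative assumptions.

```python
import torch
import torch.nn as nn

MASK_ID = 103  # assumed id of the [MASK] token

def mlm_step(text_encoder, lm_head, token_ids, mask_prob=0.15):
    """Steps 801-804: mask tokens, encode, predict masked tokens, compute loss."""
    mask = torch.rand(token_ids.shape) < mask_prob       # step 801: random masking
    masked_input = token_ids.clone()
    masked_input[mask] = MASK_ID
    hidden = text_encoder(masked_input)                  # step 802: feature coding
    logits = lm_head(hidden)                             # (batch, seq, vocab_size)
    loss = nn.CrossEntropyLoss()(logits[mask],           # steps 803-804: loss only
                                 token_ids[mask])        # on the masked positions
    return loss
```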
Next, the adaptive pre-training process of each modality at stage S2 will be described.
In a possible embodiment, in the S2 stage, the training process of the single-mode coding sub-model of each mode may be the same as the training process in the S1 stage, but the training data set is a base video data sample set in the video domain, and therefore, for this training mode, reference may be made to the description of the corresponding parts of fig. 7 and fig. 8, and details are not repeated here.
It should be noted that, for training the text encoder, the video type label may be blended in on top of the masking objective, so that the trained text encoder is better suited to the video domain.
In another possible implementation, training can also be performed through multi-modal collaborative learning. Referring to fig. 9, a schematic flowchart of adaptive pre-training by multi-modal collaborative learning. Since each video has basic video data samples in every modality, one video corresponds to basic data samples of multiple modalities, and the processing of each video during training is similar; video A is taken as an example here.
Step 901: extract basic features from the basic data samples of each modality of video A with the corresponding single-mode coding sub-model to obtain the basic feature vectors.
Referring to fig. 10, a schematic diagram of the data processing flow in the adaptive pre-training process: the video encoder extracts basic features from the image data sample of video A to obtain the image basic feature vector; the audio encoder extracts basic features from the audio data sample of video A to obtain the audio basic feature vector; and the text encoder extracts basic features from the text data sample of video A to obtain the text basic feature vector.
Step 902: obtain the weight feature vector set of each modality from each obtained basic feature vector and the at least one attention weight matrix included in the corresponding single-mode coding sub-model.
In the embodiment of the application, each single-mode coding sub-model comprises at least one attention weight matrix, and further, according to each obtained basic feature vector and at least one attention weight matrix included in the corresponding single-mode coding sub-model, a weight feature vector set corresponding to each basic feature vector is obtained respectively; wherein each weight feature vector included in each weight feature vector set corresponds to an attention weight matrix.
Referring to fig. 10, for the image modality, the image basic feature vector may be multiplied by each attention weight matrix to obtain the corresponding image weight feature vectors, one image weight feature vector per attention weight matrix. Similarly, the audio basic feature vector is multiplied by each attention weight matrix to obtain the corresponding audio weight feature vectors, and the text basic feature vector is multiplied by each attention weight matrix to obtain the corresponding text weight feature vectors.
Fig. 11 is a flowchart illustrating the processing of a basic feature vector. The at least one attention weight matrix may comprise a query vector weight matrix W_Q, a key vector weight matrix W_K, and a value vector weight matrix W_V. A basic feature vector is multiplied by W_Q, W_K, and W_V respectively to obtain the corresponding weight feature vector set (for the image modality, the image weight feature vector set), which comprises a query vector Q, a key vector K, and a value vector V.
Step 903: obtain the single-mode feature vector of each modality from the obtained weight feature vector sets.
In the embodiment of the application, collaborative learning is performed according to the obtained weight feature vector sets to obtain single-mode feature vectors of each mode.
Specifically, for each basic feature vector, taking the image basic feature vector as an example, an attention weight value for every basic feature vector may be computed from the query vector of the image basic feature vector and the key vector of that basic feature vector; the attention weight value represents the degree of association between the video data of the modality of that basic feature vector and the video data of the image modality. Then, from the value vectors of all basic feature vectors and their attention weight values, the single-mode feature vector of the video in the image modality can be obtained.
Since the process of obtaining the single-mode feature vector is similar for each mode, the image mode is specifically taken as an example for explanation.
Referring to fig. 12, a flowchart for obtaining the single-mode feature vector of the image modality. After the weight feature vector sets of the image, text, and audio modalities are obtained, the attention weight value of each modality is computed by multiplying the query vector Q of the image modality by the key vector K of that modality (including the image modality itself); these attention weight values represent the degree of association between the video data of video A in each modality and the video data of the image modality. The single-mode feature vector of the image modality is then obtained by a weighted sum of the value vectors V of the modalities with their respective attention weight values.
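A minimal sketch of this cross-modal attention step (Q from the image modality, K and V from all three modalities). The feature dimension, the random weight matrices, and the scaled softmax used to normalise the weights are assumptions; the patent text only specifies the multiplication and weighted-sum structure.

```python
import torch
import torch.nn.functional as F

D = 256                                                   # assumed feature dimension
W_Q, W_K, W_V = (torch.randn(D, D) for _ in range(3))     # attention weight matrices

def cross_modal_vector(query_base, all_bases):
    """query_base: basic feature vector of one modality, shape (D,)
    all_bases: basic feature vectors of every modality, each of shape (D,)."""
    q = query_base @ W_Q                                  # query vector Q
    ks = torch.stack([b @ W_K for b in all_bases])        # key vectors K, (3, D)
    vs = torch.stack([b @ W_V for b in all_bases])        # value vectors V, (3, D)
    attn = F.softmax((ks @ q) / D ** 0.5, dim=0)          # attention weight per modality
    return attn @ vs                                      # weighted sum -> single-mode vector

img_b, aud_b, txt_b = torch.randn(D), torch.randn(D), torch.randn(D)
img_single_mode = cross_modal_vector(img_b, [img_b, aud_b, txt_b])
```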
Step 904: predict the basic type of video A based on the single-mode feature vector of each modality.
That is, for the image modality, the single-mode feature vector of the image modality is used to predict the basic type of video A; the audio and text modalities are handled in the same way.
Step 905: calculate the single-mode coding loss of each single-mode coding sub-model for video A based on the predicted basic type and the annotated basic type of each modality.
Specifically, based on the predicted basic type and the annotated basic type of the image modality, the video coding loss of the video encoder for video A can be obtained; likewise, the single-mode coding losses of the audio encoder and the text encoder for video A can be obtained.
Similarly, for the other videos besides video A, the same process can be applied to obtain the coding loss of each single-mode coding sub-model; then, for each single-mode coding sub-model, the coding losses over all videos are aggregated to obtain the coding loss of that sub-model.
Step 906: determine whether each single-mode coding sub-model has reached the convergence condition.
Step 907: if the determination in step 906 is negative, adjust the parameters of each single-mode coding sub-model based on its coding loss and enter the next iteration with the adjusted sub-models, i.e., jump back to step 901.
Step 908: if the determination in step 906 is positive, that is, each single-mode coding sub-model has converged, end the training.
Next, the training process of the end-to-end target multi-modal video representation model in stage S3 will be described. Referring to fig. 13, a schematic flowchart of the target business scenario training performed in stage S3. Since each video has video service data samples in every modality, one video corresponds to service data samples of multiple modalities, and the processing of each video during training is similar; video A is again taken as an example here.
Step 1301: perform feature coding on the video service data sample of each modality of video A with the corresponding single-mode coding sub-model to obtain a plurality of single-mode feature vectors, each corresponding to one modality.
The feature coding of the video service data samples of each modality of video A is the same as in the embodiment shown in fig. 9, so the specific process of step 1301 can refer to the description of that embodiment and is not repeated here.
Step 1302: determine the single-mode coding loss of each single-mode coding sub-model for video A based on the single-mode feature vector of video A output by that sub-model.
Specifically, based on the single-mode feature vector of the image modality, the predicted basic type of video A can be obtained; then, from the predicted basic type and the annotated basic type, the video coding loss of the video encoder for video A can be obtained, and likewise the single-mode coding losses of the audio encoder and the text encoder for video A.
Similarly, for the other videos besides video A, the single-mode coding loss of each single-mode coding sub-model can be obtained from the single-mode feature vectors of each video; then, for each single-mode coding sub-model, the coding losses over all videos are aggregated to obtain the single-mode coding loss of that sub-model.
Step 1303: perform feature fusion on the plurality of single-mode feature vectors of video A with the feature fusion sub-model to obtain the video feature vector of video A.
Specifically, feature fusion may be performed in any of the following ways:
(1) Perform vector splicing on the plurality of single-mode feature vectors of video A in a set order with the vector splicing layer included in the feature fusion sub-model to obtain the video feature vector of video A.
(2) Perform pooling on the plurality of single-mode feature vectors of video A with the pooling layer included in the feature fusion sub-model to obtain the video feature vector of video A.
(3) Perform a convolution operation, with a set stride, on the feature matrix composed of the plurality of single-mode feature vectors of video A with the convolution layer included in the feature fusion sub-model to obtain the video feature vector of video A.
(4) Map the plurality of single-mode feature vectors of video A with the fully connected layer included in the feature fusion sub-model to obtain the video feature vector of video A.
Step 1304: determine the video representation loss of the target multi-modal video representation model for video A based on the video feature vector of video A.
Specifically, after the video feature vector of video A is obtained, the video service category of video A can be predicted from it, and the video representation loss of the target multi-modal video representation model for video A can then be obtained from the predicted video service category and the annotated video service category.
Similarly, for the other videos besides video A, the corresponding video representation losses can be obtained from their video feature vectors, and the video representation losses of all videos are then aggregated to obtain the video representation loss of the target multi-modal video representation model.
In the embodiment of the application, the single-mode coding loss is used for representing the model loss of each single-mode coding sub-model, and the video representation loss is used for representing the overall model loss of the target multi-mode video representation model.
Step 1305: obtain the total model loss of the target multi-modal video representation model based on the single-mode coding losses and the video representation loss.
The single-mode coding losses and the video representation loss in the embodiment of the present application may be calculated with a cross-entropy loss function; of course, other loss functions, such as the 0-1 loss, may also be used. After each single-mode coding loss and the video representation loss are obtained, the total model loss may be computed by summing them directly, or by assigning a weight to each loss and performing a weighted summation.
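A short sketch of this total-loss computation; the specific weight values are illustrative assumptions, since the patent only states that either a plain sum or a weighted sum may be used.

```python
import torch

def total_model_loss(single_mode_losses, video_repr_loss, weights=None):
    """Combine the single-mode coding losses and the video representation loss.
    weights: optional dict of assumed weighting factors, e.g. {"v": 1.0, ...}."""
    if weights is None:
        return sum(single_mode_losses.values()) + video_repr_loss    # plain sum
    weighted = sum(weights[m] * loss for m, loss in single_mode_losses.items())
    return weighted + weights.get("fusion", 1.0) * video_repr_loss   # weighted sum

losses = {"v": torch.tensor(0.7), "a": torch.tensor(0.5), "t": torch.tensor(0.6)}
total = total_model_loss(losses, torch.tensor(0.4),
                         weights={"v": 1.0, "a": 0.5, "t": 0.5})
```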
Step 1306: determine whether the target multi-modal video representation model has reached the convergence condition.
Step 1307: if the determination in step 1306 is negative, adjust the parameters of each single-mode coding sub-model and of the feature fusion sub-model based on the total model loss.
In the embodiment of the application, besides the single-mode coding sub-models and the feature fusion sub-model, other parameters of the target multi-modal video representation model may also be treated as trainable parameters and adjusted together; the adjusted model then enters the next training iteration, i.e., the process jumps back to step 1301.
Step 1308: if the determination in step 1306 is positive, end the model training.
In the embodiment of the application, the multi-stage, multi-task training method combines different tasks of different modalities on data sets of different granularities to pre-train the multi-modal video representation model. The model thus gains a preliminary representation capability, which accelerates training, reduces its time cost, lowers the learning difficulty, and makes the model easier to converge on downstream tasks. Moreover, because the model is pre-trained on data sets of different granularities, its representations are more hierarchical, allowing downstream tasks to be learned from shallow to deep.
Experiments show that, compared with directly training the multi-modal model, the multi-stage multi-task training method provided by the embodiment of the application improves the effect by more than 16% by first learning from general data and in-domain data; the model clearly represents videos more accurately.
Referring to fig. 14, based on the same inventive concept, an embodiment of the present application further provides a video classification apparatus 140 based on multi-modal representation, the apparatus including:
a data obtaining unit 1401, configured to obtain data information of the target video corresponding to each modality, and input the data information of each modality to a trained target multi-modality video representation model;
the video classification unit 1402 is configured to obtain a video service category of a target video in a target service scene, where the target video is output by the target multi-modal video representation model;
the target multi-modal video representation model is obtained by performing video domain adaptive pre-training based on the base video data sample sets corresponding to the respective modes and retraining based on the video service data sample sets corresponding to the respective modes in the target service scene, wherein each base video data sample set comprises base video data samples corresponding to the same mode, and each video service data sample set comprises video service data samples corresponding to the same mode.
Optionally, the target multi-modal video representation model includes a plurality of single-modal encoding submodels and a feature fusion submodel for performing feature fusion on single-modal feature vectors output by the plurality of single-modal encoding submodels, and each single-modal encoding submodel corresponds to one mode of the video;
the apparatus further comprises a model training unit 1403 for:
adopting a basic video data sample set corresponding to each mode, and respectively carrying out adaptive pre-training on a pre-training coding sub-model corresponding to each mode to obtain a single-mode coding sub-model corresponding to each mode; each pre-training coding sub-model is obtained by performing initialization pre-training on the basis of a general data set of a corresponding mode;
performing iterative training on each obtained single-mode coding sub-model and each obtained characteristic fusion sub-model by adopting a video service data sample set corresponding to each mode until a set convergence condition is met;
and outputting the trained target multi-modal video representation model when the set convergence condition is met.
Optionally, the model training unit 1403 is specifically configured to:
for each video, the following operations are respectively executed:
for one video, respectively adopting each single-mode coding sub-model to perform feature coding on a video service data sample of a corresponding mode of the video to obtain a plurality of single-mode feature vectors, wherein each single-mode feature vector corresponds to one mode;
performing feature fusion on a plurality of single-mode feature vectors by adopting a feature fusion sub-model to obtain a video feature vector of a video;
and respectively carrying out parameter adjustment on each single-mode coding sub-model and each characteristic fusion sub-model based on the obtained video characteristic vector and the plurality of single-mode characteristic vectors corresponding to each video.
Optionally, each monomodal coding submodel includes at least one attention weight matrix, and the model training unit 1403 is specifically configured to:
extracting basic features of the video service data samples of each mode to obtain basic feature vectors corresponding to each mode;
respectively obtaining a weight characteristic vector set corresponding to each basic characteristic vector according to each obtained basic characteristic vector and at least one attention weight matrix included by the corresponding monomodal coding submodel; each weight feature vector included in each weight feature vector set corresponds to an attention weight matrix;
and obtaining a plurality of single-mode feature vectors according to the obtained weight feature vector sets.
Optionally, the at least one attention weight matrix comprises a query vector weight matrix, a key vector weight matrix, and a value vector weight matrix; model training unit 1403 is specifically configured to:
for each basic feature vector, the following operations are respectively executed: aiming at one basic feature vector, respectively obtaining a corresponding query vector, a corresponding key vector and a corresponding value vector according to the basic feature vector, a query vector weight matrix, a corresponding key vector weight matrix and a corresponding value vector weight matrix;
for each basic feature vector, the following operations are respectively performed:
respectively obtaining the attention weight value corresponding to each basic feature vector according to the query vector of the basic feature vector and the key vector of each basic feature vector; the attention weight value is used for representing the association degree of video data of a video in a mode corresponding to each basic feature vector and video data of a video in a mode corresponding to one basic feature vector;
and obtaining a single-mode feature vector of a video in a mode corresponding to the basic feature vector according to the value vector and the attention weight value corresponding to each basic feature vector.
Optionally, the model training unit 1403 is specifically configured to:
vector splicing is carried out on a plurality of single-mode feature vectors according to a set mode by adopting a vector splicing layer included by the feature fusion sub-model, and a video feature vector of a video is obtained;
pooling the plurality of single-mode feature vectors by using a pooling layer included in the feature fusion sub-model to obtain a video feature vector of a video;
performing convolution operation on a feature matrix consisting of a plurality of monomodal feature vectors by using a convolution layer included in the feature fusion sub-model by using a set step length to obtain a video feature vector of a video;
and mapping the plurality of single-mode feature vectors by adopting a full connection layer included by the feature fusion sub-model to obtain the video feature vector of one video.
Optionally, the model training unit 1403 is specifically configured to:
for each single-mode coding sub-model, the following operations are respectively executed: determining the single-mode coding loss of a single-mode coding sub-model based on the single-mode feature vectors corresponding to the videos output by the single-mode coding sub-model;
determining video representation loss of the multi-modal video representation model based on the video feature vectors corresponding to the videos respectively;
obtaining model total loss of a multi-modal video representation model based on the obtained single-modal coding loss corresponding to each single-modal coding sub-model and the video representation loss;
and adjusting parameters of each single-mode coding sub-model and the feature fusion sub-model based on the total model loss.
Optionally, the model training unit 1403 is specifically configured to:
for each video, the following operations are respectively executed:
determining a predicted video service category of a video in a target video service scene based on a video feature vector of the video;
determining a video representation loss of a video based on the obtained prediction video service class and an annotated video service type included in a service data sample of the video;
and obtaining the video representation loss of the multi-modal video representation model based on the obtained video representation loss of each video.
Optionally, one base video data sample set is any one of the following data sample sets:
a text data sample set comprising video text data samples for respective videos;
a set of image data samples comprising video image data samples of respective videos;
a set of audio data samples comprising video audio data samples of respective videos.
Optionally, the data obtaining unit 1401 is specifically configured to:
for each video, the following operations are respectively executed:
aiming at one video, text extraction is carried out on the video by adopting a text extraction method to obtain a text in the video;
splicing the text in the video, the title text of the video and the introduction text of the video to obtain a spliced text corresponding to the video;
and performing numerical processing on the spliced texts corresponding to the videos to obtain text indexes corresponding to the spliced texts, wherein each text index only corresponds to one spliced text.
The apparatus may be configured to execute the methods shown in the embodiments shown in fig. 2 to 13, and therefore, for functions and the like that can be realized by each functional module of the apparatus, reference may be made to the description of the embodiments shown in fig. 2 to 13, which is not repeated here. The model training unit 1403 is not an essential functional unit, and is shown by a dotted line in fig. 14.
With the video classification device based on multi-modal representation provided by the embodiment of the application, when the target multi-modal video representation model is trained, the model is first adaptively pre-trained for the video domain on basic video data samples and then retrained on video service data samples of the target business scenario. Pre-training on the video domain gives the model a preliminary representation capability, so that downstream video-related tasks are easier to train, the model converges faster, training time is reduced, and training efficiency is higher. In addition, the staged training process makes the learned multi-modal representation more hierarchical and richer, allowing the model to adapt to different types of downstream video classification tasks and improving their accuracy.
Referring to fig. 15, based on the same technical concept, an embodiment of the present application further provides a computer device 150, which may include a memory 1501 and a processor 1502. The computer device 150 may be, for example, the terminal device 101 or the server 102 shown in fig. 1, and when the computer device 150 is the server 102, the memory 1501 and the processor 1502 may correspond to the memory 1022 and the processor 1021 of the server 102, respectively.
The memory 1501 is used for storing computer programs executed by the processor 1502. The memory 1501 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to use of the computer device, and the like. The processor 1502 may be a Central Processing Unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 1501 and the processor 1502 is not limited in the embodiments of the present application. In the embodiment of the present application, the memory 1501 and the processor 1502 are connected by the bus 1503 in fig. 15, the bus 1503 is shown by a thick line in fig. 15, and the connection manner between other components is merely illustrative and not limited. The bus 1503 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 15, but this is not intended to represent only one bus or type of bus.
The memory 1501 may be a volatile memory, such as a random-access memory (RAM); the memory 1501 may also be a non-volatile memory, such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 1501 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1501 may also be a combination of the above memories.
A processor 1502 for executing the methods performed by the devices in the embodiments shown in fig. 2-13 when calling the computer program stored in the memory 1501.
In some possible embodiments, various aspects of the methods provided by the present application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of the methods according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device, for example, the computer device may perform the methods performed by the devices in the embodiments shown in fig. 2-13.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A method for video classification based on a multi-modal representation, the method comprising:
acquiring data information of a target video corresponding to each mode, and inputting the data information of each mode to a trained target multi-mode video representation model;
obtaining the video service category of the target video output by the target multi-modal video representation model in a target service scene;
the target multi-modal video representation model is obtained by performing video domain adaptive pre-training based on the base video data sample sets corresponding to the respective modes and retraining based on the video service data sample sets corresponding to the respective modes in the target service scene, wherein each base video data sample set comprises base video data samples corresponding to the same mode, and each video service data sample set comprises video service data samples corresponding to the same mode.
2. The method of claim 1, wherein the target multi-modal video representation model comprises a plurality of single-modal encoding submodels, and a feature fusion submodel for performing feature fusion on single-modal feature vectors output by the plurality of single-modal encoding submodels, each single-modal encoding submodel corresponding to a mode of the video;
the target multimodal video representation model is trained in the following way:
adopting the basic video data sample set corresponding to each mode to respectively perform the adaptive pre-training on the pre-training coding sub-model corresponding to each mode to obtain a single-mode coding sub-model corresponding to each mode; each pre-training coding sub-model is obtained by performing initialization pre-training on the basis of a general data set of a corresponding mode;
performing iterative training on each obtained single-mode coding sub-model and the feature fusion sub-model by adopting a video service data sample set corresponding to each mode until a set convergence condition is met;
and outputting the trained target multi-modal video representation model when a set convergence condition is met.
3. The method of claim 2, wherein each iterative training comprises the steps of:
for each video, the following operations are respectively executed:
for a video, respectively adopting each single-mode coding sub-model to perform feature coding on a video service data sample of a corresponding mode of the video to obtain a plurality of single-mode feature vectors, wherein each single-mode feature vector corresponds to one mode;
performing feature fusion on the plurality of single-mode feature vectors by using the feature fusion sub-model to obtain a video feature vector of the video;
and respectively carrying out parameter adjustment on each single-mode coding sub-model and the feature fusion sub-model based on the obtained video feature vector and the plurality of single-mode feature vectors corresponding to each video.
4. The method of claim 3, wherein each single-modality encoding sub-model includes at least one attention weight matrix;
then, respectively using each single-mode coding sub-model to perform feature coding on the video service data sample of the corresponding mode of the video, so as to obtain the plurality of single-mode feature vectors, including:
extracting basic features of the video service data samples of each mode to obtain basic feature vectors corresponding to each mode;
respectively obtaining a weight characteristic vector set corresponding to each basic characteristic vector according to each obtained basic characteristic vector and at least one attention weight matrix included by the corresponding monomodal coding submodel; each weight feature vector included in each weight feature vector set corresponds to an attention weight matrix;
and obtaining the plurality of single-mode feature vectors according to the obtained weight feature vector sets.
5. The method of claim 4, in which the at least one attention weight matrix comprises a query vector weight matrix, a key vector weight matrix, and a value vector weight matrix;
then, according to each obtained basic feature vector and at least one attention weight matrix included in the corresponding monomodal coding submodel, a set of weight feature vectors corresponding to each basic feature vector is obtained, including:
for each basic feature vector, the following operations are respectively executed: aiming at a basic feature vector, respectively obtaining a corresponding query vector, a corresponding key vector and a corresponding value vector according to the basic feature vector, the query vector weight matrix, the key vector weight matrix and the value vector weight matrix;
obtaining the plurality of single-mode feature vectors according to the obtained respective sets of weighted feature vectors, including:
for each basic feature vector, respectively performing the following operations:
respectively obtaining the attention weight value corresponding to each basic feature vector according to the query vector of the basic feature vector and the key vector of each basic feature vector; the attention weight value is used for representing the association degree between the video data of the video in the modality corresponding to each basic feature vector and the video data of the video in the modality corresponding to the basic feature vector;
and obtaining the monomodal feature vector of the video in the mode corresponding to the basic feature vector according to the value vector and the attention weight value corresponding to each basic feature vector.
6. The method of claim 3, wherein performing feature fusion on the plurality of single-mode feature vectors using the feature fusion submodel to obtain video feature vectors of the video, comprises:
vector splicing is carried out on the plurality of single-mode feature vectors according to a set mode by adopting a vector splicing layer included by the feature fusion sub-model, and a video feature vector of the video is obtained;
pooling the plurality of single-mode feature vectors by using a pooling layer included in the feature fusion sub-model to obtain a video feature vector of the video;
performing convolution operation on a feature matrix consisting of the plurality of monomodal feature vectors by using a convolution layer included in the feature fusion sub-model by using a set step length to obtain a video feature vector of the video;
and mapping the plurality of single-mode feature vectors by adopting a full connection layer included by the feature fusion sub-model to obtain the video feature vector of the video.
7. The method according to claim 3, wherein the parameter adjustment for each of the single-mode coding submodel and the feature fusion submodel based on the obtained video feature vector and the plurality of single-mode feature vectors corresponding to each of the videos respectively comprises:
for each single-mode coding sub-model, the following operations are respectively executed: determining the single-mode coding loss of a single-mode coding sub-model based on the single-mode feature vectors corresponding to the videos output by the single-mode coding sub-model;
determining video representation loss of the target multi-modal video representation model based on the video feature vectors corresponding to the videos respectively;
obtaining the total model loss of the target multi-modal video representation model based on the obtained single-modal coding loss corresponding to each single-modal coding sub-model and the video representation loss;
and adjusting parameters of each single-mode coding sub-model and the feature fusion sub-model based on the total model loss.
8. The method of claim 7, wherein determining a video representation loss for the target multi-modal video representation model based on respective corresponding video feature vectors for respective videos comprises:
for each video, the following operations are respectively executed:
determining a predicted video service category of one video in the target video service scene based on a video feature vector of the one video;
determining a video representation loss of the video based on the obtained prediction video service category and an annotation video service type included in the service data sample of the video;
and obtaining the video representation loss of the target multi-modal video representation model based on the obtained video representation loss of each video.
9. A method as claimed in any one of claims 1 to 8, wherein a base set of video data samples is any one of the following sets of data samples:
a text data sample set comprising video text data samples for respective videos;
a set of image data samples comprising video image data samples of respective videos;
a set of audio data samples comprising video audio data samples of respective videos.
10. The method of claim 9, wherein the method further comprises:
for each video, the following operations are respectively executed:
aiming at one video, text extraction is carried out on the video by adopting a text extraction method to obtain a text in the video;
splicing the text in the video with the title text and the video introduction text of the video to obtain a spliced text corresponding to the video;
and performing numerical processing on the spliced texts corresponding to the videos to obtain text indexes corresponding to the spliced texts, wherein each text index only corresponds to one spliced text.
11. An apparatus for video classification based on a multi-modal representation, the apparatus comprising:
the data acquisition unit is used for acquiring data information of a target video corresponding to each mode and inputting the data information of each mode to a trained target multi-mode video representation model;
the video classification unit is used for obtaining the video service category of the target video in the target service scene output by the target multi-modal video representation model;
the target multi-modal video representation model is obtained by performing video domain adaptive pre-training based on the base video data sample sets corresponding to the respective modes and retraining based on the video service data sample sets corresponding to the respective modes in the target service scene, wherein each base video data sample set comprises base video data samples corresponding to the same mode, and each video service data sample set comprises video service data samples corresponding to the same mode.
12. The apparatus of claim 11, wherein the target multi-modal video representation model comprises a plurality of single-modal encoding sub-models, each corresponding to one modality of video, and a feature fusion sub-model for performing feature fusion on the single-modal feature vectors output by the plurality of single-modal encoding sub-models;
the apparatus further comprises a model training unit configured to:
perform the adaptive pre-training on the pre-trained encoding sub-model of each modality by using the base video data sample set of that modality, to obtain the single-modal encoding sub-model of each modality, each pre-trained encoding sub-model being obtained by initialization pre-training based on a general data set of the corresponding modality;
iteratively train the obtained single-modal encoding sub-models and the feature fusion sub-model by using the video service data sample sets corresponding to the respective modalities until a set convergence condition is met; and
output the trained target multi-modal video representation model when the set convergence condition is met.
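A minimal sketch of the two-stage training flow in claim 12, assuming PyTorch modules; the concrete encoders, the fusion operation (a concatenation-plus-projection here), the optimizer, and the convergence test are illustrative assumptions rather than details fixed by the claim.

```python
import torch
import torch.nn as nn

class FeatureFusionSubModel(nn.Module):
    """Fuses the single-modal feature vectors into one video feature vector."""
    def __init__(self, modal_dims, fused_dim):
        super().__init__()
        self.project = nn.Linear(sum(modal_dims), fused_dim)

    def forward(self, modal_vectors):
        return self.project(torch.cat(modal_vectors, dim=-1))

def train_target_model(pretrained_encoders, base_sample_sets, service_sample_loader,
                       fusion_model, adapt_step, classify_loss, converged):
    # Stage 1: adapt each pre-trained (general-domain) encoder on the base video
    # data sample set of its own modality to obtain the single-modal encoding sub-models.
    for modality, encoder in pretrained_encoders.items():
        for batch in base_sample_sets[modality]:
            adapt_step(encoder, batch)          # modality-specific adaptive pre-training step

    # Stage 2: jointly fine-tune the single-modal encoders and the fusion sub-model
    # on the video service data samples until the set convergence condition is met.
    params = [p for enc in pretrained_encoders.values() for p in enc.parameters()]
    params += list(fusion_model.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    while not converged():
        for modal_batches, labels in service_sample_loader:
            modal_vectors = [pretrained_encoders[m](x) for m, x in modal_batches.items()]
            fused = fusion_model(modal_vectors)
            loss = classify_loss(fused, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # The trained encoders plus the fusion sub-model form the target multi-modal representation model.
    return pretrained_encoders, fusion_model
```

Here `adapt_step`, `classify_loss`, and `converged` are hypothetical callables standing in for the modality-specific pre-training objective, the service-scene classification loss, and the set convergence condition, respectively.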
13. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor,
wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 10.
14. A computer storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 10.
CN202110436918.0A 2021-04-22 2021-04-22 Video classification method, device and equipment based on multi-modal representation and storage medium Pending CN113762322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110436918.0A CN113762322A (en) 2021-04-22 2021-04-22 Video classification method, device and equipment based on multi-modal representation and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110436918.0A CN113762322A (en) 2021-04-22 2021-04-22 Video classification method, device and equipment based on multi-modal representation and storage medium

Publications (1)

Publication Number Publication Date
CN113762322A (en) 2021-12-07

Family

ID=78786903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110436918.0A Pending CN113762322A (en) 2021-04-22 2021-04-22 Video classification method, device and equipment based on multi-modal representation and storage medium

Country Status (1)

Country Link
CN (1) CN113762322A (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343936A (en) * 2021-07-15 2021-09-03 北京达佳互联信息技术有限公司 Training method and training device for video representation model
CN114332729A (en) * 2021-12-31 2022-04-12 西安交通大学 Video scene detection and marking method and system
CN114332729B (en) * 2021-12-31 2024-02-02 西安交通大学 Video scene detection labeling method and system
CN114064973A (en) * 2022-01-11 2022-02-18 人民网科技(北京)有限公司 Video news classification model establishing method, classification method, device and equipment
CN114064973B (en) * 2022-01-11 2022-05-03 人民网科技(北京)有限公司 Video news classification model establishing method, classification method, device and equipment
WO2023134296A1 (en) * 2022-01-14 2023-07-20 腾讯科技(深圳)有限公司 Classification and prediction method and apparatus, and device, storage medium and computer program product
CN114419514A (en) * 2022-01-26 2022-04-29 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN114419514B (en) * 2022-01-26 2024-04-19 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN114462539A (en) * 2022-02-10 2022-05-10 腾讯科技(深圳)有限公司 Training method of content classification model, and content classification method and device
CN115311595B (en) * 2022-06-30 2023-11-03 中国科学院自动化研究所 Video feature extraction method and device and electronic equipment
CN115311595A (en) * 2022-06-30 2022-11-08 中国科学院自动化研究所 Video feature extraction method and device and electronic equipment
CN115223020A (en) * 2022-07-20 2022-10-21 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and readable storage medium
CN115223020B (en) * 2022-07-20 2024-04-19 腾讯科技(深圳)有限公司 Image processing method, apparatus, device, storage medium, and computer program product
CN115100725B (en) * 2022-08-23 2022-11-22 浙江大华技术股份有限公司 Object recognition method, object recognition apparatus, and computer storage medium
CN115100725A (en) * 2022-08-23 2022-09-23 浙江大华技术股份有限公司 Object recognition method, object recognition apparatus, and computer storage medium
CN115100582A (en) * 2022-08-25 2022-09-23 有米科技股份有限公司 Model training method and device based on multi-mode data
CN115100582B (en) * 2022-08-25 2022-12-02 有米科技股份有限公司 Model training method and device based on multi-mode data
CN115984302A (en) * 2022-12-19 2023-04-18 中国科学院空天信息创新研究院 Multi-mode remote sensing image processing method based on sparse mixed expert network pre-training
CN115984302B (en) * 2022-12-19 2023-06-06 中国科学院空天信息创新研究院 Multi-mode remote sensing image processing method based on sparse hybrid expert network pre-training
CN117036833A (en) * 2023-10-09 2023-11-10 苏州元脑智能科技有限公司 Video classification method, apparatus, device and computer readable storage medium
CN117036833B (en) * 2023-10-09 2024-02-09 苏州元脑智能科技有限公司 Video classification method, apparatus, device and computer readable storage medium
CN117237857B (en) * 2023-11-13 2024-02-09 腾讯科技(深圳)有限公司 Video understanding task execution method and device, storage medium and electronic equipment
CN117237857A (en) * 2023-11-13 2023-12-15 腾讯科技(深圳)有限公司 Video understanding task execution method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN113762322A (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN111930992B (en) Neural network training method and device and electronic equipment
CN111581437A (en) Video retrieval method and device
CN110609955B (en) Video recommendation method and related equipment
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN114298121A (en) Multi-mode-based text generation method, model training method and device
CN111324773A (en) Background music construction method and device, electronic equipment and storage medium
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN113392265A (en) Multimedia processing method, device and equipment
CN113064968A (en) Social media emotion analysis method and system based on tensor fusion network
Khurram et al. Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics
CN115424013A (en) Model training method, image processing apparatus, and medium
CN113657272B (en) Micro video classification method and system based on missing data completion
Gao et al. Generalized pyramid co-attention with learnable aggregation net for video question answering
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN116663523A (en) Semantic text similarity calculation method for multi-angle enhanced network
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN113761933A (en) Retrieval method, retrieval device, electronic equipment and readable storage medium
CN115705705A (en) Video identification method, device, server and storage medium based on machine learning
CN117540007B (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination