CN114443899A - Video classification method, device, equipment and medium

Video classification method, device, equipment and medium

Info

Publication number
CN114443899A
CN114443899A
Authority
CN
China
Prior art keywords
classification
features
sample
video
granularity
Prior art date
Legal status
Pending
Application number
CN202210108236.1A
Other languages
Chinese (zh)
Inventor
刘刚
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210108236.1A
Publication of CN114443899A
Legal status: Pending

Classifications

    • G06F16/75 Information retrieval of video data: clustering; classification
    • G06F16/7834 Retrieval of video data using metadata automatically derived from the content, using audio features
    • G06F16/7844 Retrieval of video data using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7847 Retrieval of video data using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F16/7867 Retrieval of video data using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F18/253 Pattern recognition: fusion techniques of extracted features

Abstract

The application discloses a video classification method, apparatus, device and medium, relating to the field of machine learning. The method comprises: acquiring n multi-modal features of a target video, the n multi-modal features comprising at least two of image features, audio features and text features, and n being a positive integer greater than 1; fusing the n multi-modal features to obtain a fused feature; classifying the fused feature according to m classification granularities to obtain an overall classification feature and m granularity classification features of the target video, the m classification granularities representing m interrelated granularities used for classification under a target dimension, and the m granularity classification features representing the classification features corresponding to the m classification granularities; and obtaining m levels of classification labels of the target video according to the overall classification feature and the m granularity classification features, the m levels of classification labels being video labels arranged according to the m classification granularities. The method and apparatus can fuse multi-modal information, making the video labels more accurate.

Description

Video classification method, device, equipment and medium
Technical Field
The present application relates to the field of machine learning, and in particular, to a video classification method, apparatus, device, and medium.
Background
After a video is uploaded to a video platform, it is tagged with a classification label; for example, the classification label of a video may be "Technology - Smartphone - Domestic phone".
In the related art, a machine learning model for video classification is trained in advance. Image features are extracted from the target video and input into the machine learning model, which processes the image features and outputs a classification label for the target video.
However, the related art only uses information related to the image features of the target video, so the finally output classification label may be inaccurate.
Disclosure of Invention
The embodiments of the present application provide a video classification method, apparatus, device and medium. The method can determine the classification labels of a target video from multiple modalities by combining the multi-modal features of the target video. The technical solution is as follows:
according to an aspect of the present application, there is provided a video classification method, including:
acquiring n multi-modal features of a target video, wherein the n multi-modal features comprise at least two of image features, audio features and text features, and n is a positive integer greater than 1;
fusing the n multi-modal features to obtain a fused feature;
classifying the fused feature according to m classification granularities to obtain an overall classification feature and m granularity classification features of the target video, wherein the m classification granularities represent m interrelated granularities used for classification under a target dimension, and the m granularity classification features represent the classification features corresponding to the m classification granularities;
and obtaining m levels of classification labels of the target video according to the overall classification feature and the m granularity classification features, wherein the m levels of classification labels are video labels arranged according to the m classification granularities.
According to another aspect of the present application, there is provided a video classification apparatus, including:
the system comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is used for acquiring n multi-modal features of a target video, the n multi-modal features comprise at least two of image features, audio features and text features, and n is a positive integer greater than 1;
the fusion module is used for fusing the n multi-modal characteristics to obtain fused characteristics;
the classification module is used for classifying the fusion features according to m classification granularity to obtain integral classification features and m granularity classification features of the target video, wherein the m classification granularity is used for representing m interrelated granularity used for classification under a target dimension, the m granularity classification features are used for representing classification features corresponding to the m classification granularity, and m is a positive integer;
the classification module is further configured to obtain m-level classification labels of the target video according to the overall classification features and the m granularity classification features, where the m-level classification labels are video labels arranged according to the m classification granularities.
According to one aspect of the present application, there is provided a training method of a video classification model, where the video classification model includes a feature extraction network layer, a feature fusion network layer, and a classification network layer, the method includes:
obtaining a sample training set, wherein the sample training set comprises a sample video and a real label corresponding to the sample video;
calling the feature extraction network layer, carrying out data processing on the sample video, and outputting n sample multi-modal features, wherein the n sample multi-modal features comprise at least two of sample image features, sample audio features and sample text features, and n is a positive integer greater than 1;
calling the feature fusion network layer, carrying out fusion processing on the n sample multi-modal features, and outputting sample fusion features;
calling the classification network layer, respectively carrying out data processing on the n sample multi-modal characteristics and the sample fusion characteristics, and respectively outputting n sample granularity classification labels and sample fusion classification labels;
respectively calculating the cross entropy between the n sample granularity classification labels and the real labels to obtain n granularity cross entropies; calculating the cross entropy between the sample fusion classification label and the real label to obtain a fusion cross entropy;
and training the video classification model according to the n granularity cross entropies and the fusion cross entropy.
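For illustration only, the following is a minimal PyTorch-style sketch of how a training objective could be assembled from the n granularity cross entropies and the fusion cross entropy described above; all function and variable names (e.g. modality_logits, fusion_logits) and the equal weighting of the loss terms are assumptions, not the patent's implementation.
```python
# Hypothetical sketch of the training objective described above (not the patent's code).
import torch
import torch.nn.functional as F

def classification_loss(modality_logits, fusion_logits, true_label):
    """modality_logits: list of n tensors (batch, num_classes), one head per sample multi-modal feature
    fusion_logits:   tensor (batch, num_classes) predicted from the sample fused feature
    true_label:      tensor (batch,) of ground-truth class indices (the real labels)"""
    # n granularity cross entropies, one per modality head
    granularity_ce = [F.cross_entropy(logits, true_label) for logits in modality_logits]
    # fusion cross entropy between the sample fusion classification label and the real label
    fusion_ce = F.cross_entropy(fusion_logits, true_label)
    # equal weighting of all cross-entropy terms is an assumption
    return sum(granularity_ce) + fusion_ce
```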
According to an aspect of the present application, there is provided a training apparatus for a video classification model, the video classification model including a feature extraction network layer, a feature fusion network layer, and a classification network layer, the apparatus including:
the system comprises a sample acquisition module, a real label generation module and a real label analysis module, wherein the sample acquisition module is used for acquiring a sample training set, and the sample training set comprises a sample video and a real label corresponding to the sample video;
the sample feature extraction module is used for calling the feature extraction network layer, performing data processing on the sample video and outputting n sample multi-modal features, wherein the n sample multi-modal features comprise at least two of sample image features, sample audio features and sample text features, and n is a positive integer greater than 1;
the sample fusion module is used for calling the feature fusion network layer, carrying out fusion processing on the n sample multi-modal features and outputting sample fusion features;
the sample classification module is used for calling the classification network layer, respectively carrying out data processing on the n sample multi-modal characteristics and the sample fusion characteristics, and respectively outputting n sample granularity classification labels and sample fusion classification labels;
the training module is used for respectively calculating the cross entropy between the n sample granularity classification labels and the real labels to obtain n granularity cross entropies; calculating the cross entropy between the sample fusion classification label and the real label to obtain a fusion cross entropy;
the training module is further configured to train the video classification model according to the n granularity cross entropies and the fusion cross entropy.
According to another aspect of the present application, there is provided a computer device including: a processor and a memory, the memory having stored therein at least one program that is loaded into and executed by the processor to implement the video classification method, or the training method of the video classification model, as described above.
According to another aspect of the present application, there is provided a computer storage medium having at least one program code stored therein, the program code being loaded and executed by a processor to implement the video classification method, or the training method of the video classification model, as described above.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and executes the computer instructions, so that the computer device executes the video classification method or the training method of the video classification model as described above.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the method fuses the multi-modal characteristics of the target video to obtain fused characteristics, and determines the classification label of the target video according to the fused characteristics. The method can mine finer granularity information and deepen the understanding and extraction of the video content through multi-modal characteristics. When the fusion features are classified, the dependency relationship and constraint information among different modalities are fully utilized from the overall perspective, and the accuracy of video classification results is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic block diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 is a block diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a video classification model provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a video classification model provided by an exemplary embodiment of the present application;
FIG. 5 is a flowchart illustrating a video classification method according to an exemplary embodiment of the present application;
FIG. 6 is a flowchart illustrating a video classification method according to an exemplary embodiment of the present application;
FIG. 7 is a block diagram of a classification module provided in an exemplary embodiment of the present application;
FIG. 8 is a flowchart illustrating a video classification method according to an exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of a fusion module provided in an exemplary embodiment of the present application;
FIG. 10 is a schematic flow chart diagram of a feature fusion method provided in an exemplary embodiment of the present application;
FIG. 11 is a schematic flow chart diagram of a feature fusion method provided in an exemplary embodiment of the present application;
FIG. 12 is a flowchart illustrating a method for training a video classification model according to an exemplary embodiment of the present application;
FIG. 13 is a schematic illustration of calculating relative entropy provided by an exemplary embodiment of the present application;
FIG. 14 is a flowchart illustrating a method for training a video classification model according to an exemplary embodiment of the present application;
FIG. 15 is a flowchart illustrating a video recommendation method according to an exemplary embodiment of the present application;
FIG. 16 is a flowchart illustrating a video recommendation method according to an exemplary embodiment of the present application;
fig. 17 is a schematic structural diagram of a video classification apparatus according to an exemplary embodiment of the present application;
FIG. 18 is a schematic structural diagram of an apparatus for training a video classification model according to an exemplary embodiment of the present application;
fig. 19 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms referred to in the embodiments of the present application are described:
artificial Intelligence (AI): the method is a theory, method, technology and application system for simulating, extending and expanding human intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision technology (Computer Vision, CV): computer vision is a science for researching how to make a machine look, and in particular, it is a science for using a camera and a computer to replace human eyes to make machine vision of identifying, tracking and measuring target, and further making image processing, so that the computer processing becomes an image more suitable for human eyes observation or transmitting to an instrument for detection. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can capture information from images or multidimensional data. The computer vision technology generally includes technologies such as image processing, image Recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior Recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning, map construction, and the like, and also includes common biometric technologies such as face Recognition, fingerprint Recognition, and the like.
Natural Language Processing (NLP): is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
Machine Learning (ML): the method is a multi-field cross discipline and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.
Relative Entropy: also known as Kullback-Leibler (KL) divergence or information divergence, it is an asymmetric measure of the difference between two probability distributions. The relative entropy is equal to the difference between the cross entropy of the two distributions and the information entropy (Shannon entropy) of the true distribution.
Cross Entropy: used to measure the difference between two probability distributions. Cross entropy can be used as a loss function in machine learning: p denotes the distribution of the real labels, q denotes the distribution predicted by the trained model, and the cross entropy measures how close q is to p.
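As a concrete illustration of the two quantities defined above, the short NumPy sketch below computes the cross entropy H(p, q) and the relative entropy D_KL(p || q) for two small discrete distributions; the specific numbers are chosen only for demonstration.
```python
# Illustrative only: cross entropy and KL divergence for two discrete distributions.
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # distribution of the real labels
q = np.array([0.5, 0.3, 0.2])   # distribution predicted by the model

entropy_p     = -np.sum(p * np.log(p))       # Shannon entropy H(p)
cross_entropy = -np.sum(p * np.log(q))       # H(p, q), usable as a loss
kl_divergence = np.sum(p * np.log(p / q))    # D_KL(p || q) = H(p, q) - H(p)

assert np.isclose(kl_divergence, cross_entropy - entropy_p)
```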
Mel-frequency spectrum: a spectrogram obtained by converting the frequency scale (in Hertz, Hz) of an ordinary spectrogram to the mel frequency scale. The mapping relationship is: Mel(f) = 2595 · log10(1 + f/700), where f is the frequency in Hertz.
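The mapping above can be written directly in code; the tiny sketch below only evaluates the stated formula.
```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Mel(f) = 2595 * log10(1 + f / 700), with f in Hertz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))  # approximately 1000, i.e. 1000 Hz maps to roughly 1000 mel
```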
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Fig. 1 shows a schematic structural diagram of a computer system provided in an exemplary embodiment of the present application. The computer system 100 includes: a terminal 120 and a server 140.
The terminal 120 has an application program related to video playing installed thereon. The application program may be an applet within an App (Application), a dedicated application program, or a web client. The terminal 120 includes at least one of a content provider and a content consumer, where the content provider is a terminal that provides videos and the content consumer is a terminal that plays videos. Illustratively, a user plays a video on the terminal 120. The terminal 120 is at least one of a smartphone, a tablet computer, an e-book reader, an MP3 player, an MP4 player, a laptop computer and a desktop computer.
The terminal 120 is connected to the server 140 through a wireless network or a wired network.
The server 140 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. The server 140 is used to provide background services for the video playing application program and send the video and information related to the video to the terminal 120. Alternatively, the server 140 undertakes primary computational tasks and the terminal 120 undertakes secondary computational tasks; alternatively, the server 140 undertakes the secondary computing work and the terminal 120 undertakes the primary computing work; alternatively, both the server 140 and the terminal 120 employ a distributed computing architecture for collaborative computing.
As shown in fig. 2, in an alternative design of the present application, the computer system includes a content production end 201, an uplink and downlink content interface server 202, a dispatch center server 203, a content database 204, a video classification model 205, a video classification service 206, a content deduplication service 207, a content distribution export service 208, a content consumption end 209, a manual review system 210, a content storage service 211, a download file system 212, and a video content frame extraction and audio separation component 213.
The content provided by the content production end 201 includes at least one of PGC (Professionally Generated Content), UGC (User Generated Content), MCN (Multi-Channel Network) and PUGC (Professional User Generated Content). The content production end 201 provides videos to the uplink and downlink content interface server 202 through a mobile-end interface API (Application Programming Interface) system or a background interface API system. Optionally, the content production end 201 obtains the interface address of the uplink and downlink content interface server 202 by communicating with it, and uploads the video to the uplink and downlink content interface server 202 through that interface address.
The uplink and downlink content interface server 202 directly communicates with the content production end 201, writes the meta information of the video provided by the content production end 201 into the content database 204, and submits the video content to the dispatch center server 203. The meta information of the video comprises at least one of file size, cover picture link, code rate, file format, title, release time and author of the video.
The dispatch center server 203 is responsible for the entire scheduling process of the video stream. Illustratively, the dispatch center server 203 is configured to receive the ingested content via the uplink and downlink content interface server 202 and obtain the meta information of the video from the content database 204. Illustratively, the dispatch center server 203 is also used to dispatch the manual review system 210 and other processing systems, controlling the order and priority of dispatching. Illustratively, the dispatch center server 203 is further configured to invoke the content deduplication service 207 to filter out unnecessary duplicate or similar content; for content that is not filtered out as duplicate, the service outputs a content similarity score and a similarity relationship chain so that the system can disperse similar content. Illustratively, after invoking the manual review system 210, the dispatch center server 203 also enables recommendation of videos to the content consumption end 209 through the content distribution export service 208 (e.g., a recommendation engine, a search engine or an operator). Illustratively, the dispatch center server 203 is also responsible for communicating with the video classification service to perform multi-level classification and labeling of videos.
The content database 204 stores the meta information of videos. For example, the uplink and downlink content interface server 202 may perform a transcoding operation on a video after receiving it, and store the meta information of the video in the content database 204 after transcoding is completed. Optionally, the content database 204 also stores the results of manual review by the manual review system 210. Optionally, the content database 204 also stores the deduplication results of the content deduplication service 207.
The video classification model 205 is used to generate video tags at different granularities using the multi-modal features of the video. Illustratively, a label for a video is "Technology - Smartphone - Domestic phone".
The video classification service 206 wraps the video classification model 205 as a service and communicates with the dispatch center server 203 on its behalf.
The content deduplication service 207 is used to exclude duplicate uploaded videos. Optionally, the content deduplication service 207 vectorizes an uploaded video to obtain an uploaded video vector, and compares the similarity between the uploaded video vector and existing video vectors through a vector index. The content deduplication service 207 enables the deduplication work to be engineered and parallelized, preventing duplicate videos from being enabled.
The content distribution export service 208 is configured to provide the index information of the video to the content consumption end 209, so that the content consumption end 209 can download the streaming media information of the video and play the video through a video player.
The content consumer 209 is used to obtain video from the content distribution outlet service 208. Optionally, the content consumption end 209 obtains different types of videos and the display content of different video channels according to the content distribution export service 208. Alternatively, the content consumption end 209 downloads the video directly from the content storage service 211.
The manual review system 210 is used for performing secondary review on the content of the video. Illustratively, the manual review system 210 is a system developed based on a web database. Because the video classification model 205 can have errors, the accuracy and efficiency of the classification labels of the videos are improved by performing secondary manual review processing.
The content storage service 211 is used for storing the video uploaded by the content production end 201. Optionally, the content storage service 211 comprises a data source of an internal service and a data source of an external service, and the data sources of the internal service and the external service are separately deployed to avoid mutual influence. Optionally, the Content storage service 211 performs distributed caching by a CDN (Content Delivery Network) acceleration server to realize acceleration.
The download file system 212 is used to control the speed and progress of the video download. Optionally, the download file system is a group of parallel servers, and comprises related task scheduling clusters and distribution clusters.
The video content frame extraction and audio separation component 213 is used to sample frames from the video and to separate and use its audio information. Illustratively, when the video downloaded by the download file system 212 is provided to the video classification model 205, the video content frame extraction and audio separation component 213 takes the key frames of the video or calls a TSN (Temporal Segment Network) for sampling.
Fig. 3 shows a schematic structural diagram of a video classification model provided in an exemplary embodiment of the present application. The video classification model 300 includes a feature extraction module 301, a feature fusion module 302, and a classification module 303.
The feature extraction module 301 is configured to extract the multi-modal features of the target video. The feature extraction module 301 includes a first feature extraction network, a second feature extraction network, a third feature extraction network, ..., and an n-th feature extraction network, i.e., n feature extraction network layers in total, where each feature extraction network layer extracts features of a different modality of the target video. For example, the first feature extraction network is used to extract image features of the target video, and the second feature extraction network is used to extract audio features of the target video.
The feature fusion module 302 is used to fuse the multi-modal features of the target video. The input of the feature fusion module 302 is the multi-modal features output by the feature extraction module 301, and the output is the fused feature 305. After the multi-modal features of the target video are fused, the fused feature 305 is obtained. Optionally, an optimization module 304 is invoked to sequentially perform compression processing, activation processing and context-gating processing on the fused result to obtain the fused feature 305.
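A minimal sketch of what such an optimization module could look like is given below, assuming a squeeze-excitation style compression/activation step followed by context gating; the layer sizes and all names in the code are assumptions for illustration, not the patent's implementation.
```python
import torch
import torch.nn as nn

class FusionOptimization(nn.Module):
    """Hypothetical sketch of the 'compression -> activation -> context gating' step."""
    def __init__(self, dim: int, bottleneck: int = 256):
        super().__init__()
        self.compress = nn.Linear(dim, bottleneck)   # compression
        self.activate = nn.ReLU()                    # activation
        self.expand   = nn.Linear(bottleneck, dim)
        self.gate     = nn.Linear(dim, dim)          # context gating weights

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        x = self.expand(self.activate(self.compress(fused)))
        gate = torch.sigmoid(self.gate(x))           # per-dimension gate in [0, 1]
        return x * gate                              # context-gated fused feature
```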
The classification module 303 is configured to determine the labels of the target video at different granularities according to the fused feature 305. The input of the classification module 303 is the fused feature 305, and the output is the labels of the target video at different granularities. Optionally, taking the output of classification labels at three granularities as an example, the classification module 303 includes a cascaded first-level label classifier 306, second-level label classifier 307 and third-level label classifier 308, whose output classification labels are arranged from coarse granularity to fine granularity. The first-level label classifier 306 is called to process the fused feature 305 and output a first-level class classification feature, and the first-level class label is determined from the first-level class classification feature. The second-level label classifier 307 is called to process the fused feature 305 and the first-level class classification feature and output a second-level class classification feature, and the second-level class label is determined from the second-level class classification feature. The third-level label classifier 308 is called to process the fused feature 305, the second-level class classification feature and the hidden-layer feature of the second-level label classifier 307, and output a third-level class classification feature, and the third-level class label is determined from the third-level class classification feature. Illustratively, the classification module 303 outputs classification labels at three granularities such as "Technology - Smartphone - Domestic phone".
In the following embodiments, taking the extraction of image features, audio features and text features as an example for explanation, fig. 4 shows a structural schematic diagram of a video classification model provided in an exemplary embodiment of the present application. The video classification model 400 includes an image feature extraction module 401, an audio feature extraction module 402, a text feature extraction module 403, a feature fusion module 302, and a classification module 303.
The image feature extraction module 401 is configured to extract the image features of the entire target video. The input of the image feature extraction module is the target video 404, and the output is the image features 407. In an alternative design, target video frames are extracted from the target video 404 by a sparse sampling operation; the image feature extraction network layer 405 is called to process the target video frames and output the video-frame features corresponding to the target video frames; and the image feature fusion network layer 406 is called to fuse the video-frame features from the perspective of image features, obtaining the image features 407 of the target video. Illustratively, the target video frames are extracted from the target video 404 through a TSN network. Illustratively, the image feature extraction network layer 405 is ResNet (residual network) or Xception (an Inception-based network). Illustratively, the image feature fusion network layer 406 is a NeXtVLAD network (an improved VLAD network that adds non-linear parameters to the VLAD layer; VLAD refers to Vector of Locally Aggregated Descriptors). The NeXtVLAD network can decompose high-dimensional video-frame features into a group of low-dimensional video-frame features and aggregate them into low-dimensional image features.
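The image branch described above can be sketched roughly as follows: frames are sparsely sampled, passed through a ResNet backbone, and the per-frame features are aggregated. For brevity the sketch uses uniform sparse sampling in place of a full TSN and mean pooling in place of NeXtVLAD aggregation; these substitutions, and all names in the code, are illustrative assumptions.
```python
import torch
import torchvision.models as models

def sparse_sample(frames: torch.Tensor, num_segments: int = 8) -> torch.Tensor:
    """frames: (T, 3, H, W), assumed already normalized. Uniformly pick one frame per segment."""
    t = frames.shape[0]
    idx = torch.linspace(0, t - 1, steps=num_segments).long()
    return frames[idx]

def extract_image_feature(frames: torch.Tensor) -> torch.Tensor:
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled features
    backbone.eval()
    with torch.no_grad():
        per_frame = backbone(sparse_sample(frames))   # (num_segments, 2048)
    # Mean pooling as a simple stand-in for NeXtVLAD aggregation.
    return per_frame.mean(dim=0)                      # (2048,)
```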
The audio feature extraction module 402 is configured to extract the audio features of the entire target video. The input of the audio feature extraction module is a spectrogram 408 of the audio of the target video, and the output is the audio features 412. In an alternative design, the spectrogram 408 of the target video is converted into a mel spectrogram 409; an audio feature extraction network is called to process the mel spectrogram 409 and output the audio features 412 of the target video. Since the mel spectrogram 409 contains non-overlapping spectra of different frequency bands, the output audio features cover multiple frequency bands, so the audio feature fusion network 411 needs to be called to fuse the audio features of the multiple frequency bands to obtain the final audio features 412. Illustratively, the audio feature extraction network is a VGGish network. Illustratively, the audio feature fusion network 411 is a NeXtVLAD network; similarly, in this case the audio feature fusion network 411 in the audio feature extraction module 402 decomposes the high-dimensional audio features of different frequency bands into a group of low-dimensional features and aggregates them into low-dimensional audio features representing the entire target video. Optionally, the 16 kHz audio of the first 10 minutes of the target video is selected, a short-time Fourier transform is performed on the audio with a 25 ms Hamming window and a 10 ms frame shift to obtain the spectrogram 408, and the spectrogram is then mapped into a 64-order mel filter bank to obtain the mel spectrogram 409. The mel spectrogram 409 is divided into 960 ms examples with no overlap between them; each example is composed of 10 ms frames, and each frame contains 64 mel frequency bands.
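A rough sketch of the audio preprocessing described above (short-time Fourier transform with a 25 ms Hamming window and 10 ms hop, a 64-band mel filter bank, 960 ms examples) is shown below using librosa; parameters beyond those stated in the text, and the names used, are assumptions, and the VGGish embedding step is only indicated by a comment.
```python
import numpy as np
import librosa

def audio_to_mel_examples(path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr, duration=600)        # first 10 minutes at 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),        # 25 ms Hamming window
        hop_length=int(0.010 * sr),   # 10 ms frame shift
        window="hamming",
        n_mels=64,                    # 64 mel frequency bands
    )
    log_mel = np.log(mel + 1e-6).T                        # (num_frames, 64)
    frames_per_example = 96                               # 96 x 10 ms = 960 ms, no overlap
    n = log_mel.shape[0] // frames_per_example
    examples = log_mel[: n * frames_per_example].reshape(n, frames_per_example, 64)
    # Each 960 ms example would then be fed to a VGGish-style audio feature extractor.
    return examples
```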
The text feature extraction module 403 is configured to extract the text features of the entire target video. The input of the text feature extraction module is the title and text content 413, and the output is the text features 416. In an alternative design, the text content in the target video is extracted; the text feature extraction network layer 415 is called to process the text content and the title of the target video and output the text features 416 of the target video. Illustratively, the text feature extraction network layer 415 is a BERT (Bidirectional Encoder Representations from Transformers) model pre-trained on a large-scale information-flow text corpus. In practice, the text content of the target video needs to be provided because some videos have no title or the title conveys insufficient information. Optionally, text recognition is performed on the content of the target video through OCR (Optical Character Recognition) to obtain the text content. However, text content obtained by OCR has some problems, for example: OCR recognition is inaccurate during scene switching, OCR text at fixed positions needs to be removed, spoken-caption OCR needs to be retained, and news-ticker OCR needs to be deleted. Optionally, the text pre-processing network layer 414 is invoked to pre-process the text content, where the pre-processing includes, but is not limited to, filtering single-character/pure-number/pure-letter OCR results, filtering OCR results whose text position offset between two adjacent frames is small and whose text repetition rate is high, and filtering OCR results whose text lies at the bottom of the picture or is too small.
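The text branch could be sketched as follows with the Hugging Face transformers library, assuming a generic Chinese BERT checkpoint rather than the in-house model pre-trained on information-flow corpora described above; the checkpoint name and the pooling choice are assumptions.
```python
import torch
from transformers import BertTokenizer, BertModel

def extract_text_feature(title: str, ocr_text: str) -> torch.Tensor:
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # stand-in checkpoint
    model = BertModel.from_pretrained("bert-base-chinese")
    model.eval()
    inputs = tokenizer(title, ocr_text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] hidden state as the text feature of the whole video.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)   # (768,)
```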
It should be noted that the image feature extraction module 401, the audio feature extraction module 402 and the text feature extraction module 403 all belong to feature extraction modules. In this embodiment there are at least two feature extraction modules, and there may be more.
The feature fusion module 302 is used to fuse the features of the target video. The feature fusion module 302 fuses the image features 407, the audio features 412 and the text features 416 to obtain the fused feature 305.
The classification module 303 is configured to determine the labels of the target video at different granularities according to the fused feature 305.
Fig. 5 is a flowchart illustrating a video classification method according to an exemplary embodiment of the present application. The method may be performed by the terminal 120 or the server 140 or other computer device shown in fig. 1, the method comprising:
step 502: n multi-modal features of the target video are obtained.
The target video may be any video. The target video may be a local video stored by the computer device, or a video downloaded from a network, or a video provided by another computer device, which is not limited in this embodiment of the present application.
The multi-modal features refer to corresponding features of the target video in different modalities. The multi-modal features include at least two features. In an embodiment of the application, the n multi-modal features comprise at least two of image features, audio features and text features, and n is a positive integer greater than 1.
In an alternative design of the embodiments of the present application, a feature extraction network layer is used to obtain multi-modal features. Illustratively, the image features are obtained using an image feature extraction network layer; acquiring audio features by using an audio feature extraction network layer; the text feature extraction network layer is used to obtain text features.
In the case that the n multi-modal features include image features, target video frames are extracted from the target video; an image feature extraction network is called to process the target video frames and output the video-frame features of the target video frames; and the video-frame features are fused to obtain the image features of the target video. The target video frames are video frames extracted from the target video according to a preset frame extraction strategy. Illustratively, a TSN network is invoked to extract the target video frames from the target video. Optionally, the image feature extraction network is ResNet or Xception.
In the case that the n multi-modal features include audio features, a mel spectrogram of the target video is acquired; and an audio feature extraction network is called to process the mel spectrogram and output the audio features of the target video. Optionally, a VGGish network is called to process the mel spectrogram and output the audio features of the target video.
In the case that the n multi-modal features include text features, the text content in the target video is extracted; and a text feature extraction network is called to process the text content and the title of the target video and output the text features of the target video. In an alternative design, text recognition is performed on the content of the target video through OCR to obtain the text content. In an alternative design, a BERT model pre-trained on a large-scale information-flow text corpus is called to process the text content and the title of the target video and output the text features of the target video.
Step 504: and fusing the n multi-modal characteristics to obtain fused characteristics.
Fused features refer to features that contain n multi-modal features.
Optionally, the n multi-modal features are fused by element-wise multiplication of the feature vectors.
Step 506: and classifying the fusion features according to the m classification particle sizes to obtain the whole classification features and m particle size classification features of the target video.
The m classification granularity is used for representing m interrelated granularity for classification in a target dimension, and m is a positive integer. The target dimension refers to a dimension for classifying the fusion features, and the obtained m-type particle size classification features are all features corresponding to the target dimension. For example, 3 granularity classification features are obtained, and the 3 granularity classification features all correspond to science and technology dimensions. Optionally, the target dimension is determined from the video content of the target video. In an alternative design, the target dimension is the largest of the m classification granularities.
The m granularity classification features are used to represent classification features corresponding to the m classification granularities. For example, assuming that there are 3 classification granularities, the target video is classified, and 3 classification granularity features are obtained. The classification granularity characteristic 1 corresponds to a technology and is a characteristic with the largest granularity, the classification granularity characteristic 2 corresponds to a mobile phone and is a characteristic with a medium granularity, and the classification granularity characteristic 3 corresponds to a domestic mobile phone and is a characteristic with the smallest granularity.
The overall classification feature is a feature that includes m granular classification features. The overall classification characteristic may include a complete m granular classification characteristics, that is, may include a partial characteristic of the m granular classification characteristics. Illustratively, the overall classification feature extracts a part of features from each of the m particle size classification features, and the extracted features are fused to obtain the overall classification feature.
In an optional design, m cascaded tag classifiers are called to perform data processing on the fusion features to obtain m granularity classification features. And fusing hidden layer characteristics of the m cascaded label classifiers to obtain the overall classification characteristic. Optionally, the m tag classifiers belong to a fully connected feed-forward network.
Step 508: and obtaining m-level classification labels of the target video according to the overall classification features and the m granularity classification features.
The applicant considers that the finally obtained classification label with a certain granularity should comprehensively consider the overall classification characteristics of the whole classification system, and the m-level classification labels of the target video are determined by taking the overall classification characteristics as reference.
In an optional design, the overall classification features are respectively combined with m granularity classification features to obtain m target classification features; and determining m levels of classification labels of the target video according to the m target classification features. The m target classification features are used to determine classification labels at m classification granularities.
In an alternative design, the target classification feature includes probability values of a plurality of candidate tags, and the candidate tag corresponding to the maximum probability value in the target classification feature is determined as the classification tag. For example, if the target classification features are {0.1, 0.5, 0.35, 0.05}, the candidate label corresponding to 0.1 is "science and technology", the candidate label corresponding to 0.5 is "life", the label corresponding to 0.35 is "politics", and the label corresponding to 0.05 is "literature", the candidate label "life" is determined as the required classification label.
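In code form, the selection of the classification label described above reduces to an argmax over the candidate-label probabilities; the snippet below just replays the worked example with illustrative values.
```python
import numpy as np

candidate_labels = ["science and technology", "life", "politics", "literature"]
target_feature = np.array([0.1, 0.5, 0.35, 0.05])   # probability per candidate label
classification_label = candidate_labels[int(np.argmax(target_feature))]
print(classification_label)   # "life"
```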
The m-level classification labels are video labels arranged according to the m classification granularities. Illustratively, the 3-level classification label of a target video is "Life - Food preparation - Chinese cuisine".
To sum up, the present embodiment fuses the multi-modal features of the target video to obtain a fused feature, and determines the classification label of the target video according to the fused feature. The method can mine finer granularity information and deepen the understanding and extraction of the video content through multi-modal characteristics. When the fusion features are classified, the dependency relationship and constraint information among different modalities are fully utilized from the overall perspective, and the accuracy of video classification results is improved.
Fig. 6 is a flowchart illustrating a video classification method according to an exemplary embodiment of the present application. The method may be performed by the terminal 120 or the server 140 or other computer device shown in fig. 1, the method comprising:
step 601: n multi-modal features of the target video are obtained.
The target video may be any video. The target video may be a local video stored by the computer device, or a video downloaded from a network, or a video provided by another computer device, which is not limited in this embodiment of the present application.
The multi-modal features refer to corresponding features of the target video in different modalities. The multi-modal features include at least two features.
Step 602: based on the attention mechanism, n weights corresponding to the n multi-modal features are determined.
The target video can be characterized by different modalities from different angles, and the contribution of different modalities to the classification result is inconsistent. For example, image features provide a larger amount of information, while audio features provide a smaller amount of information. Therefore, it is necessary to give weights to the n multi-modal features through an attention mechanism so that the subsequently obtained fusion features are more accurate.
Illustratively, the n weights corresponding to the n multi-modal features are determined based on a gated attention mechanism: scalar weights for specific dimensions of the multi-modal features are dynamically generated by a gating mechanism.
Step 603: and performing weighted calculation on the n multi-modal characteristics according to the n weights to obtain a fusion characteristic.
Optionally, the weighting is applied by element-wise multiplication of the feature vectors.
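A minimal sketch of the gated attention fusion of steps 602 and 603 is given below, assuming one scalar gate per modality produced from the concatenated features; the gating granularity (per modality rather than per dimension), the projection layers and all names are assumptions for illustration.
```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Hypothetical sketch: weight n multi-modal features with gates, then fuse them."""
    def __init__(self, dims: list, fused_dim: int = 1024):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        self.gates = nn.Linear(sum(dims), len(dims))     # one scalar weight per modality

    def forward(self, features: list) -> torch.Tensor:
        weights = torch.sigmoid(self.gates(torch.cat(features, dim=-1)))  # (batch, n)
        projected = [proj(f) for proj, f in zip(self.projs, features)]
        fused = projected[0] * weights[:, 0:1]
        for i in range(1, len(projected)):
            # element-wise multiplication of the weighted modality features
            fused = fused * (projected[i] * weights[:, i:i + 1])
        return fused
```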
Step 604: and calling m cascaded label classifiers to perform data processing on the fusion features to obtain m granularity classification features.
The m label classifiers are used to classify the fused feature at the m classification granularities to obtain m granularity classification features representing the m classification granularities. In an alternative design, the connections between the m label classifiers can refer to the connections of the label classifiers in the classification module 303 shown in fig. 3 or fig. 4.
The m granularity classification features represent the features corresponding to the m classification granularities.
Step 605: and fusing hidden layer characteristics of the m label classifiers according to the cascade sequence of the m label classifiers to obtain the overall classification characteristic.
In an alternative design, the m tag classifiers belong to a fully connected feed-forward network.
Illustratively, the cascade order of the 3 label classifiers is "label classifier 1-label classifier 2-label classifier 3", and then when the hidden layer features of the label classifiers are fused, the hidden layer features of the label classifier 1, the label classifier 2 and the label classifier 3 are fused in sequence.
It should be noted that, step 604 and step 605 are not in sequence, and step 604 may be executed first and then step 605 is executed, or step 605 may be executed first and then step 604 is executed, which is not limited in this embodiment of the present application.
Step 606: and respectively combining the overall classification features with the m granularity classification features to obtain m target classification features.
Optionally, the m target classification features are obtained by splicing each of the m granularity classification features after the overall classification feature, or by splicing each of the m granularity classification features before the overall classification feature.
Step 607: and determining m levels of classification labels of the target video according to the m target classification features.
In an alternative design, the target classification feature includes probability values of a plurality of candidate tags, and the candidate tag corresponding to the maximum probability value in the target classification feature is determined as the classification tag.
To sum up, the present embodiment fuses the multi-modal features of the target video to obtain a fused feature, and determines the classification label of the target video according to the fused feature. The method can mine finer granularity information and deepen the understanding and extraction of the video content through multi-modal characteristics. When the fusion features are classified, the dependency relationship and constraint information among different modalities are fully utilized from the overall perspective, and the accuracy of video classification results is improved.
In addition, when the fused feature is classified, granularity classification features are obtained at different granularities. Through multi-modal hierarchical modeling, finer-grained information can be mined and the understanding of video content deepened, and the overall classification feature is fully used as constraint information, which improves the prediction accuracy of the model.
Fig. 7 shows a schematic structural diagram of a classification module provided in an exemplary embodiment of the present application.
The classification module 305 comprises a label classifier 501, a label classifier 502 and a label classifier 503. The label classifier 501, the label classifier 502 and the label classifier 503 form a three-level hierarchy from coarse to fine: the label classifier 502 is constrained by the label classifier 501, and the label classifier 503 is constrained by the label classifier 502. The input of the label classifier 501 is the fusion feature, and the output is the first-level classification label. The label classifier 501 provides the first-level classification label to the label classifier 502. The input of the label classifier 502 is the first-level classification label and the fusion feature, and the output is the second-level classification label. The label classifier 502 provides its hidden layer features and the second-level classification label to the label classifier 503. The input of the label classifier 503 is the hidden layer features of the label classifier 502, the second-level classification label and the fusion feature, and the output is the third-level classification label. In summary, the classification module 305 passes information between the label classifiers through the hierarchy described above to generate classification labels of different granularities.
However, the classification module 305 has a problem: the second-level classification label is affected by the first-level classification label, and the third-level classification label is affected by the first-level and second-level classification labels. The first-level and second-level classification labels are label embedding vectors (label embedding) and impose an obvious constraint. If the first-level label is wrong, the second-level label and the third-level label are bound to be wrong as well. Therefore, the classification module 305 is optimized and extended into the classification module 500.
The classification module 500 also includes the label classifier 501, the label classifier 502 and the label classifier 503, but the label classifiers within the classification module 500 are connected differently from those in the classification module 305. The hidden layer features of all the label classifiers are fused to obtain an overall classification feature 504, and the overall classification feature 504 is input into each label classifier; in addition, the target classification features output by the adjacent label classifiers are also input into each label classifier. That is, the input of each label classifier consists of two parts: (1) the overall classification feature 504; (2) the target classification features output by the adjacent label classifiers. For example, the input of the label classifier 501 is the overall classification feature 504 and the target classification feature output by the label classifier 502; the input of the label classifier 502 is the overall classification feature 504 and the target classification features output by the label classifier 501 and the label classifier 503; and the input of the label classifier 503 is the overall classification feature 504 and the target classification feature output by the label classifier 502. The classification label obtained in this way contains the information of the overall classification feature 504 and of the classification labels output by the adjacent label classifiers, which can effectively improve the accuracy of the classification labels. Moreover, a classification label of a certain level is influenced by the overall classification feature and the adjacent classification labels at the same time, rather than only by the classification label of the previous level, so the constraint on the classification label is weakened: even if the first-level label is wrong, the second-level label and the third-level label are not necessarily wrong.
Fig. 8 is a flowchart illustrating a video classification method according to an exemplary embodiment of the present application. The method may be performed by the terminal 120 or the server 140 or other computer device shown in fig. 1, the method comprising:
step 801: for the ith granularity classification feature of the m granularity classification features, determining a neighbor granularity classification feature adjacent to the ith granularity classification feature.
Wherein i is a positive integer less than m +1, and the initial value of i is 1.
Illustratively, there are 4 granularity classification features arranged in sequence. Then the neighbor granularity classification feature corresponding to the 1st granularity classification feature is the 2nd granularity classification feature; the neighbor granularity classification features corresponding to the 2nd granularity classification feature are the 1st and 3rd granularity classification features; the neighbor granularity classification features corresponding to the 3rd granularity classification feature are the 2nd and 4th granularity classification features; and the neighbor granularity classification feature corresponding to the 4th granularity classification feature is the 3rd granularity classification feature.
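The neighbor relationship of step 801 can be sketched as follows in Python; the 1-based indexing follows the description above, and the helper name is hypothetical.

    def neighbor_indices(i, m):
        # Neighbors of the i-th granularity classification feature (1-based):
        # the (i - 1)-th and (i + 1)-th features, when they exist.
        neighbors = []
        if i > 1:
            neighbors.append(i - 1)
        if i < m:
            neighbors.append(i + 1)
        return neighbors

    # For m = 4: 1 -> [2], 2 -> [1, 3], 3 -> [2, 4], 4 -> [3], matching the example above.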
Step 802: and calling a label classifier corresponding to the ith granularity classification characteristic, carrying out data processing on the overall classification characteristic and the neighbor granularity classification characteristic, and outputting the ith target classification characteristic corresponding to the ith granularity classification characteristic.
Optionally, the tag classifier is a fully connected feed forward network. The label classifier is used for combining the integral classification characteristic and the neighbor granularity classification characteristic of the ith granularity classification characteristic to obtain the ith target classification characteristic.
Step 803: and after i is updated to i +1, repeating the two steps until m target classification features are obtained.
The above steps 801 to 802 need to iterate m times, and each iteration obtains one target classification feature until m target classification features are obtained.
In summary, the present embodiment provides a method for determining the target classification features: the target classification feature used for classification is obtained from the overall classification feature and the neighbor granularity classification features adjacent to the granularity classification feature. The method can effectively integrate information among different features and improve the accuracy of the classification labels. It also weakens the constraint effect of the classification labels; for example, even if the first-level label is wrong, the second-level label and the third-level label are not necessarily wrong.
Fig. 9 shows a schematic structural diagram of a fusion module provided in an exemplary embodiment of the present application. The fusion of two multi-modal features, namely an image feature and an audio feature, is taken as an example for explanation:
on one hand, an image 901 of the target video is input into the flipping and whitening module 903, which flips and whitens the image 901 to facilitate the subsequent feature extraction process. The flipped and whitened image is input into the image feature extraction network layer 904 to obtain the image features of the target video. On the other hand, the audio 902 of the target video is input into the audio feature extraction network layer 905 to obtain the audio features of the target video.
The image and audio features of the target video are input into a fusion module 906 to achieve fusion of the image and audio features. Optionally, the manner of fusing the image features and the audio features is at least one of a gate-based fusion method, an attention-based fusion method, and a tensor-based fusion method. Taking a gate-controlled fusion method as an example, the fusion method assigns weights to the image features and the audio features, and calculates a weighted sum of the image features and the audio features to obtain fusion features.
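As a minimal sketch of the gate-based fusion described above, the following PyTorch module assigns a weight to each of the two modalities with a learned gate and returns their weighted sum; the layer sizes and the softmax gate are illustrative assumptions rather than details fixed by this embodiment.

    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # produces one scalar weight per modality from the concatenated features
            self.gate = nn.Linear(2 * dim, 2)

        def forward(self, image_feat, audio_feat):
            # image_feat, audio_feat: (batch, dim), already projected to the same dimension
            weights = torch.softmax(
                self.gate(torch.cat([image_feat, audio_feat], dim=-1)), dim=-1)
            # weighted sum of the two modality features gives the fusion feature
            return weights[:, 0:1] * image_feat + weights[:, 1:2] * audio_feat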
The gating module 907 is a structure used to reduce overfitting of the neural network. The gating module 907 is applied in the training phase of the network to prevent overfitting; it is not necessary in the application phase of the network. In an alternative design, the gating module 907 is a dropout layer.
The fully connected layer 908 and the SE module 909 (S for Squeeze, i.e., compression; E for Excitation, i.e., activation) are used to calibrate the fusion feature. In an alternative design, the SE module 909 is an SE Context Gating module (Squeeze & Excitation Context Gating, i.e., compression-and-excitation context gating).
The classification module 910 is configured to classify the calibrated fusion features to obtain a classification label of the target video.
Fig. 10 is a flowchart illustrating a feature fusion method according to an exemplary embodiment of the present application. The method may be performed by the terminal 120 or the server 140 or other computer device shown in fig. 1, the method comprising:
step 1001: based on the attention mechanism, n weights corresponding to the n multi-modal features are determined.
The target video can be characterized by different modalities from different angles, and the contribution of different modalities to the classification result is inconsistent. For example, image features provide a larger amount of information, while audio features provide a smaller amount of information. Therefore, it is necessary to give weights to the n multi-modal features through an attention mechanism so that the subsequently obtained fusion features are more accurate.
Optionally, the n weights corresponding to the n multi-modal features are determined based on a gated attention mechanism, in which the weight of each multi-modal feature is a dimension-specific scalar dynamically generated by a gating mechanism.
Step 1002: and performing weighted calculation on the n multi-modal characteristics according to the n weights to obtain a fusion characteristic.
Optionally, the weighted calculation is performed by bit-wise (element-wise) multiplication of vectors.
Step 1003: and performing compression processing and activation processing on the fusion features, and outputting intermediate fusion features.
The compression process performs global average pooling on the fusion feature, converting a C×H×W fusion feature into a 1×1×C feature.
The activation process is used to make a non-linear transformation of the fused features after the compression process.
Optionally, the compression process and the activation process are implemented by an SE module.
Step 1004: and calibrating the intermediate fusion features, and outputting the optimized fusion features.
Optionally, the calibration process can be expressed as the following equation: Y = σ(WX + b) · X, where X is the input intermediate fusion feature, Y is the optimized fusion feature, σ is the activation function (e.g., the Sigmoid function, an S-shaped growth curve), and W and b are parameters obtained by training. σ(WX + b) is a gating vector whose components lie between 0 and 1 and activate or suppress components of the intermediate fusion feature. The purpose of this approach is two-fold: first, the activation mechanism captures the internal association of each modal feature within the intermediate fusion feature; second, the intermediate fusion feature is further calibrated.
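The calibration Y = σ(WX + b) · X can be sketched in PyTorch as follows, assuming the intermediate fusion feature is a (batch, dim) vector; this follows the context-gating form described above and is only an illustrative implementation.

    import torch
    import torch.nn as nn

    class ContextGating(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.fc = nn.Linear(dim, dim)  # W and b are learned during training

        def forward(self, x):
            gate = torch.sigmoid(self.fc(x))  # values in (0, 1): activate or suppress each dimension
            return gate * x                   # calibrated (optimized) fusion feature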
Step 1003 and step 1004 are optional steps: they may be executed, or they may be skipped.
In summary, the present embodiment provides a feature fusion method, which can fuse multi-modal features of a target video and optimize the fused features. The fusion features can depict the target video from different modes, and the fusion features can accurately represent the target video.
In the following embodiment, the multi-modal features of the target video are taken to include image features, audio features and text features as an example. Fig. 11 shows a flowchart of a training method of a video classification model provided in an exemplary embodiment of the present application. The method may be performed by the terminal 120 or the server 140 shown in fig. 1 or other computer devices, and the method comprises:
step 1101: based on the attention mechanism, image weights corresponding to the image features are determined.
Optionally, image weights corresponding to the image features are determined based on a gated attention mechanism.
In another alternative embodiment, image weights corresponding to image features are determined based on a tensor-based approach.
Step 1102: based on the attention mechanism, audio weights corresponding to the audio features are determined.
Optionally, an audio weight corresponding to the audio feature is determined based on the gated attention mechanism.
In another alternative embodiment, the audio weights corresponding to the audio features are determined based on a tensor-based approach.
Step 1103: based on the attention mechanism, text weights corresponding to the text features are determined.
Optionally, a text weight corresponding to the text feature is determined based on a gated attention mechanism.
In another alternative embodiment, a text weight corresponding to a text feature is determined based on a tensor-based approach.
It should be noted that steps 1101, 1102 and 1103 are not required to be executed in a fixed order. For example, they may be performed in the order of step 1101, step 1102, step 1103, or in the order of step 1102, step 1103, step 1101. The embodiments of the present application do not limit this.
Step 1104: and performing weighted calculation on the image features, the audio features and the text features according to the image weight, the audio weight and the text weight to obtain fusion features.
Optionally, the weighted calculation is performed by bit-wise (element-wise) multiplication of vectors.
Illustratively, the image feature is denoted by a, the audio feature by b, and the text feature by c; the image weight is k1, the audio weight is k2, and the text weight is k3, where k1 + k2 + k3 = 1. The fusion feature is then d = a·k1 + b·k2 + c·k3.
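Under the scalar-weight reading of this example, the weighted calculation of step 1104 can be sketched as follows; in the general case the weights may also be vectors applied by bit-wise multiplication.

    import torch

    def weighted_fusion(a, b, c, k1, k2, k3):
        # a: image feature, b: audio feature, c: text feature (tensors of the same shape)
        # k1 + k2 + k3 = 1
        return a * k1 + b * k2 + c * k3

    d = weighted_fusion(torch.ones(8), torch.ones(8), torch.ones(8), 0.5, 0.3, 0.2)
    # every component of d equals 1.0 because the weights sum to 1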
Step 1105: and performing compression processing and activation processing on the fusion features, and outputting intermediate fusion features.
The compression process performs global average pooling on the fusion feature, converting a C×H×W fusion feature into a 1×1×C feature.
The activation process is used to make a non-linear transformation of the fused features after the compression process.
Optionally, the compression process and the activation process are implemented by an SE module.
Step 1106: and calibrating the intermediate fusion features, and outputting the optimized fusion features.
It should be noted that step 1105 and step 1106 are optional steps: they may be executed, or they may be skipped.
In summary, the present embodiment provides a method for fusing an image feature, an audio feature, and a text feature, which can fuse the image feature, the audio feature, and the text feature of a target video and optimize the fused features. The fusion features can depict the target video from three modes of images, audios and texts, and the fusion features can accurately represent the target video.
Fig. 12 is a flowchart illustrating a method for training a video classification model according to an exemplary embodiment of the present application. The video classification model comprises a feature extraction network layer, a feature fusion network layer and a classification network layer, and the method can be executed by the terminal 120 or the server 140 shown in fig. 1 or other computer devices, and comprises the following steps:
step 1202: and acquiring a sample training set.
The sample training set comprises a sample video and a real label corresponding to the sample video;
the sample training set includes one or more sample videos.
Step 1204: and calling a feature extraction network layer, carrying out data processing on the sample video, and outputting n sample multi-modal features.
The n sample multimodal features include at least two sample multimodal features. Illustratively, the n sample multimodal features include at least two of a sample image feature, a sample audio feature, and a sample text feature, and n is a positive integer greater than 1.
Step 1206: and calling a feature fusion network layer, carrying out fusion processing on the n sample multi-modal features, and outputting the sample fusion features.
The sample fusion feature refers to a feature containing n kinds of multi-modal features.
Optionally, a vector bit-wise multiplication method is adopted to fuse the n sample multi-modal features.
Step 1208: and calling a classification network layer, respectively carrying out data processing on the n sample multi-modal characteristics and the sample fusion characteristics, and respectively outputting n sample granularity classification labels and sample fusion classification labels.
Optionally, the classification network layer includes n +1 cascaded tag classifiers, and the n +1 cascaded tag classifiers correspond to the n sample multimodal features and the sample fusion features one to one.
Step 1210: and respectively calculating the cross entropy between the n sample granularity classification labels and the real labels to obtain n granularity cross entropies.
The n granularity cross entropies are used for measuring the difference information between the n sample granularity classification labels and the real labels.
Optionally, the n granularity cross entropies include an image cross entropy, an audio cross entropy, and a text cross entropy. The image cross entropy is the cross entropy corresponding to the sample image features, the audio cross entropy is the cross entropy corresponding to the sample audio features, and the text cross entropy is the cross entropy corresponding to the sample text features.
Illustratively, as shown in fig. 13, when the n sample multi-modal features include a sample image feature 1301, a sample audio feature 1302 and a sample text feature 1303: the label classifier 1305 is invoked to perform data processing on the sample image feature 1301 to obtain a sample image classification label, and the image cross entropy is obtained according to the cross entropy between the sample image classification label and the real label; the label classifier 1306 is invoked to perform data processing on the sample audio feature 1302 to obtain a sample audio classification label, and the audio cross entropy 1310 is obtained according to the cross entropy between the sample audio classification label and the real label; the label classifier 1307 is invoked to perform data processing on the sample text feature 1303 to obtain a sample text classification label, and the text cross entropy 1311 is obtained according to the cross entropy between the sample text classification label and the real label.
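A minimal sketch of step 1210 for the three-modality case is given below, assuming each label classifier outputs raw logits over the label vocabulary and the real label is a class index; the function and variable names are hypothetical.

    import torch.nn.functional as F

    def modality_cross_entropies(image_logits, audio_logits, text_logits, real_label):
        # real_label: tensor of class indices of shape (batch,)
        ce_image = F.cross_entropy(image_logits, real_label)  # image cross entropy
        ce_audio = F.cross_entropy(audio_logits, real_label)  # audio cross entropy
        ce_text = F.cross_entropy(text_logits, real_label)    # text cross entropy
        return ce_image, ce_audio, ce_text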
Step 1212: and calculating the cross entropy between the sample fusion classification label and the real label to obtain the fusion cross entropy.
For example, as shown in fig. 13, a label classifier 1308 is invoked to perform data processing on the sample fusion feature 1304 to obtain a sample fusion classification label; the fusion cross entropy is obtained according to the cross entropy between the sample fusion classification label and the real label.
It should be noted that steps 1210 and 1212 are not required to be executed in a fixed order: step 1210 may be performed first and then step 1212, or step 1212 may be performed first and then step 1210.
Step 1214: and training the video classification model according to the n granularity cross entropies and the fusion cross entropies.
In the training process, if only one modality is considered during training, the video classification model may overfit to the modality that converges most easily, which affects the learning of the network parameters. To avoid this, relative entropy is introduced to supervise the learning of the video classification model. In this case, this step includes the following sub-steps:
1. and respectively calculating the relative entropies of the n sample multimodal characteristics and the sample fusion characteristics to obtain n granularity relative entropies.
Optionally, the n granularity relative entropies include at least two of an image relative entropy, an audio relative entropy and a text relative entropy. The n granularity relative entropies are used for describing the information entropy differences between the n granularity cross entropies and the fusion cross entropy.
Illustratively, when the n sample multi-modal features include a sample image feature 1301, a sample audio feature 1302 and a sample text feature 1303, the relative entropy between the probability distribution of the sample image feature and the probability distribution of the sample fusion feature is calculated, resulting in an image relative entropy 1312 of the n granularity relative entropies. And calculating relative entropy between the probability distribution of the sample audio features and the probability distribution of the sample fusion features to obtain audio relative entropy 1313 of the n granularity relative entropies. And calculating a relative entropy between the probability distribution of the sample text features and the probability distribution of the sample fusion features to obtain a text relative entropy 1314 of the n granularity relative entropies.
2. And calculating the sum of the n granularity relative entropies to obtain the relative entropy loss.
Optionally, the sum of the n granularity relative entropies is calculated directly to obtain the relative entropy loss; or weights are assigned to the n granularity relative entropies, and the weighted sum of the n granularity relative entropies is calculated to obtain the relative entropy loss.
3. And training the video classification model according to the n granularity cross entropies and the fusion cross entropies by taking the minimized relative entropy loss as a training target.
Optionally, with the minimized relative entropy loss as a training target, the model parameters of the video classification model are adjusted according to the n granularity cross entropies and the fusion cross entropies and the error back propagation algorithm, and the video classification model is trained.
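The following sketch combines sub-steps 2 and 3 above: the relative entropy loss is the (optionally weighted) sum of the granularity relative entropies, and the model is trained by error back propagation on the combined objective. The plain summation of the cross entropies and the relative entropy loss is an assumption for illustration; the embodiment does not fix the exact combination.

    def relative_entropy_loss(granularity_kls, weights=None):
        # granularity_kls: list of scalar tensors (e.g. image/audio/text relative entropies)
        if weights is None:
            weights = [1.0] * len(granularity_kls)
        return sum(w * kl for w, kl in zip(weights, granularity_kls))

    def training_step(optimizer, granularity_ces, fusion_ce, kl_loss):
        # granularity_ces: the n granularity cross entropies; fusion_ce: the fusion cross entropy
        loss = sum(granularity_ces) + fusion_ce + kl_loss
        optimizer.zero_grad()
        loss.backward()   # error back propagation adjusts the model parameters
        optimizer.step()
        return loss.item()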
In summary, the video classification model is trained in the present embodiment. This can effectively improve the accuracy of the model and accelerate the convergence speed of the model. Moreover, the introduction of relative entropy can enhance the synergy among the multiple modalities.
In the following embodiment, taking n sample multi-modal features including sample image features, sample audio features and sample text features as an example for explanation, fig. 14 shows a flowchart of a training method of a video classification model provided in an exemplary embodiment of the present application. The video classification model comprises a feature extraction network layer, a feature fusion network layer and a classification network layer, and the method can be executed by the terminal 120 or the server 140 shown in fig. 1 or other computer devices, and comprises the following steps:
step 1401: and acquiring a sample training set.
The sample training set comprises a sample video and a real label corresponding to the sample video;
the sample training set includes one or more sample videos.
Step 1402: and calling an image feature extraction network layer, carrying out data processing on the sample video, and outputting the image features of the sample.
Optionally, extracting a sample target video frame from the sample video; calling an image feature extraction network, carrying out data processing on the sample target video frame, and outputting sample video frame features of the sample video frame; and fusing the sample video frame characteristics to obtain the sample image characteristics of the sample video.
Step 1403: and calling an audio characteristic extraction network layer, carrying out data processing on the sample video, and outputting the audio characteristic of the sample.
Optionally, obtaining a mel frequency spectrum diagram of the sample video; and calling an audio characteristic extraction network, carrying out data processing on the Mel frequency spectrogram, and outputting the sample audio characteristics of the sample video.
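A minimal sketch of obtaining the Mel spectrogram of the sample audio is shown below, assuming the audio track has already been extracted from the sample video into a waveform file; the sampling rate and number of Mel bands are hypothetical, and the audio feature extraction network itself (e.g. a Vggish-style model) is not shown.

    import librosa

    def mel_spectrogram(audio_path, sr=16000, n_mels=64):
        waveform, sr = librosa.load(audio_path, sr=sr)
        mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
        # the log-Mel spectrogram is fed to the audio feature extraction network
        return librosa.power_to_db(mel)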
Step 1404: and calling a text feature extraction network layer, carrying out data processing on the sample video, and outputting the text features of the sample.
Optionally, extracting text content in the sample video; and calling a text feature extraction network, carrying out data processing on the text content and the title of the sample video, and outputting the sample text feature of the sample video.
It should be noted that, when step 1402, step 1403, and step 1404 are executed, they are not in sequence.
Step 1405: and calling a characteristic fusion network layer, carrying out fusion processing on the sample image characteristic, the sample audio characteristic and the sample text characteristic, and outputting the sample fusion characteristic.
The sample fusion features refer to features including sample image features, sample audio features, and sample text features.
Optionally, a vector bit-wise multiplication method is adopted to fuse the sample image features, the sample audio features and the sample text features.
Step 1406: and calling a classification network layer, respectively carrying out data processing on the sample image characteristics, the sample audio characteristics, the sample text characteristics and the sample fusion characteristics, and respectively outputting a sample image classification label, a sample audio classification label, a sample text classification label and a sample fusion classification label.
Optionally, the classification network layer includes 4 cascaded tag classifiers, and the 4 cascaded tag classifiers correspond to the sample image feature, the sample audio feature, the sample text feature, and the sample fusion feature one to one.
Step 1407: and calculating the cross entropy between the sample image classification label and the real label to obtain the image cross entropy.
For example, as shown in fig. 13, a label classifier 1305 is called to perform data processing on a sample image feature 1301 to obtain a sample image classification label; and obtaining the image cross entropy according to the cross entropy between the sample image classification label and the real label.
Step 1408: and calculating the cross entropy between the sample audio classification label and the real label to obtain the audio cross entropy.
Illustratively, as shown in fig. 13, a label classifier 1306 is invoked to perform data processing on the sample audio feature 1302 to obtain a sample audio classification label; the audio cross entropy 1310 is obtained according to the cross entropy between the sample audio classification label and the real label.
Step 1409: and calculating the cross entropy between the sample text classification label and the real label to obtain the text cross entropy.
Illustratively, as shown in fig. 13, a label classifier 1307 is called to perform data processing on the sample text feature 1303 to obtain a sample text classification label; the text cross entropy 1311 is obtained according to the cross entropy between the sample text classification label and the real label.
Step 1410: and calculating the cross entropy between the sample fusion classification label and the real label to obtain the fusion cross entropy.
For example, as shown in fig. 13, a label classifier 1308 is invoked to perform data processing on the sample fusion feature 1304 to obtain a sample fusion classification label; the fusion cross entropy is obtained according to the cross entropy between the sample fusion classification label and the real label.
It should be noted that steps 1407, 1408, 1409 and 1410 are not required to be executed in a fixed order.
Step 1411: and training the video classification model according to the image cross entropy, the audio cross entropy, the text cross entropy and the fusion cross entropy.
In an alternative design of the present application, relative entropy is introduced to supervise the learning of the video classification model: calculating the relative entropy between the probability distribution of the sample image characteristics and the probability distribution of the sample fusion characteristics to obtain the image relative entropy in the n granularity relative entropies; calculating the relative entropy between the probability distribution of the sample audio features and the probability distribution of the sample fusion features to obtain the audio relative entropy in the n granularity relative entropies; and calculating the relative entropy between the probability distribution of the sample text features and the probability distribution of the sample fusion features to obtain the text relative entropy in the n granularity relative entropies. And calculating the sum of the image relative entropy, the audio relative entropy and the text relative entropy to obtain the relative entropy loss. And training the video classification model by taking the minimized relative entropy loss as a training target according to the image cross entropy, the audio cross entropy, the text cross entropy and the fusion cross entropy.
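The relative entropy between a single modality's predicted distribution and the fusion feature's predicted distribution can be sketched as follows in PyTorch; the direction of the KL term is an assumption here, since the embodiment only states that the relative entropy between the two probability distributions is calculated.

    import torch.nn.functional as F

    def modality_relative_entropy(modality_logits, fusion_logits):
        log_p = F.log_softmax(modality_logits, dim=-1)  # modality distribution (log-probabilities)
        q = F.softmax(fusion_logits, dim=-1)            # fusion distribution (probabilities)
        # F.kl_div expects log-probabilities as input and probabilities as target
        return F.kl_div(log_p, q, reduction="batchmean")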
In summary, in this embodiment the video classification model is trained, and the sample image features, the sample audio features and the sample text features are introduced into the training of the video classification model. This can effectively improve the accuracy of the model and accelerate the convergence speed of the model. Moreover, the introduction of relative entropy can enhance the synergy among the image, audio and text modalities.
In the following embodiment, a target video is recommended to a user according to a category label of the video, and fig. 15 shows a flowchart of a video recommendation method provided in an exemplary embodiment of the present application. The method comprises the following steps:
step 1501: and acquiring n multi-modal characteristics of the video to be recommended.
The video to be recommended may be any video. The video to be recommended may be a local video stored in the computer device, may also be a video downloaded from a network, and may also be a video provided by other computer devices, which is not limited in this embodiment of the application.
In an embodiment of the application, the n multi-modal features comprise at least two of image features, audio features and text features, and n is a positive integer greater than 1.
Step 1502: and fusing the n multi-modal characteristics to obtain fused characteristics.
Optionally, a vector bit-wise multiplication method is adopted to fuse the n multi-modal features.
Step 1503: and classifying the fusion features according to the m classification granularities to obtain the overall classification feature and m granularity classification features of the video to be recommended.
The m classification granularities are used for representing m interrelated granularities for classification in a target dimension, and m is a positive integer. The target dimension refers to the dimension used for classifying the fusion features, and the obtained m granularity classification features are all features corresponding to the target dimension.
The overall classification feature is a feature that includes the m granularity classification features. The overall classification feature may include the complete m granularity classification features, or may include partial features of the m granularity classification features.
Step 1504: and obtaining m-level classification labels of the videos to be recommended according to the overall classification features and the m granularity classification features.
In an optional design, the overall classification features are respectively combined with m granularity classification features to obtain m target classification features; and determining m levels of classification labels of the target video according to the m target classification features. The m target classification features are used to determine classification labels at m classification granularities.
Step 1505: and recommending the video to be recommended to the target user account under the condition that the recommendation tag of the target user account is matched with the m-level classification tag.
Optionally, the recommendation tag of the target user account is determined according to the historical browsing record of the target user account. Alternatively, the recommendation tag of the target user account is set by the user.
Optionally, in the case that the recommendation tag of the target user account is identical to the m-level classification tag, recommending the video to be recommended to the user account. Optionally, in the case that the recommendation tag of the target user account is partially the same as the m-level classification tag, recommending the video to be recommended to the user account.
Illustratively, the recommendation tag of the target user account is "science-mobile phone-domestic mobile phone", the classification label of video 1 to be recommended is "science-mobile phone-foreign mobile phone", and the classification label of video 2 to be recommended is "life-food-chinese food". Because the classification label of video 2 is completely different from the recommendation tag of the target user account, while the classification label of video 1 is partially the same as the recommendation tag of the target user account, video 1 is recommended to the target user account.
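A minimal sketch of the matching in step 1505 is given below, assuming both the recommendation tag and the m-level classification label are represented as lists ordered from coarse to fine; treating any shared level as a match is one possible reading of "partially the same".

    def matches(recommendation_tag, classification_labels):
        # e.g. ["science", "mobile phone", "domestic mobile phone"]
        return any(r == c for r, c in zip(recommendation_tag, classification_labels))

    # "science-mobile phone-domestic mobile phone" vs "science-mobile phone-foreign mobile phone"
    # shares two levels, so video 1 is recommended; "life-food-chinese food" shares none.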
In summary, the present embodiment provides a method for recommending a video to a user account. The method obtains the recommendation tag of the user account and recommends videos to the user account in a targeted manner, so as to ensure that the recommended videos match the preference of the user.
In the following embodiments, a target video is recommended to a user according to a classification label of the video, and fig. 16 shows a flowchart of a video recommendation method provided in an exemplary embodiment of the present application.
Step 1601: and acquiring the image characteristics of the video to be recommended.
The video to be recommended may be any video. The video to be recommended may be a local video stored in the computer device, may also be a video downloaded from a network, and may also be a video provided by other computer devices, which is not limited in this embodiment of the application.
Optionally, extracting a target video frame from a video to be recommended; calling an image feature extraction network, performing data processing on the target video frame, and outputting the video frame feature of the target video frame; and fusing the video frame characteristics to obtain the image characteristics of the video to be recommended.
Step 1602: and acquiring the audio characteristics of the video to be recommended.
Optionally, obtaining a Mel frequency spectrogram of a video to be recommended; and calling an audio characteristic extraction network, carrying out data processing on the Mel frequency spectrogram, and outputting the audio characteristics of the video to be recommended. Optionally, a Vggish network is called to perform data processing on the Mel frequency spectrogram and output the audio features of the video to be recommended.
Step 1603: and acquiring the text characteristics of the video to be recommended.
Optionally, extracting text content in the video to be recommended; and calling a text feature extraction network, carrying out data processing on the text content and the title of the video to be recommended, and outputting the text feature of the video to be recommended.
Step 1604: and fusing the image characteristic, the audio characteristic and the text characteristic to obtain a fused characteristic.
Optionally, a vector bit-wise multiplication method is adopted to fuse the n multi-modal features.
Step 1605: and classifying the fusion features according to the m classification granularities to obtain the overall classification feature and m granularity classification features of the video to be recommended.
The m classification granularities are used for representing m interrelated granularities for classification in a target dimension, and m is a positive integer. The target dimension refers to the dimension used for classifying the fusion features, and the obtained m granularity classification features are all features corresponding to the target dimension.
The overall classification feature is a feature that includes the m granularity classification features. The overall classification feature may include the complete m granularity classification features, or may include partial features of the m granularity classification features.
Step 1606: and obtaining m-level classification labels of the videos to be recommended according to the integral classification features and the m granularity classification features.
In an optional design, the overall classification features are respectively combined with m granularity classification features to obtain m target classification features; and determining m levels of classification labels of the target video according to the m target classification features. The m target classification features are used to determine classification labels at m classification granularities.
Step 1607: and recommending the video to be recommended to the target user account under the condition that the recommendation tag of the target user account is matched with the m-level classification tag.
Optionally, the recommendation tag of the target user account is determined according to the historical browsing record of the target user account. Alternatively, the recommendation tag of the target user account is set by the user.
Optionally, in the case that the recommendation tag of the target user account is identical to the m-level classification tag, recommending the video to be recommended to the user account. Optionally, in the case that the recommendation tag of the target user account is partially the same as the m-level classification tag, recommending the video to be recommended to the user account.
In summary, the present embodiment provides a method for recommending a video to a user account. The method obtains the recommendation tag of the user account and recommends videos to the user account in a targeted manner, so as to ensure that the recommended videos match the preference of the user.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 17, a block diagram of a video classification apparatus according to an embodiment of the present application is shown. The above functions may be implemented by hardware, or may be implemented by hardware executing corresponding software. The apparatus 1700 includes:
a feature extraction module 1701, configured to obtain n kinds of multi-modal features of a target video, where the n kinds of multi-modal features include at least two of an image feature, an audio feature, and a text feature, and n is a positive integer greater than 1;
a fusion module 1702, configured to fuse the n multimodal features to obtain a fused feature;
a classification module 1703, configured to classify the fusion feature according to m classification granularities to obtain an overall classification feature and m granularity classification features of the target video, where the m classification granularities are used to indicate m interrelated granularities for classification in a target dimension, and the m granularity classification features are used to indicate classification features corresponding to the m classification granularities;
the classifying module 1703 is further configured to obtain m-level classification tags of the target video according to the overall classification features and the m granularity classification features, where the m-level classification tags are video tags arranged according to the m classification granularities.
In an optional design, the classification module 1703 is further configured to combine the overall classification features with the m granularity classification features, respectively, to obtain m target classification features; determining the m-level classification labels of the target video according to the m target classification features.
In an optional design, the classifying module 1703 is further configured to, for an ith granularity classification feature of the m granularity classification features, determine a neighbor granularity classification feature adjacent to the ith granularity classification feature, where i is a positive integer smaller than m +1, and an initial value of i is 1; calling a label classifier corresponding to the ith granularity classification characteristic, performing data processing on the integral classification characteristic and the neighbor granularity classification characteristic, and outputting an ith target classification characteristic corresponding to the ith granularity classification characteristic; and after the i is updated to i +1, repeating the two steps until the m target classification features are obtained.
In an optional design, the fusion module 1702 is further configured to invoke m cascaded tag classifiers, and perform data processing on the fusion features to obtain the m granularity classification features; and fusing hidden layer characteristics of the m label classifiers according to the cascade sequence of the m label classifiers to obtain the overall classification characteristics.
In an alternative design, the fusion module 1702 is further configured to determine n weights corresponding to the n multi-modal features based on an attention mechanism; and performing weighted calculation on the n multi-modal characteristics according to the n weights to obtain the fusion characteristics.
In an alternative design, the classification module 1703 is further configured to determine an image weight of the image feature, and determine an audio weight of the audio feature, and determine a text weight of the text feature based on an attention mechanism; the classification module 1703 is further configured to perform weighted calculation on the image features, the audio features, and the text features according to the image weight, the audio weight, and the text weight, and output the fusion features.
In an alternative design, the fusion module 1702 is further configured to perform a compression process and an activation process on the fusion feature, and output an intermediate fusion feature; and calibrating the intermediate fusion features, and outputting the optimized fusion features.
In an alternative design, the feature extraction module 1701 is further configured to extract a target video frame from the target video; calling an image feature extraction network, carrying out data processing on the target video frame, and outputting the video frame feature of the target video frame; and fusing the video frame characteristics to obtain the image characteristics of the target video.
In an alternative design, the feature extraction module 1701 is further configured to obtain a mel-frequency spectrum of the target video; and calling an audio feature extraction network, carrying out data processing on the Mel frequency spectrogram, and outputting the audio features of the target video.
In an alternative design, the feature extraction module 1701 is further configured to extract text content in the target video; and calling a text feature extraction network, carrying out data processing on the text content and the title of the target video, and outputting the text feature of the target video.
To sum up, the present embodiment fuses the multi-modal features of the target video to obtain a fused feature, and determines the classification label of the target video according to the fused feature. The method can make full use of the dependency relationship and constraint information among different modes, and improve the accuracy of the video classification result. Obtaining more multi-modal content understanding characteristics describing video content, including multi-level fine-grained classification information of the video, and assisting content distribution of a recommendation system; the comprehension of the content can fully utilize the title text, the audio content and the video content of the video content as the basis, so that the description is more comprehensive and accurate, and the content distribution efficiency is improved; finer-grained information can be mined through multi-modal hierarchical modeling to deepen the understanding of the video content, the category hierarchical dependency relationship and constraint information are fully utilized, and the accuracy of the model is improved.
Referring to fig. 18, a block diagram of a training apparatus for a video classification model according to an embodiment of the present application is shown. The above functions may be implemented by hardware, or may be implemented by hardware executing corresponding software. The apparatus 1800 includes:
a sample obtaining module 1801, configured to obtain a sample training set, where the sample training set includes a sample video and a real label corresponding to the sample video;
a sample feature extraction module 1802, configured to invoke the feature extraction network layer, perform data processing on the sample video, and output n sample multimodal features, where the n sample multimodal features include at least two of a sample image feature, a sample audio feature, and a sample text feature, and n is a positive integer greater than 1;
a sample fusion module 1803, configured to invoke the feature fusion network layer, perform fusion processing on the n sample multi-modal features, and output a sample fusion feature;
a sample classification module 1804, configured to invoke the classification network layer, perform data processing on the n sample multi-modal features and the sample fusion features, and output n sample granularity classification tags and sample fusion classification tags, respectively;
a training module 1805, configured to calculate cross entropies between the n sample granularity classification labels and the real labels, respectively, to obtain n granularity cross entropies; calculating the cross entropy between the sample fusion classification label and the real label to obtain a fusion cross entropy;
the training module 1805 is further configured to train the video classification model according to the n granularity cross entropies and the fusion cross entropy.
In an optional design, the training module 1805 is further configured to calculate relative entropies between the n granularity cross entropies and the fusion cross entropy, respectively, to obtain n granularity relative entropies; calculating the sum of the n granularity relative entropies to obtain the relative entropy loss; and training the video classification model according to the n granularity cross entropies and the fusion cross entropy by taking the minimized relative entropy loss as a training target.
In an alternative design, the n sample multi-modal features comprise the sample image features; the training module 1805 is further configured to calculate a relative entropy between the probability distribution of the sample image features and the probability distribution of the sample fusion features, so as to obtain an image relative entropy of the n granularity relative entropies.
In an alternative design, the n sample multi-modal features comprise the sample audio features; the training module 1805 is further configured to calculate a relative entropy between the probability distribution of the sample audio features and the probability distribution of the sample fusion features, so as to obtain an audio relative entropy of the n granularity relative entropies.
In an alternative design, the n sample multimodal features include the sample text features; the training module 1805 is further configured to calculate a relative entropy between the probability distribution of the sample text features and the probability distribution of the sample fusion features, so as to obtain a text relative entropy of the n granularity relative entropies.
In summary, the video classification model is trained in the present embodiment. This can effectively improve the accuracy of the model and accelerate the convergence speed of the model. Moreover, the introduction of relative entropy can enhance the synergy among the multiple modalities.
FIG. 19 is a block diagram illustrating a computer device, according to an example embodiment. The computer device 1900 includes a Central Processing Unit (CPU) 1901, a system Memory 1904 including a Random Access Memory (RAM) 1902 and a Read-Only Memory (ROM) 1903, and a system bus 1905 connecting the system Memory 1904 and the CPU 1901. The computer device 1900 also includes a basic Input/Output system (I/O system) 1906 for facilitating information transfer between devices within the computer device, and a mass storage device 1907 for storing an operating system 1913, application programs 1914, and other program modules 1915.
The basic input/output system 1906 includes a display 1908 for displaying information and an input device 1909, such as a mouse, keyboard, etc., for user input of information. Wherein the display 1908 and input device 1909 are coupled to the central processing unit 1901 through an input-output controller 1910 coupled to the system bus 1905. The basic input/output system 1906 may also include an input/output controller 1910 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1910 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1907 is connected to the central processing unit 1901 through a mass storage controller (not shown) connected to the system bus 1905. The mass storage device 1907 and its associated computer device-readable media provide non-volatile storage for the computer device 1900. That is, the mass storage device 1907 may include a computer device-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer device readable media may comprise computer device storage media and communication media. Computer device storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer device readable instructions, data structures, program modules or other data. Computer device storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), CD-ROM, Digital Video Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer device storage media are not limited to the foregoing. The system memory 1904 and mass storage device 1907 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1900 may also operate by being connected, through a network such as the Internet, to a remote computer device on the network. That is, the computer device 1900 may connect to the network 1911 through the network interface unit 1912 connected to the system bus 1905, or may connect to other types of networks or remote computer device systems (not shown) using the network interface unit 1912.
The memory further includes one or more programs, the one or more programs are stored in the memory, and the central processor 1901 implements all or part of the steps of the video classification method or the training method of the video classification model by executing the one or more programs.
In an exemplary embodiment, a computer-readable storage medium is further provided, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the video classification method provided by the above-mentioned method embodiments, or the training method of the video classification model.
The present application further provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the video classification method provided in the foregoing method embodiments, or the training method of the video classification model.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the video classification method or the training method of the video classification model provided in the above embodiment.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (20)

1. A method for video classification, the method comprising:
acquiring n multi-modal characteristics of a target video, wherein the n multi-modal characteristics comprise at least two of image characteristics, audio characteristics and text characteristics, and n is a positive integer greater than 1;
fusing the n multi-modal features to obtain fused features;
classifying the fusion features according to m classification granularities to obtain an overall classification feature and m granularity classification features of the target video, wherein the m classification granularities are used for representing m interrelated granularities for classification under a target dimension, the m granularity classification features are used for representing classification features corresponding to the m classification granularities, and m is a positive integer;
and obtaining m-level classification labels of the target video according to the overall classification features and the m granularity classification features, wherein the m-level classification labels are video labels arranged according to the m classification granularities.
2. The method of claim 1, wherein obtaining m-level classification labels of the target video according to the overall classification feature and the m granularity classification features comprises:
respectively combining the overall classification feature with the m granularity classification features to obtain m target classification features;
determining the m-level classification labels of the target video according to the m target classification features.
3. The method of claim 2, wherein the combining the overall classification feature with the m granularity classification features to obtain m target classification features comprises:
for the ith granularity classification feature in the m granularity classification features, determining a neighbor granularity classification feature adjacent to the ith granularity classification feature, wherein i is a positive integer smaller than m +1, and the initial value of i is 1;
calling a label classifier corresponding to the ith granularity classification feature, performing data processing on the overall classification feature and the neighbor granularity classification feature, and outputting an ith target classification feature corresponding to the ith granularity classification feature;
and after i is updated to i + 1, repeating the foregoing two steps until the m target classification features are obtained.
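As a non-limiting illustration of claim 3, the sketch below combines the overall classification feature with a neighbor granularity feature at each level; interpreting the "neighbor" of the ith granularity as the (i-1)th (coarser) granularity is an assumption of this example.

```python
# Hypothetical neighbor combination for claim 3: each label classifier receives the overall
# classification feature together with the neighbor granularity classification feature.
import torch
import torch.nn as nn

def combine_with_neighbors(overall, gran_feats, classifiers):
    """overall: (B, D); gran_feats: list of m tensors (B, D); classifiers: m modules."""
    targets = []
    for i, clf in enumerate(classifiers):
        neighbor = gran_feats[i - 1] if i > 0 else gran_feats[0]  # assumed neighbor choice
        targets.append(clf(torch.cat([overall, neighbor], dim=-1)))
    return targets  # m target classification features

m, dim = 3, 64
clfs = [nn.Linear(2 * dim, dim) for _ in range(m)]
out = combine_with_neighbors(torch.randn(2, dim), [torch.randn(2, dim) for _ in range(m)], clfs)
```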
4. The method according to any one of claims 1 to 3, wherein the classifying the fusion features according to m classification granularities to obtain an overall classification feature and m granularity classification features of the target video comprises:
calling m cascaded label classifiers to perform data processing on the fusion features to obtain the m granularity classification features;
and fusing hidden layer features of the m label classifiers according to the cascade sequence of the m label classifiers to obtain the overall classification feature.
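The cascade of claim 4 can be pictured with the following sketch, in which each label classifier consumes the fusion features plus the previous classifier's hidden state, and the hidden states are concatenated in cascade order to form the overall classification feature; the recurrent-style chaining and the concatenation fusion are assumptions for illustration only.

```python
# Hypothetical cascaded label classifiers (claim 4): m stages share the fusion features,
# pass a hidden state down the cascade, and their hidden features are fused into the
# overall classification feature.
import torch
import torch.nn as nn

class CascadedClassifiers(nn.Module):
    def __init__(self, dim=512, m=3):
        super().__init__()
        self.stages = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(m))
        self.overall_fuse = nn.Linear(m * dim, dim)

    def forward(self, fused):                       # fused: (B, dim)
        hidden = torch.zeros_like(fused)
        hiddens = []
        for stage in self.stages:                   # cascade order: coarse -> fine
            hidden = torch.tanh(stage(torch.cat([fused, hidden], dim=-1)))
            hiddens.append(hidden)                  # hidden-layer feature, used as granularity feature here
        overall = self.overall_fuse(torch.cat(hiddens, dim=-1))
        return overall, hiddens                     # overall feature, m granularity features

overall, gran_feats = CascadedClassifiers()(torch.randn(2, 512))
```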
5. The method according to any one of claims 1 to 3, wherein the fusing the n multi-modal features to obtain fusion features comprises:
determining n weights corresponding to the n multi-modal features based on an attention mechanism;
and performing weighted calculation on the n multi-modal features according to the n weights to obtain the fusion features.
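A minimal sketch of the attention-based weighting of claim 5 follows; the two-layer scoring network is an assumption, and the only property relied on is that the n weights are non-negative and sum to one for each sample.

```python
# Hypothetical attention fusion (claim 5): score each modality, softmax the scores into
# n weights, and take the weighted sum of the n multi-modal features.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.Tanh(), nn.Linear(dim // 4, 1))

    def forward(self, modal_feats):                       # list of n tensors, each (B, dim)
        stacked = torch.stack(modal_feats, dim=1)         # (B, n, dim)
        weights = torch.softmax(self.score(stacked), 1)   # (B, n, 1), weights sum to 1 over n
        return (weights * stacked).sum(dim=1), weights.squeeze(-1)

fused, w = AttentionFusion()([torch.randn(2, 512) for _ in range(3)])
print(w.sum(dim=1))  # each row sums to 1.0
```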
6. The method of claim 5, wherein the n multi-modal features comprise image features, audio features, and text features;
the attention-based mechanism determining n weights corresponding to the n multi-modal features comprises:
determining an image weight for the image feature and an audio weight for the audio feature and a text weight for the text feature based on an attention mechanism;
the performing weighted calculation on the n multi-modal features according to the n weights to obtain the fusion features comprises:
and performing weighted calculation on the image features, the audio features and the text features according to the image weight, the audio weight and the text weight, and outputting the fusion features.
7. The method according to any one of claims 1 to 3, further comprising:
performing compression processing and activation processing on the fusion features, and outputting intermediate fusion features;
and calibrating the intermediate fusion features, and outputting the optimized fusion features.
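Claim 7's compression, activation, and calibration steps read like a squeeze-and-excitation style recalibration of the fusion features; the sketch below is an assumed realization with a reduction ratio of 16 and a sigmoid gate.

```python
# Hypothetical compress -> activate -> calibrate step (claim 7), squeeze-and-excitation style.
import torch
import torch.nn as nn

class FeatureCalibration(nn.Module):
    def __init__(self, dim=512, reduction=16):
        super().__init__()
        self.compress = nn.Linear(dim, dim // reduction)  # compression processing
        self.expand = nn.Linear(dim // reduction, dim)

    def forward(self, fused):                        # fused: (B, dim)
        mid = torch.relu(self.compress(fused))       # activation -> intermediate fusion features
        gate = torch.sigmoid(self.expand(mid))       # per-channel calibration weights
        return fused * gate                          # optimized (recalibrated) fusion features

optimized = FeatureCalibration()(torch.randn(2, 512))
```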
8. The method of any of claims 1 to 3, wherein the n multi-modal features comprise the image feature;
the method further comprises the following steps:
extracting a target video frame from the target video;
calling an image feature extraction network, carrying out data processing on the target video frame, and outputting the video frame feature of the target video frame;
and fusing the video frame features to obtain the image features of the target video.
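For claim 8, an assumed realization samples frames uniformly, runs each frame through an image backbone, and averages the per-frame features into a video-level image feature; the tiny stand-in CNN and the uniform sampling are placeholders, not the claimed image feature extraction network.

```python
# Hypothetical image feature extraction (claim 8): sample target video frames, extract
# per-frame features, then fuse them (mean pooling here) into the video's image feature.
import torch
import torch.nn as nn

backbone = nn.Sequential(                 # stand-in for a real image feature extraction network
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 512))

def video_image_feature(video_frames, num_samples=8):
    """video_frames: (T, 3, H, W) decoded frames of the target video."""
    idx = torch.linspace(0, video_frames.shape[0] - 1, num_samples).long()
    frame_feats = backbone(video_frames[idx])        # (num_samples, 512) video frame features
    return frame_feats.mean(dim=0)                   # fused video-level image feature

feat = video_image_feature(torch.randn(120, 3, 224, 224))
```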
9. The method according to any one of claims 1 to 3, wherein the n multi-modal features comprise the audio feature;
the method further comprises the following steps:
acquiring a Mel spectrogram of the target video;
and calling an audio feature extraction network, performing data processing on the Mel spectrogram, and outputting the audio features of the target video.
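An assumed realization of claim 9 computes a Mel spectrogram from the soundtrack and feeds it to an audio network; a VGGish-like CNN is a common choice for such a network, and the small CNN below is only a placeholder.

```python
# Hypothetical audio feature extraction (claim 9): Mel spectrogram of the soundtrack,
# then a CNN that outputs the video's audio features.
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024, hop_length=512, n_mels=64)
audio_net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 128))

waveform = torch.randn(1, 16000 * 10)                 # placeholder: 10 s of mono audio at 16 kHz
spectrogram = mel(waveform)                           # (1, 64, time) Mel spectrogram
audio_feature = audio_net(spectrogram.unsqueeze(0))   # video-level audio feature, (1, 128)
```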
10. The method of any of claims 1 to 3, wherein the n multi-modal features comprise the textual feature;
the method further comprises the following steps:
extracting text content in the target video;
and calling a text feature extraction network, carrying out data processing on the text content and the title of the target video, and outputting the text feature of the target video.
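For claim 10, the sketch below pairs the video title with text extracted from the video (for example OCR or speech transcripts) and encodes the pair with a text network; using a BERT encoder via the Hugging Face transformers library is an assumption of this example, not the claimed text feature extraction network.

```python
# Hypothetical text feature extraction (claim 10): encode title + in-video text jointly
# and take the [CLS] vector as the text feature. Requires downloading the checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

title = "example video title"                                # placeholder title
in_video_text = "text recognized from frames or speech"      # placeholder extracted content

inputs = tokenizer(title, in_video_text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    text_feature = encoder(**inputs).last_hidden_state[:, 0]  # (1, hidden) text feature
```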
11. A training method of a video classification model, wherein the video classification model comprises a feature extraction network layer, a feature fusion network layer and a classification network layer, the method comprising:
obtaining a sample training set, wherein the sample training set comprises a sample video and a real label corresponding to the sample video;
calling the feature extraction network layer, carrying out data processing on the sample video, and outputting n sample multi-modal features, wherein the n sample multi-modal features comprise at least two of sample image features, sample audio features and sample text features, and n is a positive integer greater than 1;
calling the feature fusion network layer, carrying out fusion processing on the n sample multi-modal features, and outputting sample fusion features;
calling the classification network layer, respectively performing data processing on the n sample multi-modal features and the sample fusion features, and respectively outputting n sample granularity classification labels and a sample fusion classification label;
respectively calculating the cross entropy between the n sample granularity classification labels and the real labels to obtain n granularity cross entropies; calculating the cross entropy between the sample fusion classification label and the real label to obtain a fusion cross entropy;
and training the video classification model according to the n granularity cross entropies and the fusion cross entropy.
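The claim 11 objective can be sketched as one cross-entropy term per single-modality prediction plus one for the fused prediction; the equal weighting of the terms below is an assumption.

```python
# Hypothetical claim 11 loss: n granularity cross entropies (one per modality) plus the
# fusion cross entropy, all computed against the real label.
import torch
import torch.nn.functional as F

def classification_loss(modal_logits, fused_logits, true_label):
    """modal_logits: list of n tensors (B, C); fused_logits: (B, C); true_label: (B,) int64."""
    granularity_ce = [F.cross_entropy(l, true_label) for l in modal_logits]  # n granularity cross entropies
    fusion_ce = F.cross_entropy(fused_logits, true_label)                    # fusion cross entropy
    return sum(granularity_ce) + fusion_ce                                   # assumed equal weighting

loss = classification_loss([torch.randn(4, 10) for _ in range(3)],
                           torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss)
```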
12. The method of claim 11, wherein the training the video classification model according to the n granularity cross entropies and the fusion cross entropy comprises:
respectively calculating the relative entropy between the n sample multi-modal features and the sample fusion features to obtain n granularity relative entropies;
calculating the sum of the n granularity relative entropies to obtain the relative entropy loss;
and training the video classification model according to the n granularity cross entropies and the fusion cross entropy, with minimizing the relative entropy loss as a training target.
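Claim 12's relative-entropy term can be sketched with KL divergences that pull each single-modality distribution toward the fused distribution; the KL direction and the use of classification logits to stand in for the "probability distributions" of the features are assumptions of this example.

```python
# Hypothetical claim 12 relative entropy loss: sum of KL(modality distribution || fused
# distribution) over the n modalities, minimized alongside the cross-entropy terms.
import torch
import torch.nn.functional as F

def relative_entropy_loss(modal_logits, fused_logits):
    fused_log_p = F.log_softmax(fused_logits, dim=-1)
    loss = 0.0
    for logits in modal_logits:
        modal_p = F.softmax(logits, dim=-1)
        # F.kl_div(input=log q, target=p) computes KL(p || q); here p = modality, q = fused.
        loss = loss + F.kl_div(fused_log_p, modal_p, reduction="batchmean")
    return loss  # sum of the n granularity relative entropies

kl = relative_entropy_loss([torch.randn(4, 10) for _ in range(3)], torch.randn(4, 10))
print(kl)
```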
13. The method of claim 12, wherein the n sample multi-modal features comprise the sample image features;
the method further comprises the following steps:
and calculating the relative entropy between the probability distribution of the sample image characteristics and the probability distribution of the sample fusion characteristics to obtain the image relative entropy in the n granularity relative entropies.
14. The method of claim 12, wherein the n sample multi-modal features comprise the sample audio features;
the method further comprises the following steps:
and calculating the relative entropy between the probability distribution of the sample audio features and the probability distribution of the sample fusion features to obtain the audio relative entropy in the n granularity relative entropies.
15. The method of claim 12, wherein the n sample multi-modal features comprise the sample text features;
the method further comprises the following steps:
and calculating the relative entropy between the probability distribution of the sample text features and the probability distribution of the sample fusion features to obtain the text relative entropy in the n granularity relative entropies.
16. An apparatus for video classification, the apparatus comprising:
a feature extraction module, configured to acquire n multi-modal features of a target video, wherein the n multi-modal features comprise at least two of image features, audio features and text features, and n is a positive integer greater than 1;
a fusion module, configured to fuse the n multi-modal features to obtain fusion features;
a classification module, configured to classify the fusion features according to m classification granularities to obtain an overall classification feature and m granularity classification features of the target video, wherein the m classification granularities are used for representing m interrelated granularities for classification under a target dimension, the m granularity classification features are used for representing classification features corresponding to the m classification granularities, and m is a positive integer;
wherein the classification module is further configured to obtain m-level classification labels of the target video according to the overall classification feature and the m granularity classification features, and the m-level classification labels are video labels arranged according to the m classification granularities.
17. A training device for a video classification model, wherein the video classification model comprises a feature extraction network layer, a feature fusion network layer and a classification network layer, the device comprising:
a sample acquisition module, configured to acquire a sample training set, wherein the sample training set comprises a sample video and a real label corresponding to the sample video;
a sample feature extraction module, configured to call the feature extraction network layer, perform data processing on the sample video, and output n sample multi-modal features, wherein the n sample multi-modal features comprise at least two of sample image features, sample audio features and sample text features, and n is a positive integer greater than 1;
a sample fusion module, configured to call the feature fusion network layer, perform fusion processing on the n sample multi-modal features, and output sample fusion features;
a sample classification module, configured to call the classification network layer, respectively perform data processing on the n sample multi-modal features and the sample fusion features, and respectively output n sample granularity classification labels and a sample fusion classification label;
a training module, configured to respectively calculate the cross entropy between the n sample granularity classification labels and the real labels to obtain n granularity cross entropies, and calculate the cross entropy between the sample fusion classification label and the real label to obtain a fusion cross entropy;
the training module is further configured to train the video classification model according to the n granularity cross entropies and the fusion cross entropy.
18. A computer device, characterized in that the computer device comprises: a processor and a memory, the memory having stored therein at least one program that is loaded and executed by the processor to implement the video classification method of any of claims 1 to 10 or the training method of the video classification model of any of claims 11 to 15.
19. A computer-readable storage medium, having at least one program code stored therein, which is loaded and executed by a processor to implement the method for video classification according to any one of claims 1 to 10 or the method for training a video classification model according to any one of claims 11 to 15.
20. A computer program product comprising a computer program or instructions, wherein the computer program or instructions, when executed by a processor, implement the video classification method of any of claims 1 to 10 or the training method of the video classification model of any of claims 11 to 15.
CN202210108236.1A 2022-01-28 2022-01-28 Video classification method, device, equipment and medium Pending CN114443899A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210108236.1A CN114443899A (en) 2022-01-28 2022-01-28 Video classification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210108236.1A CN114443899A (en) 2022-01-28 2022-01-28 Video classification method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114443899A true CN114443899A (en) 2022-05-06

Family

ID=81370760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210108236.1A Pending CN114443899A (en) 2022-01-28 2022-01-28 Video classification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114443899A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596312A (en) * 2022-05-07 2022-06-07 中国科学院深圳先进技术研究院 Video processing method and device
CN114596312B (en) * 2022-05-07 2022-08-02 中国科学院深圳先进技术研究院 Video processing method and device
CN116701708A (en) * 2023-07-27 2023-09-05 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN116701708B (en) * 2023-07-27 2023-11-17 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN117219265A (en) * 2023-10-07 2023-12-12 东北大学秦皇岛分校 Multi-mode data analysis method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
WO2020228376A1 (en) Text processing method and model training method and apparatus
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
CN106973244A (en) Using it is Weakly supervised for image match somebody with somebody captions
CN114443899A (en) Video classification method, device, equipment and medium
CN110377913B (en) Emotion analysis method and device, electronic equipment and storage medium
CN110413769A (en) Scene classification method, device, storage medium and its electronic equipment
CN114339450B (en) Video comment generation method, system, device and storage medium
CN111666416A (en) Method and apparatus for generating semantic matching model
CN108304376A (en) Determination method, apparatus, storage medium and the electronic device of text vector
Kathuria et al. Real time sentiment analysis on twitter data using deep learning (Keras)
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN116541492A (en) Data processing method and related equipment
CN114648031A (en) Text aspect level emotion recognition method based on bidirectional LSTM and multi-head attention mechanism
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN115455171A (en) Method, device, equipment and medium for mutual retrieval and model training of text videos
CN111538841A (en) Comment emotion analysis method, device and system based on knowledge mutual distillation
CN117197569A (en) Image auditing method, image auditing model training method, device and equipment
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN113239184B (en) Knowledge base acquisition method and device, computer equipment and storage medium
CN112989024B (en) Method, device and equipment for extracting relation of text content and storage medium
CN112528015B (en) Method and device for judging rumor in message interactive transmission
CN115186085A (en) Reply content processing method and interaction method of media content interaction content
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
JP2019133563A (en) Information processing apparatus and information processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination