CN112232164A - Video classification method and device - Google Patents

Video classification method and device

Info

Publication number
CN112232164A
CN112232164A (application number CN202011077360.3A)
Authority
CN
China
Prior art keywords
feature
vector
convolution
feature extraction
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011077360.3A
Other languages
Chinese (zh)
Inventor
陈观钦
陈远
王摘星
陈斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011077360.3A priority Critical patent/CN112232164A/en
Publication of CN112232164A publication Critical patent/CN112232164A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a video classification method and a video classification apparatus applied to the field of artificial intelligence. For a target image sequence taken from a target video, state information in preset scene state dimensions is recognized in each image to obtain a feature sequence of the target image sequence; a convolution operation is performed on the feature sequence by a video classification model to obtain a plurality of convolution feature vectors; each convolution feature vector is weighted and summed by the model to obtain an attention feature vector; and the classification result of the target video is determined by the model based on the attention feature vectors. The video classification problem is thereby converted into a classification problem over multi-dimensional time-series data, and combining multi-scale feature extraction with an attention mechanism in that classification effectively improves the classification performance.

Description

Video classification method and device
Technical Field
The application relates to the technical field of computer vision, in particular to a video classification method and device.
Background
In the information age, an ordinary user can act as a video producer, creating videos and uploading them to a video platform, and the video platform may classify these videos as required, for example to determine whether a video contains illegal information or whether a user's behavior in a game video is abnormal.
In the related art, a video may be classified with a neural network model: the video is input into the model, which extracts images from the video, extracts features from those images, and then classifies the video based on the extracted features. Such a neural network model requires a large amount of computing resources and training data.
Disclosure of Invention
The embodiment of the invention provides a video classification method and a video classification device, which convert the video classification problem into a classification problem over multi-dimensional time-series data, thereby avoiding the large amount of computing resources and training data required by end-to-end models in the related art, and which combine multi-scale feature extraction with an attention mechanism in the classification of the multi-dimensional time-series data to effectively improve the classification performance.
The embodiment of the invention provides a video classification method, which comprises the following steps:
acquiring a target image sequence, wherein the target image sequence comprises N images, the target image sequence is derived from a target video, and N is a positive integer greater than or equal to 1;
identifying the state information of each image in the target image sequence on a preset scene state dimension to obtain a state information subsequence of each image, and obtaining a feature sequence of the target image sequence based on the state information subsequence of each image;
performing convolution operation on the feature sequence through a plurality of feature extraction modules of a video classification model to obtain corresponding convolution feature vectors;
respectively carrying out weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector;
and determining a classification result of the target video according to the attention feature vector through a classification module of the video classification model.
An embodiment of the present invention provides a video classification device, including:
an image sequence obtaining unit, configured to obtain a target image sequence, where the target image sequence includes N images, where the target image sequence is derived from a target video, and N is a positive integer greater than or equal to 1;
the characteristic sequence acquisition unit is used for identifying the state information of each image in the target image sequence on a preset scene state dimension to obtain a state information subsequence of each image, and obtaining a characteristic sequence of the target image sequence based on the state information subsequence of each image;
the convolution unit is used for performing convolution operation on the feature sequence through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors;
the attention mechanism unit is used for respectively carrying out weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector;
and the classification unit is used for determining the classification result of the target video according to the attention feature vector through a classification module of the video classification model.
In some embodiments of the present invention, there may also be provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method as described above when executing the computer program.
In some embodiments of the invention, there may also be provided a storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the steps of the method as described above.
By adopting the method and the device, the target image sequence from the target video can be obtained, the state information of each image in the target image sequence on the preset scene state dimension is identified, the state information subsequence of each image is obtained, and the characteristic sequence of the target image sequence is obtained based on the state information subsequence of each image; performing convolution operation on the feature sequence through a plurality of feature extraction modules of a video classification model to obtain corresponding convolution feature vectors; respectively carrying out weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector; and determining a classification result of the target video according to the attention feature vector through a classification module of the video classification model. Therefore, by adopting the scheme of the embodiment, the video classification problem can be converted into the classification problem of multi-dimensional time sequence data, the problem that an end-to-end model needs a large amount of computing resources and training data in the related technology is avoided, and in the classification of the multi-dimensional time sequence data, multi-scale feature extraction and attention mechanism are combined, so that the classification effect can be effectively improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1a is a flowchart of a video classification method according to an embodiment of the present invention;
fig. 1b is a schematic diagram of a video auditing system according to an embodiment of the present invention;
FIG. 2a is a schematic diagram of obtaining a feature sequence according to an embodiment of the present invention;
FIG. 2b is a schematic structural diagram of a video classification model according to an embodiment of the present invention;
FIG. 2c is a more detailed structural diagram of a video classification model according to an embodiment of the present invention;
FIG. 2d is a schematic structural diagram of a feature recalibration module according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a video classification apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a video classification method, a video classification device, computer equipment and a storage medium.
The video classification method is suitable for computer equipment, and the computer equipment can be equipment such as a terminal or a server.
The terminal can be a mobile phone, a tablet computer, a notebook computer and other terminal equipment, and also can be wearable equipment, a smart television or other intelligent terminals with display modules.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, but is not limited thereto.
The video classification apparatus of this embodiment may be integrated in a terminal or a server, and optionally, may be integrated in the terminal or the server in the form of an application program or the like.
The video classification system provided by the embodiment can be used for video classification scenes, such as scenes for classifying game video anomalies (user behavior anomalies and the like). The video classification system may include a classification server, among other things.
The classification server may be configured to obtain a target image sequence, where the target image sequence includes N images, where the target image sequence is derived from a target video, and N is a positive integer greater than or equal to 1; identifying the state information of each image in the target image sequence on a preset scene state dimension to obtain a state information subsequence of each image, and obtaining a characteristic sequence of the target image sequence based on the state information subsequence of each image; performing convolution operation on the feature sequence through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors; respectively carrying out weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector; and determining a classification result of the target video according to the attention feature vector through a classification module of the video classification model.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiments of the present invention will be described from the perspective of a video classification device, which may be specifically integrated in a terminal or a server, for example, may be integrated in the terminal or the server in the form of an application program.
The video classification method provided by the embodiment of the invention can be executed by a processor of a terminal or a server, and the classification of a target video based on an image sequence feature sequence of the target video is realized based on a video classification model in the embodiment.
The video classification model of this embodiment is implemented based on Computer Vision (CV) technology. Computer Vision is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to recognize, track and measure targets, and further processes the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques and attempts to build artificial intelligence systems that can capture information from images or multi-dimensional data. Computer vision technology generally includes image processing, image recognition, Image Semantic Understanding (ISU), image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
As shown in fig. 1a, the flow of the video classification method may be as follows:
101. acquiring a target image sequence, wherein the target image sequence comprises N images, the target image sequence is derived from a target video, and N is a positive integer greater than or equal to 1;
in this embodiment, the target video may be any type of video, such as a television show, a movie, a game video, a live video, an animation, a variety program, and the like, which is not limited in this embodiment.
102. Identifying the state information of each image in the target image sequence on a preset scene state dimension to obtain a state information subsequence of each image, and obtaining a characteristic sequence of the target image sequence based on the state information subsequence of each image;
It can be understood that the dimensions of the feature sequence include a feature category dimension and a time dimension: each state information subsequence in the feature sequence corresponds to one time point in the time dimension, and each item of state information in a state information subsequence corresponds to one feature category in the feature category dimension. In a state information subsequence, a preset scene state dimension can thus be understood as a feature category. After the video classification model processes the state information subsequences, the processed sequences can also be considered to have a time dimension and a feature category dimension, but these are the time and feature category dimensions as understood by the model.
In this embodiment, the classification task of the video classification model is not limited and may be any classification task related to a video; the specific task may be decided by the designer of the video classification model. For example, for a game video the classification task may be video anomaly classification, more specifically classifying whether the behavior of a target virtual user in the game video is abnormal or whether the target virtual user uses a cheating plug-in, or, for a television series, classifying whether sensitive or illegal information exists in the video, and so on; this embodiment is not limited thereto.
For different classification tasks, different scene state dimensions can be set according to different scene state information of videos required by the classification tasks. The scene state information of the video may be understood as state information of any object included in the video, such as state information of a virtual character in the video, state information of a weapon of the virtual character, game progress state information of the virtual character, distance state information of the virtual character from a specific position, and the like. Wherein, a scene state information can be correspondingly set with a preset scene state dimension.
Taking a game video as an example, assume the classification task of the video classification model is to identify whether a virtual user cheats. The scene state information required to identify virtual character cheating may, based on experience, include the status information of the virtual character, the status information of the virtual character's weapon, and the status information of the virtual character's enemies (which is generally displayed in the video). The preset scene state dimensions may then be set to include a status dimension of the virtual character, a weapon status dimension of the virtual character, an enemy status dimension of the virtual character, and so on.
It can be understood that the preset scene state dimension may also be different according to different video classification tasks.
In this embodiment, before step 101, the method may include:
performing frame processing on a target video to obtain an image sequence of the target video;
a target image sequence is obtained from an image sequence of a target video.
When the target image sequence is obtained from the image sequence of the target video, the acquisition interval of the target video can be determined according to the length of the target video, and the image frames are obtained from the image sequence based on the acquisition interval to form the target image sequence.
Or, when the target image sequence is obtained from the image sequence of the target video, the image sequence with the preset duration may be extracted as the target image sequence.
Referring to fig. 1b, this embodiment further provides a video auditing system, and the processing before step 101 can be regarded as data preprocessing of the target video. The video auditing system includes the master machine (master) in fig. 1b, the slave machines (slave(1) to slave(n)) connected to the master, and an auditing device connected to the slave machines; the auditing device of this embodiment deploys the video classification model described below and performs the classification task for the target video based on the target image sequence.
In fig. 1b, the master machine is connected to a manual review platform and a video database. When a video pull condition is satisfied (for example, at a preset video pull interval), the master obtains a target video to be reviewed from the video database and sends it to one of the slave machines; the slave obtains the target image sequence through the data preprocessing steps and then sends it to the auditing device for video classification. The master can decide which slave the target video is sent to based on the transmission bandwidth and the current data processing load of each slave, and it also receives, via the slaves, the video classification result returned by the auditing device. Based on the video classification result, the master determines whether the target video needs manual review and, if so, sends the target video to the manual review platform to trigger manual review.
In this embodiment, the auditing device may be integrated on a computer, and the computer is provided with a GPU (Graphics Processing Unit) cluster or a Docker (container) machine.
The slave machine can perform framing on the target video using the video operation functions of OpenCV to obtain the image sequence of the target video.
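As an illustration of this framing and sampling step, the following is a minimal sketch assuming OpenCV's Python bindings; the sampling interval and frame cap are hypothetical parameters, not values taken from the patent.

```python
import cv2

def sample_frames(video_path, sample_interval=30, max_frames=None):
    """Decode a video and keep every `sample_interval`-th frame as the target image sequence."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video or decoding failure
            break
        if index % sample_interval == 0:
            frames.append(frame)        # BGR image array of shape (H, W, 3)
            if max_frames is not None and len(frames) >= max_frames:
                break
        index += 1
    cap.release()
    return frames
```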
In the video auditing system shown in fig. 1b, two parts of data preprocessing and video classification are separated, which is beneficial to improving the prediction efficiency and processing a larger amount of video data.
In this embodiment, the step of "identifying the state information of each image in the target image sequence in the preset scene state dimension to obtain the state information subsequence of each image, and obtaining the feature sequence of the target image sequence based on the state information subsequence of each image" may include:
extracting state information from a plurality of preset scene state dimensions corresponding to the classification task for each image in the target image sequence to obtain a state information subsequence of each image;
and determining a sequence obtained by arranging the state information subsequences of the images according to the time dimension as a characteristic sequence of the target image sequence.
It can be understood that the number of state information of each image in the feature sequence in the feature category dimension is equal to the number of preset scene state dimensions.
For example, take a target image sequence consisting of images 1 to L. State information is extracted from each image in M preset scene state dimensions, so each image has M feature categories in the feature category dimension and M corresponding items of state information, which form an M-dimensional state information subsequence. Arranging the M-dimensional state information subsequences of images 1 to L along the time dimension (e.g. by the display order of the images in the target video) yields a feature sequence (L, M) of L columns and M rows, where L denotes the L time points in the time dimension and M denotes the M feature categories in the feature category dimension, i.e. M items of state information at each time point.
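The construction of the (L, M) feature sequence can be pictured with the following sketch; `recognize_states` stands in for the per-image state recognition described above and is a hypothetical callable, not part of the patent.

```python
import numpy as np

def build_feature_sequence(images, recognize_states):
    """Stack per-image state information subsequences into an (L, M) feature sequence.

    `recognize_states(image)` is assumed to return the M state values of one image,
    one per preset scene state dimension, in a fixed order."""
    subsequences = [recognize_states(img) for img in images]   # L subsequences of length M
    return np.asarray(subsequences, dtype=np.float32)          # shape (L, M): time x feature category
```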
In this embodiment, the preset scene state dimensions in step 102 may be state dimensions of a character (the target virtual object) in the scene of the video images, and the character may have multiple sub-dimensions. For example, the preset scene state dimensions may include a perspective state dimension of the character (state information 1 if the character is in the perspective state, 0 otherwise), an occlusion state dimension of the character (2 if occluded, 3 otherwise), a double-scope state dimension of the character (4 if the double scope is open, 5 if it is closed), a distance state dimension between the character and the shooting target (the actual distance value is used as the state information), and so on.
In this embodiment, for the classification task, state information of some preset scene state dimensions may be obtained from a relatively fixed position in an image of the target video, so that an area image sequence corresponding to the fixed position may be cut out from the target image sequence of the target video based on the fixed position, and then the state information is extracted from the preset scene state dimensions for the area image sequence to obtain a state information subsequence of each image.
Optionally, the step of "identifying state information of each image in the target image sequence in a preset scene state dimension to obtain a state information subsequence of each image, and obtaining a feature sequence of the target image sequence based on the state information subsequence of each image" may include:
determining a key area corresponding to state information of each preset scene state dimension in an image of a target image sequence under a classification task;
cutting key areas from each image of the target image sequence to obtain a key area sequence;
and extracting state information of the key area of each image in the key area sequence from the corresponding preset scene state dimension to obtain a state information subsequence of each image, so as to obtain a characteristic sequence formed by the state information subsequence of each image based on the time dimension.
In one example, the key area may be the source area of the state information of a certain preset scene state dimension, and that area may be fixed. For example, in a game video the player's kill status is usually displayed in a fixed area, so for the player's kill-status dimension the key area may be a fixed area defined by preset position information, e.g. the two diagonal corner coordinates (20, 20) and (40, 40).
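For such a fixed key area, the crop is essentially an array slice; a minimal sketch is shown below, where the default corner coordinates simply reuse the illustrative (20, 20) and (40, 40) values above.

```python
def crop_fixed_key_region(frame, top_left=(20, 20), bottom_right=(40, 40)):
    """Cut a fixed key area (e.g. the player's kill-status display) out of one frame.

    `frame` is an (H, W, C) image array; the corners are (x, y) pixel coordinates."""
    (x1, y1), (x2, y2) = top_left, bottom_right
    return frame[y1:y2, x1:x2]   # rows are indexed by y, columns by x
```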
In another example, the key regions may be identified by a trained key region identification model. A training sample of the key region identification model comprises a sample image sequence of a video, and the label of each image in the sequence includes the position information of the expected key region corresponding to each preset scene state dimension of that image. Trained on such sample image sequences, the key region identification model can learn what the expected key regions of the state information of each preset scene state dimension have in common, and can thus identify the key region of each preset scene state dimension in an image.
In one example, the step of "extracting state information of a key region of each image in the key region sequence from a corresponding preset scene state dimension to obtain a state information subsequence of each image, thereby obtaining a feature sequence formed by the state information subsequence of each image based on a time dimension" may be implemented by an image classification model, where a classification task of the image classification model includes a classification task in each preset scene state dimension, and the state information in each preset scene state dimension is a result obtained by classifying the key region in each preset scene state dimension.
The image classification model can also be obtained by training on a sample image sequence. For the sample image sequence used to train the image classification model, the labels of the images in the sequence also include the expected key regions corresponding to each preset scene state dimension and the expected state information in the corresponding preset scene state dimension.
The process from the target video to the feature sequence extraction is described below with reference to fig. 2 a.
For example, the target video is first subjected to framing, sampling and similar operations to obtain a target image sequence, assumed here to contain 4 frames. The preset scene state dimensions include the player's perspective state dimension, the player's occlusion state dimension, the player's double-scope state dimension, the distance state dimension between the player and the shooting target, and the player's kill-status dimension. Assuming the classification task of the video classification model is to identify whether the player (virtual object) in the video cheats in the game, the player's kill-status dimension is an important dimension; it can therefore be set as a positioning-point dimension, and the state information of the positioning-point dimension can be acquired separately.
For the target image sequence, the key area of the positioning-point dimension can be identified and cut out by a first key region identification model to obtain a first key area sequence, and the player's kill status is identified by an image classification model from each key area in the first key area sequence (kill-status information 6 if the player is killing an enemy, 7 otherwise), so that the state information of the 4 frames in the positioning-point dimension is 6, 6, 7 and 7, respectively.
For the target image sequence, the key regions of the non-positioning-point dimensions relevant to the classification task are identified and cut out by a second key region identification model to obtain a second key area sequence. Based on the key areas corresponding to each image in the second key area sequence, the image classification model identifies the player's state information in the perspective, occlusion and double-scope state dimensions and in the distance state dimension between the player and the shooting target, giving a partial state information subsequence for each of the 4 frames, e.g. (1, 3, 4, 50), (0, 2, 5, 60), (0, 3, 4, 100) and (1, 2, 4, 80). Fusing the positioning-point state information with these subsequences along the time dimension yields the feature sequence (1, 3, 4, 50, 6), (0, 2, 5, 60, 6), (0, 3, 4, 100, 7) and (1, 2, 4, 80, 7).
The first key region identification model and the second key region identification model can be two different models (with different structures, or the same structure but different parameters), which can improve the accuracy of the state information in the positioning-point dimension and thereby the classification accuracy of the video classification model.
In this embodiment, the video classification model is implemented based on an ISU (Image Semantic Understanding) technology in a computer vision technology, such as an IC (Image classification) technology and an Image feature extraction technology.
The training of the video classification model is implemented based on an AI (Artificial intelligence) technology, especially based on an ML (Machine Learning) technology in the Artificial intelligence technology, and more specifically, may be implemented by Deep Learning (Deep Learning) in the Machine Learning.
103. Performing convolution operation on the feature sequence through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors;
the plurality of feature extraction modules are at least two feature extraction modules. The convolution operation performed on the feature sequence may be a one-dimensional convolution operation.
Referring to fig. 2b, the video classification model of the present embodiment may include an embedding module, a feature extraction module, an attention mechanism module and a classification module. The classification module comprises a fusion submodule and a classification submodule.
Optionally, before the step of performing convolution operation on the feature sequence through the plurality of feature extraction modules of the video classification model to obtain the corresponding convolution feature vector, the method may further include:
mapping the characteristic sequence based on a time dimension through an embedding module of a video classification model to obtain an embedded characteristic vector with the time dimension;
the step of performing convolution operation on the feature sequence through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors may include:
and carrying out convolution operation on the embedded feature vectors through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors.
In this embodiment, the main purpose of the embedding module is to convert the state information into information in a dimension the neural network can understand. Optionally, the embedding module may contain one or more embedding layers; this embodiment is not limited in this respect.
In one example, the embedding module includes an embedding fusion layer, and a first embedding layer corresponding to each of the status information subsequences, each of the first embedding layers mapping one of the status information subsequences to one of the vector subsequences. Optionally, the step of mapping, by an embedding module of the video classification model, the feature sequence based on the time dimension to obtain an embedded feature vector with the time dimension may include:
performing vector embedding on the state information in the state information subsequence of each image through the first embedding layers to obtain a first vector subsequence corresponding to each state information subsequence;
and fusing the first vector subsequences based on the time dimension through the embedding fusion layer to obtain the embedded feature vector.
The first embedding layer may perform vector embedding (i.e. vector mapping, converting the state information into vectors the neural network can understand) on each item of state information in a state information subsequence using ID Embedding. During embedding, the first embedding layer maps each item of state information in the subsequence into a first embedding vector space, converting each item into an M-dimensional first embedding sub-vector; the M-dimensional first embedding sub-vectors of the items in the same state information subsequence are then added element-wise, converting the state information subsequence into an M-dimensional sequence.
In one example, the first embedding layer comprises a fully connected layer; if the state information of a state information subsequence contains continuous values, the continuous-valued state information is converted into an M-dimensional first embedding sub-vector (a dense vector) by the fully connected layer of the first embedding layer.
For example, referring to fig. 2b, feature vectors 1 to L are the L state information subsequences; each first embedding layer outputs a first vector subsequence of M dimensions (or some other dimensionality), and the embedding fusion layer arranges the L M-dimensional first vector subsequences along the time dimension (of length L) to obtain an embedded feature vector of dimensions (L, M).
In one example, the embedding module includes an embedding fusion layer, and a second embedding layer corresponding to each of the status information subsequences, each of the second embedding layers embedding for one of the status information subsequences. Optionally, the step of mapping, by an embedding module of the video classification model, the feature sequence based on the time dimension to obtain an embedded feature vector with the time dimension may include:
converting the state information in each state information subsequence into one-hot vectors through the second embedding layer and then splicing them to obtain a spliced vector, and performing vector embedding on the spliced vector corresponding to each state information subsequence to obtain a second vector subsequence corresponding to each state information subsequence;
and fusing the second vector subsequences based on the time dimension through the embedding fusion layer to obtain the embedded feature vector.
When the second embedding layer converts the state information in a state information subsequence into one-hot vectors, if some state information takes continuous values, those continuous values are not converted into one-hot vectors but are directly spliced with the converted one-hot vectors of the other state information. For the spliced vector, the second embedding layer may use a fully connected layer to map it to the second vector subsequence (a dense vector).
For example, referring again to fig. 2b, feature vectors 1 to L are the L state information subsequences; each second embedding layer outputs an M-dimensional second vector subsequence, and the embedding fusion layer arranges the L M-dimensional second vector subsequences along the time dimension (of length L) to obtain an embedded feature vector of dimensions (L, M).
In this embodiment, either of the two embedding schemes may be used, and in one example the two schemes are combined: the embedding module then contains an embedding fusion layer, a first embedding layer and a second embedding layer, and the embedding fusion layer fuses the vector subsequences output by the first and second embedding layers. In this scheme the embedded feature vectors have the advantages of both embedding modes, carry richer information, and can improve the classification accuracy of the video classification task.
Optionally, the step of mapping, by an embedding module of the video classification model, the feature sequence based on the time dimension to obtain an embedded feature vector with the time dimension may include:
performing vector embedding on the state information in the state information subsequence of each image through the first embedding layer to obtain a first vector subsequence corresponding to each state information subsequence;
converting the state information in each state information subsequence into one-hot vectors through the second embedding layer and then splicing them to obtain a spliced vector, and performing vector embedding on the spliced vector corresponding to each state information subsequence to obtain a second vector subsequence corresponding to each state information subsequence;
and fusing, through the embedding fusion layer and based on the time dimension, the first vector subsequence and the second vector subsequence corresponding to each state information subsequence to obtain an embedded feature vector with a time dimension.
When the first and second vector subsequences are fused based on the time dimension, the first vector subsequences may be arranged along the time dimension to form the feature matrix of a first channel, and the second vector subsequences arranged along the time dimension to form the feature matrix of a second channel; the two feature matrices then form a two-channel feature matrix (similar to the RGB three-channel feature map of an image), giving the embedded feature vector. After this combination, the data dimensions of the embedded feature vector are (L, M, 2), where 2 denotes the two channels.
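A minimal PyTorch-style sketch of the two-branch embedding module described above is given below. The layer sizes, the handling of continuous values (routed only through the second branch here for brevity, whereas the patent also allows a fully connected layer in the first branch), and the class and argument names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingModule(nn.Module):
    """Sketch of the embedding module: branch 1 is an ID embedding of the categorical state
    values summed over the subsequence; branch 2 one-hot encodes the categorical values,
    splices them with the continuous values and maps them to M dimensions with a fully
    connected layer; the two branches are stacked as two channels, giving (L, M, 2)."""

    def __init__(self, num_state_values, num_categorical, num_continuous, m_dim=32):
        super().__init__()
        self.num_state_values = num_state_values
        self.id_embedding = nn.Embedding(num_state_values, m_dim)   # first embedding layer
        self.fc = nn.Linear(num_categorical * num_state_values + num_continuous, m_dim)  # second embedding layer

    def forward(self, categorical, continuous):
        # categorical: (L, num_categorical) integer state codes (torch.long)
        # continuous:  (L, num_continuous) real-valued state information (torch.float)
        branch1 = self.id_embedding(categorical).sum(dim=1)                 # (L, m_dim)
        one_hot = F.one_hot(categorical, self.num_state_values).float()     # (L, num_categorical, num_state_values)
        spliced = torch.cat([one_hot.flatten(1), continuous], dim=1)        # splice one-hot and continuous values
        branch2 = self.fc(spliced)                                          # (L, m_dim)
        return torch.stack([branch1, branch2], dim=-1)                      # (L, m_dim, 2): two channels
```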
A convolution window is the matrix with which a convolution layer of a convolutional neural network convolves the data; it is used to extract feature information from the input data, and its width determines how much input data participates in extracting one piece of feature information. In this embodiment, the convolution operation of the feature extraction module on the embedded feature vector may be a one-dimensional convolution over the time-series data, and the convolution window width determines which vector subsequences along the time dimension participate in feature extraction at the same time.
In this embodiment, the feature extraction module may be implemented based on a structure of a CNN (Convolutional Neural Network), and optionally, in an example, the feature extraction module may include multiple feature extraction layers, the feature extraction layers may be sequentially connected, the feature extraction layers may be Convolutional layers or gated Convolutional layers, one Convolutional layer may include multiple Convolutional kernels, and Convolutional kernel parameters of different Convolutional kernels in the same feature extraction layer may be different.
Optionally, the convolution window widths of the feature extraction layers of different feature extraction modules are different; therefore, different convolution feature vectors extracted by different feature extraction modules are different, different feature information can be provided for classification tasks of the video classification model, the richness of the feature information for video classification is improved, and the accuracy of video classification results is improved.
Optionally, the step of performing convolution operation on the embedded feature vector through a plurality of feature extraction modules of the video classification model to obtain a corresponding convolution feature vector may include:
in each feature extraction module, performing one-dimensional convolution operation on the embedded feature vectors based on the connection sequence of the feature extraction layers and the width of a convolution window contained in the feature extraction module to obtain convolution vectors corresponding to each feature extraction layer;
and obtaining the convolution characteristic vectors corresponding to the characteristic extraction modules based on the convolution vectors of the characteristic extraction layers in the characteristic extraction modules.
In this embodiment, the sliding steps of different feature extraction modules may be the same or different; optionally, the sliding step of every feature extraction module may be set to 1.
the convolution vector output by the feature extraction layer positioned at the last layer in the connection sequence of the feature extraction layers can be used as a convolution feature vector, so that the convolution feature vector of one feature extraction module can comprise convolution features extracted by adopting various convolution kernel parameters, and the information richness of the convolution feature vector is improved.
In an example, a specific scheme of obtaining the convolution vector through the feature extraction module may include:
performing, in each feature extraction module, convolution on the vector input to the current feature extraction layer through the current feature extraction layer to obtain the convolution vector of the current feature extraction layer, and inputting that convolution vector to the next feature extraction layer following the current one in the connection order, until there is no next feature extraction layer;
where, if the current feature extraction layer is the first feature extraction layer in the connection order, the vector input to it is the embedded feature vector.
For example, referring to fig. 2c, each of feature extraction modules 1 to N contains three feature extraction layers connected in sequence. The lowest feature extraction layer performs shallow feature extraction on the two-channel feature matrix (rows correspond to the multi-dimensional variables m at each time point, i.e. the feature category dimension; columns correspond to the sequence length L, i.e. the time dimension), namely the embedded feature vector, by a sliding one-dimensional convolution, and feeds the result to the second feature extraction layer; the output of the second layer serves as the input of the topmost layer, and the vector output by the third feature extraction layer can finally be used as the convolution vector.
The convolution windows of different feature extraction modules have different widths, so the feature extraction modules together implement a multi-scale one-dimensional convolution. For example, feature extraction layers 1_1 to N_1 in fig. 2c represent one-dimensional convolutions with N different convolution window widths, extracting n-gram features of different sequence lengths. A feature extraction layer may contain multiple convolution kernels (for example 32); the feature extraction modules are independent of each other, the convolution of each window width runs separately, and since there are several kernels of each width, kernels with different parameters can extract different aspects of the features; for example, one feature extraction layer with 32 kernels yields 32 features.
In one example of the present embodiment, the types of feature extraction layers include convolutional layers and gated convolutional layers;
the convolutional layer and the gated convolutional layer may be set in the same feature extraction module, for example, in fig. 2c, the lowest feature extraction layer is set as the convolutional layer, and the upper two feature extraction layers are set as the gated convolutional layers.
Optionally, the step "in each feature extraction module, based on the connection order of the feature extraction layers included in the feature extraction module and the convolution window width, performing one-dimensional convolution operation on the embedded feature vector to obtain the convolution vector corresponding to each feature extraction layer" includes:
when the current feature extraction layer in the feature extraction module is a convolutional layer, convolving the vector input to the current feature extraction layer through the current feature extraction layer to obtain its convolution vector, and inputting that convolution vector to the next feature extraction layer following the current one in the connection order, until there is no next feature extraction layer;
when the current feature extraction layer in the feature extraction module is a gated convolutional layer, convolving the vector input to the current feature extraction layer with a first convolution kernel parameter to obtain a first convolution sub-vector, convolving the same input with a second convolution kernel parameter to obtain a second convolution sub-vector, transforming the first convolution sub-vector with a transformation function and multiplying it element-wise with the second convolution sub-vector to obtain the convolution vector of the current feature extraction layer, and inputting that convolution vector to the next feature extraction layer following the current one in the connection order, until there is no next feature extraction layer;
where, if the current feature extraction layer is the first feature extraction layer of its feature extraction module in the connection order, the vector input to it is the embedded feature vector.
This embodiment borrows the gating mechanism, which better handles the long sequence data and the many useless or redundant sequence features.
Specifically, assume the vector input to the gated convolutional layer is a sequence vector X. Two one-dimensional convolutions with the same window width but different parameters are applied to X: referring to formulas (1) and (2), X is convolved with a first convolution kernel parameter W1 and with a second convolution kernel parameter W2; then, referring to formula (3), one of the convolution outputs is passed through a transformation function (such as the Sigmoid function) and multiplied element-wise with the other convolution output, giving the output of the gated convolutional layer. The gated convolution structure can gate each subsequence of the same window width in the input sequence, strengthening the subsequences that matter for the target classification task and weakening the irrelevant ones, since part of the subsequences in a video scene are redundant and irrelevant. The formulas of the gated convolution structure are shown below, where * denotes the one-dimensional convolution operation and ⊙ denotes element-wise multiplication.
conv1(X)=W1*X+B1 (1)
conv2(X)=W2*X+B2 (2)
out=Sigmoid(conv1(X))⊙conv2(X) (3)
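A minimal PyTorch-style sketch of a gated convolutional layer implementing formulas (1) to (3) might look as follows; the channel counts and window width are illustrative, and the padding is added only so the sequence length is preserved (an assumption, not stated in the patent).

```python
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated one-dimensional convolution: out = Sigmoid(conv1(X)) ⊙ conv2(X), as in (1)-(3)."""

    def __init__(self, in_channels, out_channels, window_width):
        super().__init__()
        padding = window_width // 2          # keeps the sequence length L for odd window widths
        self.conv1 = nn.Conv1d(in_channels, out_channels, window_width, padding=padding)  # W1 * X + B1
        self.conv2 = nn.Conv1d(in_channels, out_channels, window_width, padding=padding)  # W2 * X + B2
        self.gate = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, channels, L) time-series features
        return self.gate(self.conv1(x)) * self.conv2(x)   # element-wise gating of each subsequence
```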
In one example, the feature extraction layer may further use residual connections to preserve the information of shallow convolution vectors in the convolution feature vector; optionally, referring to fig. 2c, the feature extraction module further includes a residual addition layer.
The step of performing, in each feature extraction module, one-dimensional convolution operation on the embedded feature vector based on the connection order of the feature extraction layers included in the feature extraction module and the convolution window width to obtain convolution vectors corresponding to each feature extraction layer may include:
performing, in each feature extraction module, convolution on the vector input to the current feature extraction layer through the current feature extraction layer to obtain the convolution vector of the current feature extraction layer, and inputting that convolution vector to the residual addition layer corresponding to the current feature extraction layer and to the residual addition layer corresponding to the feature extraction layer one level above the current one in the connection order, where, if the current feature extraction layer is the first feature extraction layer of its feature extraction module in the connection order, the vector input to it is the embedded feature vector;
and summing, through the residual addition layer corresponding to the current feature extraction layer, the convolution vector output by the current feature extraction layer and the convolution vector output by the feature extraction layer preceding the current one in the connection order, and inputting the summed vector to the feature extraction layer following the current one in the connection order.
The number of residual addition layers can be obtained by subtracting 2 from the number of feature extraction layers.
For example, again taking fig. 2c: the three feature extraction layers are provided with one residual addition layer; the lowest feature extraction layer feeds its convolution vector to both the residual addition layer and the second feature extraction layer, and the residual addition layer adds the convolution vectors extracted by the first and second feature extraction layers and feeds the result to the third feature extraction layer.
In this embodiment, the numbers of convolution kernels of different feature extraction layers in the same feature extraction module may be the same or different; in one example they are set to be the same, e.g. 32. The convolution vector output by the third feature extraction layer is then 32-dimensional in the feature category dimension, and, assuming the sliding step of the one-dimensional convolution is 1, it is a feature matrix of shape (L, 32).
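Putting the pieces together, one feature extraction module of fig. 2c could be sketched as below: a convolutional layer followed by two gated convolutional layers of the same window width, with the residual addition layer summing the outputs of the first two layers before the third. It reuses the GatedConv1d sketch above; the 32-kernel count and stride 1 follow the text, everything else is an assumption.

```python
import torch.nn as nn

class FeatureExtractionModule(nn.Module):
    """Sketch of one multi-scale branch: Conv1d -> GatedConv1d -> GatedConv1d with a residual
    addition between the first and second layer outputs (32 kernels, sliding step 1)."""

    def __init__(self, in_channels, window_width, num_kernels=32):
        super().__init__()
        padding = window_width // 2
        self.layer1 = nn.Conv1d(in_channels, num_kernels, window_width, padding=padding)
        self.layer2 = GatedConv1d(num_kernels, num_kernels, window_width)   # from the previous sketch
        self.layer3 = GatedConv1d(num_kernels, num_kernels, window_width)

    def forward(self, x):
        # x: (batch, in_channels, L) embedded feature vectors
        h1 = self.layer1(x)        # shallow features from the lowest feature extraction layer
        h2 = self.layer2(h1)
        h3 = self.layer3(h1 + h2)  # residual addition layer: sum of layer-1 and layer-2 outputs
        return h3                  # convolution feature vector, roughly (batch, 32, L)
```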
104. Respectively carrying out weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector;
the attention mechanism module in this embodiment utilizes an attention mechanism. The attention mechanism may enable the neural network to have the ability to focus on a subset of its inputs (or features), i.e., select a particular input.
The attention mechanism can be intuitively interpreted using the human visual mechanism. For example, our vision system tends to focus on some information in the image that assists in the determination, and ignore irrelevant information. Also, in questions related to language or vision, some parts of the input may be more helpful to decision making than others. For example, in a video classification task, some features in the input sequence may have a stronger impact on the classification task than other features, so the attention mechanism module may help perform the classification task by allowing the model to dynamically focus on portions of the convolved feature vectors that help perform the classification task, such as by weighting portions of the convolved feature vectors that help perform the classification task more heavily.
Thus, based on the attention mechanism, the information ratio of the features important for the classification task in the convolution feature vector can be further enhanced.
In this embodiment, the attention mechanism module includes an attention weight matrix corresponding to each feature extraction module, and each attention weight matrix includes an attention weight corresponding to a vector subsequence of a convolution feature vector at each time point in the time dimension, that is, in this embodiment, the convolution feature vector of each feature extraction module may be weighted by using a corresponding attention weight matrix.
The step of respectively performing weighted summation on each convolution feature vector according to the corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector may include:
and performing weighted summation on the vector subsequences of the convolution characteristic vectors output by the characteristic extraction modules at each time point in the time dimension according to the attention weights corresponding to the vector subsequences at each time point through the attention weight matrix corresponding to each characteristic extraction module in the attention mechanism module to obtain the attention characteristic vectors of each convolution characteristic vector.
For example, taking a convolution feature vector (L, 32) as an example, the attention weight matrix sets one weight for the vector subsequence (32-dimensional) of each of the L rows, i.e., the 32 vector elements in each row share one weight. Thus, through the processing of the attention mechanism, the important segments in the convolution feature vector (i.e., the vector subsequences that are important in the time dimension) can be highlighted.
The specific principle of the attention mechanism is shown in the following formula (4), where H is the convolution feature vector output by the feature extraction module, W and V are learnable parameters, and a weight a_i (the i-th element of a) is learned for each of the L rows of the convolution feature vector matrix:

a = softmax(V·tanh(W·H^T))    (4)
Here, tanh refers to the hyperbolic tangent function, one of the hyperbolic functions.
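A minimal sketch of formula (4) follows; the hidden size d_a and the random H, W and V are placeholders assumed only for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

L, C, d_a = 100, 32, 16                # assumed sizes
rng = np.random.default_rng(1)
H = rng.normal(size=(L, C))            # convolution feature vector (L, 32)
W = rng.normal(size=(d_a, C)) * 0.1    # learnable parameter W
V = rng.normal(size=(1, d_a)) * 0.1    # learnable parameter V

scores = V @ np.tanh(W @ H.T)          # (1, L): one score per time point, as in formula (4)
a = softmax(scores.ravel())            # attention weights a_i, summing to 1
attention_vector = a @ H               # weighted sum over the L rows -> 32-dimensional attention feature vector
print(attention_vector.shape)
```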
105. And determining a classification result of the target video according to the attention feature vector through a classification module of the video classification model.
Optionally, in this embodiment, the classification module may include a fusion sub-module and a classification sub-module. Step 105 may specifically include:
fusing each attention characteristic vector through a fusion submodule of the video classification model to obtain a fused vector;
and determining a classification result of the target video according to the fused vector through a classification submodule of the video classification model.
Based on the foregoing description, the convolution feature vector also has a time dimension and a feature type dimension: the third vector subsequence of the convolution feature vector at each time point of the time dimension is composed of the vectors corresponding, at that time point, to each feature type of the feature type dimension, and the fourth vector subsequence of the convolution feature vector at each feature type of the feature type dimension is composed of the vectors corresponding, for that feature type, to each time point of the time dimension.
Optionally, referring to fig. 2c, in the video classification model, the video classification model further includes pooling modules corresponding to the feature extraction modules; the fusion submodule includes: a feature fusion submodule and a multi-scale feature fusion submodule;
the video classification method of this embodiment may further include:
inputting the convolution feature vectors output by each feature extraction module into the corresponding pooling module;
and performing a pooling operation on each fourth vector subsequence in the convolution feature vector through the pooling module to obtain a pooled feature vector corresponding to the convolution feature vector.
The pooling operation may be a maximum pooling operation (max-pooling), which takes the point with the largest value in the local receptive field; the pooling operation may also be mean pooling (mean-pooling), i.e., averaging all values in the local receptive field.

For example, again taking a convolution feature vector (L, 32) as an example, where L represents the time dimension and 32 represents the feature class dimension, the 32 fourth vector subsequences on the 32 feature classes are respectively subjected to maximum pooling, that is, the maximum vector element is taken from the L vectors in each column, and the obtained 32-dimensional output vector is the pooled feature vector. Performing mean pooling on the fourth vector subsequences of a convolution feature vector (L, M), namely averaging the L vector elements of each column, yields an M-dimensional output vector as the pooled feature vector.
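A brief sketch of both pooling options on an assumed random (L, 32) convolution feature vector:

```python
import numpy as np

H = np.random.default_rng(2).normal(size=(100, 32))  # convolution feature vector (L, 32)

max_pooled  = H.max(axis=0)    # max-pooling over the time dimension  -> (32,) pooled feature vector
mean_pooled = H.mean(axis=0)   # mean-pooling over the time dimension -> (32,) pooled feature vector
print(max_pooled.shape, mean_pooled.shape)
```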
In the case where only attention feature vectors exist and no pooled feature vectors exist, the attention feature vectors can be spliced, or fused in other ways such as arranging them in the order of the feature extraction modules, to obtain the fused vector.

The schemes in which pooled feature vectors exist can be roughly divided into two fusion schemes: pre-fusion and post-fusion.
In one example, the pre-fusion scheme may be a feature dimension reduction based fusion scheme.
Optionally, the step of fusing the attention feature vectors through a fusion submodule of the video classification model to obtain fused vectors may include:
fusing the attention feature vectors corresponding to the feature extraction modules and the pooled feature vectors through the feature fusion sub-modules corresponding to the feature extraction modules to obtain fused sub-vectors, and outputting the fused sub-vectors to the multi-scale feature fusion sub-modules;
and fusing the fused sub-vectors output by the feature fusion sub-modules through the multi-scale feature fusion sub-module to obtain fused vectors.
The feature fusion submodule can directly splice the attention feature vector and the pooled feature vector, and the spliced result is used as the fused vector. Or further processing can be carried out after splicing to obtain a fused vector.
Optionally, in an example, a specific scheme of obtaining a fusion sub-vector through each feature fusion sub-module includes:
splicing, through each feature fusion submodule, the corresponding attention feature vector and pooled feature vector to obtain a spliced vector, and transforming the spliced vector based on the first activation function to obtain a corresponding first transformed vector;
respectively transforming the corresponding first transformed vectors based on the second activation function through each feature fusion submodule to obtain corresponding first weight vectors;
determining a second weight vector based on the corresponding first weight vector through each feature fusion submodule, taking the first weight vector as the weight vector of the feature vector after pooling, taking the second weight vector as the weight vector of the attention feature vector to perform weighted summation to obtain a fusion sub-vector, and outputting the fusion sub-vector to the multi-scale feature fusion submodule, wherein vector elements at any identical position in the first weight vector and the second weight vector are added to be 1.
The pooled feature vectors obtained after the pooling operation focus more on the global features of the target video than the attention mechanism feature vectors do; therefore, combining the pooled feature vectors with the attention mechanism feature vectors yields fused vectors that provide good feature representations of both the global and local aspects of the video, which improves the accuracy of the classification result.

For example, assuming that the attention feature vector is input1 and the pooled feature vector is input2, the calculation of the output fusion sub-vector is shown in formulas (5)-(8).
Input = concat([input1, input2])    (5)

trans = Relu(Input·W3^T)    (6)

gate = Sigmoid(trans·W4^T)    (7)

output = input2⊙gate + input1⊙(1-gate)    (8)
The first activation function of this embodiment may be Relu in equation (6), or may be other available activation functions, and the second activation function may be Sigmoid in equation (7), or may be other available activation functions, which is not limited in this embodiment.
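The following sketch walks through formulas (5)-(8), with randomly initialized stand-ins for the learnable parameters W3 and W4; the shapes are assumptions chosen only so the arithmetic is consistent.

```python
import numpy as np

def relu(x):    return np.maximum(x, 0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
input1 = rng.normal(size=32)              # attention feature vector
input2 = rng.normal(size=32)              # pooled feature vector
W3 = rng.normal(size=(64, 64)) * 0.1      # assumed learnable parameters
W4 = rng.normal(size=(32, 64)) * 0.1

Input = np.concatenate([input1, input2])          # formula (5): spliced vector
trans = relu(Input @ W3.T)                        # formula (6): first transformed vector
gate  = sigmoid(trans @ W4.T)                     # formula (7): first weight vector
output = input2 * gate + input1 * (1.0 - gate)    # formula (8): fusion sub-vector
print(output.shape)                               # (32,)
```

The second weight vector 1-gate is obtained implicitly, so the elements of the two weight vectors at each position indeed sum to 1.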
In another example, the step "fusing, by the multi-scale feature fusion submodule, the fusion sub-vectors output by the feature fusion sub-modules to obtain fused vectors" may further include: and splicing the fusion sub-vectors output by the feature fusion sub-modules through the multi-scale feature fusion sub-module to obtain a fused vector.
That is, the feature fusion results of each scale (convolution window width) are directly spliced to form the multi-scale high-level features. It will be appreciated that, whether only one of the features obtained by the maximum pooling and attention mechanism operations is taken or both are fused, the resulting fused vector is an N x 32 fused vector, since there are convolution kernels of N scales in total.
In one example, the post-fusion scheme may be a fusion scheme that keeps the feature dimensions unchanged.
For example, in the fusion submodule, the attention mechanism feature and the maximum pooling output of each scale are directly retained; that is, the attention feature vector and the pooled feature vector of each of the N scales are kept, forming 2N 32-dimensional feature vectors, and these 2N x 32 feature vectors are directly and simply spliced.
In an alternative example, the fusion may be performed by using a feature recalibration method. Feature recalibration can be understood as learning the importance degrees of different features in the input features, enhancing the features that are important to the classification task and suppressing the features that are not useful to it.
Optionally, the video classification model further includes pooling modules corresponding to the feature extraction modules, and the fusion sub-module includes a feature recalibration module.
The video classification method of the embodiment further includes:
inputting the convolution feature vectors output by each feature extraction module into the corresponding pooling module;
and performing pooling operation on each fourth-direction quantum sequence in the convolution feature vector through a pooling module to obtain a pooled feature vector corresponding to the convolution feature vector.
The step of fusing the attention feature vectors through a fusion submodule of the video classification model to obtain fused vectors may include:
inputting each pooled feature vector and each attention feature vector into a feature recalibration module as a feature vector to be processed;
determining the importance weight of each feature vector to be processed through the feature recalibration module;
weighting each feature vector to be processed based on importance weight through a feature recalibration module to obtain weighted feature vectors;
and performing corresponding vector element addition operation on the weighted feature vector, the pooled feature vector and the attention feature vector through a feature recalibration module to obtain a fused vector.
Wherein, the pooling operation may be a maximum pooling operation, which is not described herein.
In this embodiment, the feature recalibration module can further integrate the outputs of the multi-scale feature extraction modules: for the 2N feature vectors output after the attention mechanism and pooling operations, it recalibrates the weight of each of the 2N feature vectors according to the overall global information, so as to strengthen the information from convolution kernels of significant widths and reduce the weights of redundant feature maps and the like.

Meanwhile, the original feature map information (the 2N feature vectors) can still be added to the recalibrated result; this can be realized in the form of a residual connection, which avoids excessive loss of the original input information, a loss that would be unfavorable to the learning of subsequent structures. Moreover, the residual structure also helps to better train and optimize the model parameters.
Wherein, the step of determining the importance weight of each feature vector to be processed through the feature re-calibration module may include:
forming, through the feature recalibration module, the feature vectors to be processed into a feature vector matrix to be processed, wherein the dimensions of the matrix include a feature type dimension;
performing pooling operation on the feature vector matrix to be processed through the feature recalibration module based on the feature type dimension to obtain a global feature matrix, or performing full-connection layer processing on the feature vector matrix to be processed through the feature recalibration module based on the feature type dimension to obtain the global feature matrix;
and calculating the importance weight of each feature vector to be processed through a feature re-calibration module based on the mapping parameters and the global feature matrix.
For example, referring to fig. 2d, a schematic diagram of the structure and data processing of the feature recalibration module of the present embodiment is shown.
The feature vector recalibration structure is extended from the residual SE (Squeeze-and-Excitation) structure used on three-dimensional feature maps in the field of image recognition, and can consider the importance of different kinds of features from a global perspective. The SE structure can be viewed as a special case of a feature fusion structure.

In the field of image recognition, the general procedure of feature vector recalibration can be divided into three steps: first, obtaining global context features by applying operations such as average pooling or a fully connected layer to the feature vector; second, calculating the importance degree of each feature in the feature vector by operations such as Sigmoid or Softmax; and third, performing feature fusion.
In the embodiment, a feature re-calibration scheme is provided for fusing a plurality of one-dimensional feature vectors, instead of using an SE structure on a multi-channel two-dimensional feature map in an image.
As shown in fig. 2d, compared with the SE structure, in the structure of the feature recalibration module in this embodiment, when feature fusion is performed at last, an add operation of residual connection is further added, that is, a residual module is added, and the final fused vector is output by the residual module.
The specific process of obtaining the fused vector is described below with reference to fig. 2 d.
And inputting each pooled feature vector and the attention feature vector into the feature recalibration module as a feature vector to be processed, wherein the feature vector to be processed input into the feature recalibration module forms a feature vector matrix (2N, 32) to be processed.
First, feature transformation is respectively carried out on the 2N 32-dimensional feature vectors to be processed through a fully connected network layer with shared parameters, obtaining 2N new, further transformed 32-dimensional feature vectors to be processed. Then, average pooling is performed along the feature-vector-type dimension (corresponding to 2N); that is, each of the 2N rows of feature vectors to be processed is average-pooled by row, yielding a global feature matrix (2N, 1) containing global context features.
Then, the global feature matrix is input into a fully-connected network layer with an activation function, and the global feature matrix (2N, 1) is converted into a new feature matrix (2N, 1) through mapping of the fully-connected network layer and nonlinear processing of the activation function (such as Relu).
The new feature matrix (2N, 1) is then input into a layer of fully connected network with activation function (Sigmoid or Softmax, etc.). The importance degree of each feature in the 2N features can be calculated through operations such as Sigmoid or Softmax, and a weight matrix (2N, 1) composed of the importance weights of each feature is obtained.
That is, through these two nonlinear layers, the 2N features are considered jointly from a global perspective and 2N importance weights are obtained.

Then, the feature vector matrix (2N, 32) to be processed is weighted based on the weight matrix (2N, 1) to obtain the weighted feature vectors; that is, the 2N weights are multiplied with the obtained 2N feature vectors to be processed, recalibrating the importance of the 2N feature vectors to be processed, highlighting the important ones and weakening the useless ones.

Finally, the design of the residual module avoids excessive loss of the original information of the feature vectors to be processed and helps to better optimize the network parameters: a corresponding element addition operation is performed on the originally input 2N feature vectors (namely the feature vector matrix (2N, 32) to be processed) and the recalibrated weighted feature vector, whose dimension is also (2N, 32), to obtain 2N high-level feature vectors after global consideration over the feature type dimension 2N.
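The following sketch mirrors the recalibration flow of fig. 2d under stated assumptions (N = 4 scales, random stand-ins for all learnable parameters). Where the text leaves it open whether the importance weights multiply the transformed or the original vectors, the sketch weights the transformed vectors and adds the original matrix back as the residual.

```python
import numpy as np

def relu(x):    return np.maximum(x, 0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

N = 4                                        # assumed number of scales
rng = np.random.default_rng(4)
F = rng.normal(size=(2 * N, 32))             # feature vector matrix to be processed (2N, 32)
W_share = rng.normal(size=(32, 32)) * 0.1    # shared fully connected transform
W1 = rng.normal(size=(2 * N, 2 * N)) * 0.1   # fully connected layer followed by Relu
W2 = rng.normal(size=(2 * N, 2 * N)) * 0.1   # fully connected layer followed by Sigmoid

F_t = F @ W_share                            # per-vector feature transformation, still (2N, 32)
g = F_t.mean(axis=1, keepdims=True)          # average pooling per row -> global feature matrix (2N, 1)
g = relu(W1 @ g)                             # nonlinear mapping -> new feature matrix (2N, 1)
w = sigmoid(W2 @ g)                          # importance weights of the 2N feature vectors (2N, 1)

recalibrated = F_t * w                       # weight re-calibration of the 2N vectors
fused = recalibrated + F                     # residual addition of the original 2N vectors
print(fused.shape)                           # (2N, 32) fused vector
```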
In one example, the classification submodule may include a classification layer and a vector filter fusion layer designed based on a highway layer structure. The highway structure is a learnable gating mechanism that can automatically assign weights to the gates between connected layers during training, automatically discard unimportant connected layers, and automatically determine how many connected layers are needed, which helps to alleviate the difficulty of training deep networks.
The vector filtering fusion layer can perform element-level gating filtering and global feature fusion on the fused vector output by the fusion layer, which is equivalent to performing further gating filtering and feature transformation on the output result of each convolution kernel. After all, different convolution window widths are hyper-parameters, so that multi-scale features and feature elements of each dimension can be further adaptively fused, and the result is more stable.
Optionally, the step of determining the classification result of the target video according to the fused vector by using a classification submodule of the video classification model may include:
through the vector filtering fusion layer, transforming the fused vector based on a third activation function to obtain a third weight vector, and transforming the fused vector based on a fourth activation function to obtain a second transformed vector;
obtaining a fourth weight vector based on the third weight vector through a vector filtering and fusing layer, wherein vector elements at any identical position in the third weight vector and the fourth weight vector are added to be 1;
through a vector filtering fusion layer, taking the third weight vector as the weight vector of the second transformed vector, and taking the fourth weight vector as the weight vector of the fused vector to carry out weighted summation to obtain a classification vector;
through the classification layer, a classification result of the target video is determined based on the classification vector.
The formula of the vector filter fusion layer is as follows:
gate = sigmoid(Input·W3^T)    (9)

trans = tanh(Input·W4^T)    (10)

output = trans⊙gate + Input⊙(1-gate)    (11)
wherein Input is the fused vector, namely a 2N x 32 dimensional feature vector, W3 and W4 are the matrices included in the calculation formulas, sigmoid is the third activation function, gate is the third weight vector, 1-gate is the fourth weight vector, tanh is the fourth activation function, trans is the second transformed vector, and output is the classification vector.
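A short sketch of formulas (9)-(11), with W3 and W4 replaced by random matrices and the fused vector dimension 2N x 32 assumed to be 256:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

d = 2 * 4 * 32                                    # assumed 2N*32 with N = 4
rng = np.random.default_rng(5)
Input = rng.normal(size=d)                        # fused vector
W3 = rng.normal(size=(d, d)) * 0.01
W4 = rng.normal(size=(d, d)) * 0.01

gate  = sigmoid(Input @ W3.T)                     # formula (9): third weight vector
trans = np.tanh(Input @ W4.T)                     # formula (10): second transformed vector
output = trans * gate + Input * (1.0 - gate)      # formula (11): classification vector
print(output.shape)
```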
Optionally, in this embodiment, the classification task of the video classification model includes a video anomaly identification task, and the step "determining the classification result of the target video based on the classification vector through the classification layer" may include:
carrying out nonlinear transformation and dimension conversion on the classification vector through a fully connected layer in the classification layer to obtain a new classification vector;
and outputting a classification result of the target video on the classification task based on the new classification vector through a full-connection classification layer in the classification layers.
In one example, the classification task includes a video anomaly identification task, and the step "determining a classification result of the target video according to the fused vector by a classification submodule of the video classification model" may include:
and determining the abnormal video probability of the target video according to the fused vector through a classification submodule of the video classification model.
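As a hedged sketch of this final stage (the weights and dimensions are placeholders, not values from the embodiment), the classification vector can pass through a fully connected layer and a fully connected classification layer whose second softmax output is read as the abnormal video probability:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(6)
cls_vec = rng.normal(size=256)               # classification vector from the vector filter fusion layer
W_fc  = rng.normal(size=(64, 256)) * 0.05    # fully connected layer: nonlinearity + dimension conversion
W_out = rng.normal(size=(2, 64)) * 0.05      # fully connected classification layer (normal / abnormal)

new_cls_vec = np.maximum(W_fc @ cls_vec, 0)  # new classification vector
probs = softmax(W_out @ new_cls_vec)
abnormal_prob = probs[1]                     # abnormal video probability of the target video
print(abnormal_prob)
```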
In this embodiment, the training of the video classification model may be implemented based on a sample image sequence, where the sample image sequence includes a plurality of images, and a label of the sample image sequence includes an expected classification result of a sample video to which the sample image sequence belongs on a classification task.
The method for training the video classification model based on the sample image sequence comprises the following steps:
acquiring a sample image sequence, wherein the sample image sequence comprises N images, the sample image sequence is derived from a sample video, and N is a positive integer greater than or equal to 1;
identifying the state information of each image in the sample image sequence on a preset scene state dimension to obtain a state information subsequence of each image, and obtaining a characteristic sequence of the sample image sequence based on the state information subsequence of each image;
performing convolution operation on the feature sequence through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors;
respectively carrying out weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector;
determining a classification result of the sample video according to the attention feature vectors through the classification module of the video classification model;
determining a classification loss based on the classification result and an expected classification result of the sample image sequence;
parameters of the video classification model are adjusted based on the classification loss.
The detailed steps of each step in the method may refer to the related steps of the actual identification process of the video classification model in the foregoing, and are not described herein again.
The classification loss may be a Softmax-based two-class cross-entropy loss, and this embodiment may adopt the Adam algorithm to optimize the parameters of each layer of the model, where the learning rate may be set to 0.0001. In one example, to avoid overfitting, L1 and L2 regularization may be added to the weight parameters of the last fully connected layer in the classification layer.
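For illustration only, the Softmax-based two-class cross entropy used as the classification loss can be computed as follows; the batch values are made up, and the Adam parameter update itself is omitted:

```python
import numpy as np

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Two-class cross entropy over softmax outputs, averaged over the batch.
    probs: (B, 2) softmax outputs; labels: (B,) expected classes in {0, 1}."""
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return -np.log(p).mean()

probs  = np.array([[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]])  # model outputs for three sample videos
labels = np.array([0, 1, 0])                              # expected classification results
print(binary_cross_entropy(probs, labels))                # classification loss used to adjust the parameters
```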
The applicant also verified the video classification model of this embodiment through experiments.

In the experimental environment, the hardware platform was an Intel Core(TM) i7-8700 CPU @ 3.6GHz processor, 16G memory, a 256G solid-state disk and a STRIX-GTX1080TI-11G graphics card. The software platform was a 64-bit operating system based on Windows 10, with Python 2.7 and TensorFlow 1.8.
The network structure is the classification structure based on the multidimensional vector sequence in fig. 2c; the parameters and output dimensions of the respective modules are explained as follows.
A single video's features are preprocessed into a sequence of length L (L time points) with an M-dimensional vector at each time point, and this single sample of dimension (L, M) is used as the input sample. The specific structural parameters and output results of the whole video classification model are shown in Table 1 (some Dropout and regularization auxiliary processes used to avoid overfitting are not represented in the table); the specific structural parameters and output results of the feature recalibration module are shown in Table 2.
TABLE 1 network parameter Table of video classification model
Table 2 network parameter table of characteristic recalibration module
The pure black-to-white sample ratio after manual review was 6102:600. Evaluated based on Top N and on the black-to-white ratio at 80% recall, the predicted effects of the models under each combination are shown in Table 3 below,
wherein 3layer cnn represents the multi-scale 3-layer one-dimensional convolution structure;

se represents the feature recalibration structure described above;

max represents the maximum pooling operation described above applied to the convolution output of the last layer;

attention represents the attention mechanism described above applied to the convolution output of the last layer;

two represents combining the two output forms of maximum pooling and the attention mechanism;

two_last represents the post-fusion of the features (the last two rows in Table 3).
From the first four rows in Table 3, it can be seen that, regardless of whether the residual convolution operation is present, adding an attention mechanism layer to the convolution output features works better than the maximum-pooling convolution feature fusion method, which shows that the attention mechanism is effective for fusing the convolution features of a single scale.
Meanwhile, for the convolution feature fusion modes that combine maximum pooling and the attention mechanism, the first point is that a comparison of the pre-fusion methods that reduce the dimensions of the two different features shows that gate (weighted addition of features) works better than the dense (features passed through a fully connected layer) and sum (element-wise addition of features) fusion modes; that is, after the two features are extracted from the convolution output of each scale by maximum pooling and the attention mechanism, performing a weighted addition of corresponding elements works best.
The second point is that a comparison of the post-fusion methods that keep the dimensions of the two different features unchanged (last two rows in Table 3) shows that fusing the multiple convolution features based on the feature recalibration module works better than directly splicing the features.

The third point is that, compared with the post-fusion mode, which first aggregates the two features extracted from the convolution outputs of all scales and then fuses them based on the feature recalibration module, the pre-fusion mode, which performs weighted fusion on the two features extracted from each convolution output, works better. It can be seen that pre-fusion is better.
Finally, from the overall effect comparison in Table 3, it can be seen that adding an attention layer to the convolution output features works better than the other methods. Even when the two features of maximum pooling and the attention mechanism are combined with different feature fusion modes, no effect better than that of the attention mechanism alone is obtained.
From this, it can be seen that the attention mechanism operation contains to some extent characteristic information of the maximum pooling operation.
TABLE 3 comparison of TOPN Effect of CNN models of different structures
Meanwhile, the last two layers of the 3-layer CNN structure in the video classification model of fig. 2c were replaced with a gated convolution structure; the specific gated CNN principle is as described above. From the models in the first 4 rows of Table 4, it can be seen that, consistent with the conclusion from Table 3, adding an attention mechanism layer to the convolution output features of the gated cnn works better than the maximum-pooling convolution feature fusion mode, regardless of whether the residual convolution operation is present.
Meanwhile, for the gated cnn convolution feature fusion modes that combine maximum pooling and the attention mechanism, the first point is that a comparison of the pre-fusion methods that reduce the dimensions of the two different features shows that sum (element-wise addition of features) works better than the dense (features passed through a fully connected layer) and gate (weighted addition of features) fusion modes. The second point is that a comparison of the post-fusion methods that keep the dimensions of the two different features unchanged (last two rows in Table 4) shows that fusing the multiple convolution features based on the feature recalibration module works worse than directly splicing the features. These two points differ from the results obtained with the models based on the classical CNN structure in Table 3.

It can be seen that, for the maximum pooling feature and the attention mechanism feature obtained with the gated cnn structure, direct addition works better for pre-fusion and direct splicing works better for post-fusion, and no complicated feature fusion mode is needed afterwards.
TABLE 4 comparison of TOPN Effect of Gated CNN models of different structures
Furthermore, comparing Table 3 and Table 4, it can be seen that when there is no residual connection and only the attention-weighted convolution feature of the last convolution layer's output is taken, the model based on the classical CNN structure works better than the model based on the Gated CNN structure. However, as can be seen from Table 5 below, when the residual connection of the model structure of fig. 2c is present and the maximum pooling and attention mechanism features of the convolution output are to be fused, the model based on the Gated CNN structure works better than the classical CNN structure.
TABLE 5 TOPN Effect comparison of different CNN models in combination with residual
A comprehensive comparison of Tables 3-5 shows that the multi-scale classical CNN structural model combined with the attention mechanism works best. On this test set, random guessing gives an accuracy of about 0.09, whereas for the model proposed in this application, when the top200 of the predicted results is taken, the coverage is 0.246 and the accuracy is 0.78; when the top400 is taken, the coverage is 0.426 and the accuracy is 0.675; and when the top600 is taken, the coverage is 0.531 and the accuracy is 0.57. At a coverage of 0.80, the accuracy is 0.33. Compared with random guessing at the same coverage of about 50%, the accuracy is improved by about 7 times.
In this embodiment, the structures and compositions of the feature extraction layer and the fusion layer, etc. may be freely selected according to actual requirements and experimental results, which is not limited in this embodiment.
By adopting the video classification scheme of this embodiment, scene state information can be identified from the frame images, a multi-dimensional vector sequence is then constructed, and finally the improved multi-scale classical CNN sequence model combined with an attention mechanism is used for anomaly judgment. Through this embodiment, a video can be assigned a probability of being an abnormal video, so that highly suspicious videos can be preferentially checked manually, which improves the efficiency of manual judgment. Meanwhile, this embodiment allows more abnormal videos to be found with limited manpower.
In order to better implement the method, correspondingly, the embodiment of the invention also provides a video classification device, which may be specifically integrated in a terminal or a server.
Referring to fig. 3, the apparatus includes:
an image sequence acquiring unit 301, configured to acquire a target image sequence, where the target image sequence includes N images, where the target image sequence is derived from a target video, and N is a positive integer greater than or equal to 1;
a feature sequence obtaining unit 302, configured to identify state information of each image in the target image sequence in a preset scene state dimension, to obtain a state information subsequence of each image, and obtain a feature sequence of the target image sequence based on the state information subsequence of each image;
the convolution unit 303 is configured to perform convolution operation on the feature sequence through the multiple feature extraction modules of the video classification model to obtain corresponding convolution feature vectors;
the attention mechanism unit 304 is configured to perform weighted summation on each convolution feature vector according to the corresponding attention weight matrix through an attention mechanism module of the video classification model, so as to obtain an attention feature vector corresponding to each convolution feature vector;
and the classification unit 305 is configured to determine a classification result of the target video according to the attention feature vector through a classification module of the video classification model.
In an optional example, the video classification model further comprises an embedding module, the feature sequence has a time dimension, and the state information subsequences in the feature sequence are arranged according to the time dimension;
the video classification apparatus further includes: an embedding unit configured to:
before convolution operation is carried out on the feature sequences through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors, the feature sequences are mapped on the basis of time dimension through an embedding module of the video classification model to obtain embedded feature vectors with time dimension;
through a plurality of feature extraction modules of the video classification model, the feature sequence is subjected to convolution operation to obtain corresponding convolution feature vectors, and the method comprises the following steps:
and carrying out convolution operation on the embedded feature vectors through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors.
In an alternative example, the embedding module includes an embedding fusion layer and a first embedding layer and a second embedding layer, an embedding unit to:
vector embedding is carried out on the state information in the state information subsequence of each image through the first embedding layer, and a first vector subsequence corresponding to each state information subsequence is obtained;

respectively converting the state information in each state information subsequence into a one-hot vector through the second embedding layer, then carrying out vector splicing to obtain a spliced vector, and carrying out vector embedding on the spliced vector corresponding to each state information subsequence to obtain a second vector subsequence corresponding to each state information subsequence;

and fusing the first vector subsequence and the second vector subsequence corresponding to each state information subsequence based on the time dimension through the embedding fusion layer to obtain an embedded feature vector having the time dimension.
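As a sketch of the two embedding branches under assumed vocabulary sizes and embedding dimensions (none taken from the embodiment), with element-wise addition standing in for the unspecified fusion operation of the embedding fusion layer, for a single time point:

```python
import numpy as np

rng = np.random.default_rng(7)
num_values = [5, 4, 6]       # assumed value counts of three preset scene states
states = [2, 0, 5]           # state information subsequence of one image

# First embedding layer: embed each state value and combine -> first vector subsequence.
E = [rng.normal(size=(n, 8)) * 0.1 for n in num_values]
first_sub = np.concatenate([E[i][v] for i, v in enumerate(states)])   # (24,)

# Second embedding layer: one-hot each state, splice, then embed the spliced vector.
one_hots = [np.eye(n)[v] for n, v in zip(num_values, states)]
spliced = np.concatenate(one_hots)                                    # (15,)
W_cat = rng.normal(size=(24, spliced.size)) * 0.1
second_sub = W_cat @ spliced                                          # (24,) second vector subsequence

# Embedding fusion layer (addition assumed): embedded feature vector for this time point.
embedded = first_sub + second_sub
print(embedded.shape)
```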
In an optional example, one feature extraction module includes at least two feature extraction layers connected in sequence, and the feature extraction layers of different feature extraction modules have different convolution window widths;
a convolution unit to:
in each feature extraction module, performing one-dimensional convolution operation on the embedded feature vectors based on the connection sequence of the feature extraction layers and the width of a convolution window contained in the feature extraction module to obtain convolution vectors corresponding to each feature extraction layer;
and obtaining the convolution characteristic vectors corresponding to the characteristic extraction modules based on the convolution vectors of the characteristic extraction layers in the characteristic extraction modules.
In an alternative example, a convolution unit to:
performing convolution on the vector input into the current feature extraction layer through the current feature extraction layer in each feature extraction module to obtain a convolution vector of the current feature extraction layer, and inputting the convolution vector into the next feature extraction layer that follows the current feature extraction layer in the connection sequence, until there is no next feature extraction layer;

and if the current feature extraction layer is the first feature extraction layer in the connection sequence, the vector input into the current feature extraction layer is the embedded feature vector.
In one optional example, the types of feature extraction layers include convolutional layers and gated convolutional layers;
a convolution unit to:
when the current feature extraction layer in the feature extraction module is a convolution layer, performing convolution on the vector input to the current feature extraction layer through the current feature extraction layer to obtain a convolution vector of the current feature extraction layer, and inputting the convolution vector to the next feature extraction layer that follows the current feature extraction layer in the connection sequence, until there is no next feature extraction layer;

when the current feature extraction layer in the feature extraction module is a gated convolution layer, performing convolution on the vector input into the current feature extraction layer through the current feature extraction layer according to a first convolution kernel parameter to obtain a first convolution sub-vector, performing convolution on the vector input into the current feature extraction layer according to a second convolution kernel parameter to obtain a second convolution sub-vector, converting the first convolution sub-vector based on a conversion function, multiplying the converted first convolution sub-vector by the second convolution sub-vector element by element to obtain the convolution vector of the current feature extraction layer, and inputting the convolution vector to the next feature extraction layer that follows the current feature extraction layer in the connection sequence, until there is no next feature extraction layer;
if the current feature extraction layer is the first feature extraction layer in the corresponding feature extraction module in the connection sequence, the vector input to the current feature extraction layer is the embedded feature vector.
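The gated convolution layer described above can be sketched as follows, assuming sigmoid as the conversion function and naive 1-D convolutions with random kernels standing in for the first and second convolution kernel parameters:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, k):
    """Naive 1-D convolution with 'same' padding. x: (L, C_in); k: (width, C_in, C_out)."""
    width = k.shape[0]
    pad = width // 2
    xp = np.pad(x, ((pad, width - 1 - pad), (0, 0)))
    return np.stack([np.einsum('wc,wco->o', xp[t:t + width], k) for t in range(x.shape[0])])

rng = np.random.default_rng(8)
x  = rng.normal(size=(100, 32))           # vector input to the gated convolution layer
ka = rng.normal(size=(3, 32, 32)) * 0.1   # first convolution kernel parameter
kb = rng.normal(size=(3, 32, 32)) * 0.1   # second convolution kernel parameter

a = conv1d_same(x, ka)                    # first convolution sub-vector
b = conv1d_same(x, kb)                    # second convolution sub-vector
out = sigmoid(a) * b                      # convert the first sub-vector, multiply element-wise
print(out.shape)                          # (100, 32) convolution vector of the gated convolution layer
```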
In an optional example, the feature extraction module further comprises a residual addition layer;
a convolution unit to:
performing convolution on a vector input into the current feature extraction layer through the current feature extraction layer in each feature extraction module to obtain a convolution vector of the current feature extraction layer, and inputting the convolution vector into the residual addition layer corresponding to the current feature extraction layer and into the residual addition layer corresponding to the feature extraction layer that follows the current feature extraction layer in the connection sequence, wherein if the current feature extraction layer is the first feature extraction layer in the connection sequence of the corresponding feature extraction module, the vector input into the current feature extraction layer is the embedded feature vector;

and summing, through the residual addition layer corresponding to the current feature extraction layer, the convolution vector output by the current feature extraction layer and the convolution vector output by the feature extraction layer that precedes the current feature extraction layer in the connection sequence, and inputting the vector obtained by the summation to the feature extraction layer that follows the current feature extraction layer in the connection sequence.
In an optional example, the attention mechanism module includes an attention weight matrix corresponding to each feature extraction module, and each attention weight matrix includes an attention weight corresponding to a vector subsequence of each time point of the convolution feature vector in the time dimension;
and the attention mechanism unit is used for performing weighted summation on the vector subsequences of the convolution feature vectors output by the feature extraction modules at each time point in the time dimension according to the attention weights corresponding to the vector subsequences at each time point through the attention weight matrix corresponding to each feature extraction module in the attention mechanism module to obtain the attention feature vectors of each convolution feature vector.
In an optional example, the classification module includes a fusion sub-module and a classification sub-module;
the classification unit is used for fusing each attention characteristic vector through a fusion submodule of the video classification model to obtain a fused vector;
and determining a classification result of the target video according to the fused vector through a classification submodule of the video classification model.
In an alternative example, the convolution feature vector has a time dimension and a feature class dimension, the third vector subsequence of the convolution feature vector at each time point of the time dimension is composed of the vectors corresponding, at that time point, to each feature class of the feature class dimension, and the fourth vector subsequence of the convolution feature vector at each feature class of the feature class dimension is composed of the vectors corresponding, for that feature class, to each time point of the time dimension; the video classification model further includes pooling modules corresponding to the feature extraction modules, and the fusion sub-module includes a multi-scale feature fusion sub-module and a feature fusion sub-module;
the apparatus further comprises a first pooling unit for:
inputting the convolution feature vectors output by each feature extraction module into the corresponding pooling module;
performing a pooling operation on each fourth vector subsequence in the convolution feature vector through the pooling module to obtain a pooled feature vector corresponding to the convolution feature vector;
a fusion unit to:
fusing the attention feature vectors corresponding to the feature extraction modules and the pooled feature vectors through the feature fusion sub-modules corresponding to the feature extraction modules to obtain fused sub-vectors, and outputting the fused sub-vectors to the multi-scale feature fusion sub-modules;
and fusing the fused sub-vectors output by the feature fusion sub-modules through the multi-scale feature fusion sub-module to obtain fused vectors.
In an optional example, the fusion unit is configured to:
splicing the attention feature vectors corresponding to the feature extraction modules and the pooled feature vectors through the feature fusion sub-modules corresponding to the feature extraction modules to obtain spliced vectors;
respectively transforming the corresponding splicing vectors based on the first activation function through each feature fusion submodule to obtain corresponding first transformed vectors;
respectively transforming the corresponding first transformed vectors based on the second activation function through each feature fusion submodule to obtain corresponding first weight vectors;
determining a second weight vector based on the corresponding first weight vector through each feature fusion submodule, taking the first weight vector as the weight vector of the feature vector after pooling, taking the second weight vector as the weight vector of the attention feature vector to perform weighted summation to obtain a fusion sub-vector, and outputting the fusion sub-vector to the multi-scale feature fusion submodule, wherein vector elements at any identical position in the first weight vector and the second weight vector are added to be 1.
In an alternative example, the convolution feature vector has a time dimension and a feature class dimension, the third vector subsequence of the convolution feature vector at each time point of the time dimension is composed of the vectors corresponding, at that time point, to each feature class of the feature class dimension, and the fourth vector subsequence of the convolution feature vector at each feature class of the feature class dimension is composed of the vectors corresponding, for that feature class, to each time point of the time dimension; the video classification model further includes pooling modules corresponding to the feature extraction modules, and the fusion sub-module includes a feature re-calibration module;
the apparatus further comprises a second pooling unit for:
inputting the convolution feature vectors output by each feature extraction module into the corresponding pooling module;
performing a pooling operation on each fourth vector subsequence in the convolution feature vector through the pooling module to obtain a pooled feature vector corresponding to the convolution feature vector;
a fusion unit to:
inputting each pooled feature vector and each attention feature vector into a feature recalibration module as a feature vector to be processed;
determining the importance degree of each feature vector to be processed through a feature re-calibration module to obtain the importance weight of each feature vector to be processed;
weighting each feature vector to be processed based on importance weight through a feature recalibration module to obtain weighted feature vectors;
and performing corresponding vector element addition operation on the weighted feature vector, the pooled feature vector and the attention feature vector through a feature recalibration module to obtain a fused vector.
In an optional example, the classification submodule includes a vector filter fusion layer and a classification layer;
a classification unit to:
through the vector filtering fusion layer, transforming the fused vector based on a third activation function to obtain a third weight vector, and transforming the fused vector based on a fourth activation function to obtain a second transformed vector;
obtaining a fourth weight vector based on the third weight vector through a vector filtering and fusing layer, wherein vector elements at any identical position in the third weight vector and the fourth weight vector are added to be 1;
through a vector filtering fusion layer, taking the third weight vector as the weight vector of the second transformed vector, and taking the fourth weight vector as the weight vector of the fused vector to carry out weighted summation to obtain a classification vector;
through a classification layer, a classification result of the target video on a classification task is determined based on the classification vector.
In an optional example, the classification tasks of the video classification model include a video anomaly identification task;
a classification unit to:
and determining the abnormal video probability of the target video according to the fused vector through a classification submodule of the video classification model.
With the device of this embodiment, the video classification problem is converted into a classification problem of multi-dimensional time series data, and in the classification of the multi-dimensional time series data, multi-scale feature extraction and an attention mechanism are combined, which can effectively improve the classification effect.
In addition, an embodiment of the present invention further provides a computer device, where the computer device may be a terminal or a server, as shown in fig. 4, which shows a schematic structural diagram of the computer device according to the embodiment of the present invention, and specifically:
the computer device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 4 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The computer device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 via a power management system, so that functions of managing charging, discharging, and power consumption are implemented via the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The computer device may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions as follows:
acquiring a target image sequence, wherein the target image sequence comprises N images, the target image sequence is derived from a target video, and N is a positive integer greater than or equal to 1;
identifying the state information of each image in the target image sequence on a preset scene state dimension to obtain a state information subsequence of each image, and obtaining a characteristic sequence of the target image sequence based on the state information subsequence of each image;
performing convolution operation on the feature sequence through a plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors;
respectively carrying out weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector;
and determining a classification result of the target video according to the attention feature vector through a classification module of the video classification model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present invention further provides a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the video classification method provided in the embodiment of the present invention.
According to an aspect of the application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations in the embodiments described above.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in the video classification method provided in the embodiment of the present invention, the beneficial effects that can be achieved by the video classification method provided in the embodiment of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The video classification method, apparatus, computer device and storage medium provided by the embodiments of the present invention are described in detail above, and the principles and embodiments of the present invention are explained herein by applying specific examples, and the descriptions of the above embodiments are only used to help understand the method and its core ideas of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (15)

1. A method of video classification, comprising:
acquiring a target image sequence, wherein the target image sequence comprises N images, the target image sequence is derived from a target video, and N is a positive integer greater than or equal to 1;
identifying the state information of each image in the target image sequence on a preset scene state dimension to obtain a state information subsequence of each image, and obtaining a feature sequence of the target image sequence based on the state information subsequence of each image;
performing convolution operation on the feature sequence through a plurality of feature extraction modules of a video classification model to obtain corresponding convolution feature vectors;
respectively carrying out weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model to obtain an attention feature vector corresponding to each convolution feature vector;
and determining a classification result of the target video according to the attention feature vector through a classification module of the video classification model.
2. The video classification method according to claim 1, wherein the video classification model further comprises an embedding module, the feature sequence has a time dimension, and the state information subsequences in the feature sequence are arranged according to the time dimension;
before performing convolution operation on the feature sequence through the plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors, the method further includes:
mapping the feature sequence based on the time dimension through an embedding module of the video classification model to obtain an embedded feature vector with the time dimension;
the performing convolution operation on the feature sequence through the plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors includes:
and performing convolution operation on the embedded feature vector through the plurality of feature extraction modules of the video classification model to obtain the corresponding convolution feature vectors.
3. The video classification method according to claim 2, wherein the embedding module includes a first embedding layer, a second embedding layer, and an embedding fusion layer, and the mapping the feature sequence based on the time dimension by the embedding module of the video classification model to obtain an embedded feature vector having a time dimension includes:
vector embedding is carried out on the state information in the state information subsequences of the images through the first embedding layer, and a first vector subsequence corresponding to each state information subsequence is obtained;
converting, through the second embedding layer, the state information in each state information subsequence into one-hot vectors and splicing the one-hot vectors to obtain a spliced vector corresponding to each state information subsequence, and performing vector embedding on the spliced vector corresponding to each state information subsequence to obtain a second vector subsequence corresponding to each state information subsequence;
and fusing, through the embedding fusion layer and based on the time dimension, the first vector subsequence and the second vector subsequence corresponding to each state information subsequence to obtain an embedded feature vector having the time dimension.
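As an illustrative sketch only (not claim language), one possible reading of this embedding module in PyTorch is given below; the state vocabulary size, the number of state dimensions and the embedding sizes are assumed values, and a linear layer over the spliced one-hot vectors stands in for the second embedding layer.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EmbeddingModule(nn.Module):
        # Sketch: per-dimension ID embeddings (first embedding layer) plus an
        # embedding of the spliced one-hot codes (second embedding layer),
        # fused time step by time step (embedding fusion layer).
        def __init__(self, num_dims=4, num_states=16, embed_dim=8):
            super().__init__()
            self.id_embed = nn.Embedding(num_states, embed_dim)                  # first embedding layer
            self.onehot_embed = nn.Linear(num_dims * num_states, embed_dim)      # second embedding layer
            self.fuse = nn.Linear(num_dims * embed_dim + embed_dim, embed_dim)   # embedding fusion layer
            self.num_states = num_states

        def forward(self, states):                   # (batch, time, num_dims) integer state codes
            e1 = self.id_embed(states).flatten(start_dim=2)        # first vector subsequences
            onehot = F.one_hot(states, self.num_states).float().flatten(start_dim=2)
            e2 = self.onehot_embed(onehot)                         # second vector subsequences
            return self.fuse(torch.cat([e1, e2], dim=-1))          # embedded feature vector, (batch, time, embed_dim)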
4. The video classification method according to claim 2, wherein one feature extraction module comprises at least two feature extraction layers connected in sequence, and the convolution window widths of the feature extraction layers of different feature extraction modules are different;
the performing convolution operation on the embedded feature vector through the plurality of feature extraction modules of the video classification model to obtain corresponding convolution feature vectors includes:
in each feature extraction module, performing one-dimensional convolution operation on the embedded feature vectors based on the connection sequence of the feature extraction layers and the convolution window width contained in the feature extraction module to obtain convolution vectors corresponding to the feature extraction layers;
and obtaining the convolution feature vector corresponding to each feature extraction module based on the convolution vectors of the feature extraction layers in that feature extraction module.
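A minimal sketch of one such feature extraction module follows, assuming two stacked 1-D convolutional layers per module and example window widths of 2, 3 and 5; all sizes are assumptions rather than the claimed configuration.

    import torch.nn as nn

    class FeatureExtractionModule(nn.Module):
        # Sketch of one branch: a stack of 1-D convolutions sharing a single window
        # width; different branches are built with different window widths.
        def __init__(self, in_dim=64, channels=128, window_width=3, num_layers=2):
            super().__init__()
            layers = []
            for i in range(num_layers):
                layers.append(nn.Conv1d(in_dim if i == 0 else channels, channels,
                                        kernel_size=window_width, padding=window_width // 2))
                layers.append(nn.ReLU())
            self.layers = nn.Sequential(*layers)

        def forward(self, x):            # x: (batch, in_dim, time) embedded feature vector
            return self.layers(x)        # convolution feature vector, (batch, channels, time')

    # One module per window width, e.g. widths 2, 3 and 5:
    branches = nn.ModuleList([FeatureExtractionModule(window_width=k) for k in (2, 3, 5)])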
5. The video classification method according to claim 4, wherein, in each of the feature extraction modules, performing one-dimensional convolution operation on the embedded feature vectors based on a connection order of feature extraction layers included in the feature extraction module and a convolution window width to obtain convolution vectors corresponding to each feature extraction layer, includes:
performing, through a current feature extraction layer in each feature extraction module, convolution on a vector input to the current feature extraction layer to obtain a convolution vector of the current feature extraction layer, and inputting the convolution vector to a next feature extraction layer that follows the current feature extraction layer in the connection order, until no such next feature extraction layer exists;
wherein, if the current feature extraction layer is the first feature extraction layer in the connection order, the vector input to the current feature extraction layer is the embedded feature vector.
6. The video classification method according to claim 4, wherein the types of the feature extraction layers include a convolutional layer and a gated convolutional layer;
in each of the feature extraction modules, performing one-dimensional convolution operation on the embedded feature vector based on a connection order and a convolution window width of the feature extraction layers included in the feature extraction module to obtain a convolution vector corresponding to each feature extraction layer, including:
when a current feature extraction layer in the feature extraction module is a convolutional layer, performing, through the current feature extraction layer, convolution on the vector input to the current feature extraction layer to obtain a convolution vector of the current feature extraction layer, and inputting the convolution vector to a next feature extraction layer that follows the current feature extraction layer in the connection order, until no such next feature extraction layer exists;
when the current feature extraction layer in the feature extraction module is a gated convolutional layer, performing, through the current feature extraction layer, convolution on the vector input to the current feature extraction layer according to a first convolution kernel parameter to obtain a first convolution sub-vector, performing convolution on the vector input to the current feature extraction layer according to a second convolution kernel parameter to obtain a second convolution sub-vector, transforming the first convolution sub-vector based on a transformation function, multiplying the transformed first convolution sub-vector by the second convolution sub-vector element by element to obtain the convolution vector of the current feature extraction layer, and inputting the convolution vector to the next feature extraction layer that follows the current feature extraction layer in the connection order, until no such next feature extraction layer exists;
and if the current feature extraction layer is the first feature extraction layer in the corresponding feature extraction module in the connection sequence, the vector input to the current feature extraction layer is the embedded feature vector.
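The gated convolutional layer can be sketched as follows, assuming a sigmoid as the transformation function (the claim does not fix a particular function) and arbitrary channel counts; this is an illustration, not the claimed implementation.

    import torch
    import torch.nn as nn

    class GatedConvLayer(nn.Module):
        # Sketch: two parallel 1-D convolutions (first and second convolution
        # kernel parameters); the first branch is passed through a sigmoid and
        # multiplied element-wise with the second branch.
        def __init__(self, in_channels=64, out_channels=128, window_width=3):
            super().__init__()
            self.gate_conv = nn.Conv1d(in_channels, out_channels, window_width, padding=window_width // 2)
            self.value_conv = nn.Conv1d(in_channels, out_channels, window_width, padding=window_width // 2)

        def forward(self, x):                         # (batch, in_channels, time)
            gate = torch.sigmoid(self.gate_conv(x))   # transformed first convolution sub-vector
            value = self.value_conv(x)                # second convolution sub-vector
            return gate * value                       # element-wise product = layer output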
7. The video classification method according to claim 4, wherein the feature extraction module further comprises a residual addition layer;
in each of the feature extraction modules, performing one-dimensional convolution operation on the embedded feature vector based on a connection order and a convolution window width of the feature extraction layers included in the feature extraction module to obtain a convolution vector corresponding to each feature extraction layer, including:
performing, through a current feature extraction layer in each feature extraction module, convolution on a vector input to the current feature extraction layer to obtain a convolution vector of the current feature extraction layer, and inputting the convolution vector to the residual addition layer corresponding to the current feature extraction layer and to the residual addition layer corresponding to the next feature extraction layer that follows the current feature extraction layer in the connection order, wherein, if the current feature extraction layer is the first feature extraction layer in the connection order in the corresponding feature extraction module, the vector input to the current feature extraction layer is the embedded feature vector;
and summing, through the residual addition layer corresponding to the current feature extraction layer, the convolution vector output by the current feature extraction layer and the convolution vector output by the preceding feature extraction layer that precedes the current feature extraction layer in the connection order, and inputting the summed vector to the next feature extraction layer that follows the current feature extraction layer in the connection order.
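One possible reading of the residual addition layers, sketched for illustration only and assuming equal channel counts so that the sums are well defined:

    import torch.nn as nn

    class ResidualConvStack(nn.Module):
        # Sketch: each layer's convolution output is summed with the previous
        # layer's output before being fed to the following layer.
        def __init__(self, channels=128, window_width=3, num_layers=3):
            super().__init__()
            self.convs = nn.ModuleList([
                nn.Conv1d(channels, channels, window_width, padding=window_width // 2)
                for _ in range(num_layers)])

        def forward(self, x):                 # x: (batch, channels, time) embedded features
            prev_out = None
            for conv in self.convs:
                out = conv(x)
                x = out if prev_out is None else out + prev_out   # residual addition
                prev_out = out
            return x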
8. The video classification method according to claim 2, wherein the attention mechanism module comprises an attention weight matrix;
the obtaining, by the attention mechanism module of the video classification model, the attention feature vectors corresponding to the convolution feature vectors by performing weighted summation on the convolution feature vectors according to the corresponding attention weight matrices includes:
and performing, through the attention weight matrix corresponding to each feature extraction module in the attention mechanism module, weighted summation on the vector subsequences, at the time points in the time dimension, of the convolution feature vector output by that feature extraction module, according to the attention weights corresponding to the vector subsequences at the respective time points, to obtain the attention feature vector of the convolution feature vector.
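For illustration only, a minimal sketch of this per-time-step attention, assuming the attention weights are produced by a small learned scoring layer (an assumption; the claim only requires an attention weight matrix):

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        # Sketch: one attention weight per time step, and a weighted sum over
        # the time dimension yielding the attention feature vector.
        def __init__(self, channels=128):
            super().__init__()
            self.score = nn.Linear(channels, 1)

        def forward(self, h):                               # h: (batch, time, channels)
            weights = torch.softmax(self.score(h), dim=1)   # attention weight per time step
            return (weights * h).sum(dim=1)                 # attention feature vector, (batch, channels)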
9. The video classification method according to any one of claims 1 to 8, characterized in that the classification module comprises a fusion sub-module and a classification sub-module;
the determining, by the classification module of the video classification model, the classification result of the target video according to the attention feature vector includes:
fusing each attention feature vector through a fusion submodule of the video classification model to obtain a fused vector;
and determining, through a classification submodule of the video classification model, the classification result of the target video according to the fused vector.
10. The video classification method according to claim 9, wherein the convolution feature vectors have a time dimension and a feature class dimension, a third vector subsequence of a convolution feature vector at each time point in the time dimension is composed of the vectors corresponding, at that time point, to the feature classes in the feature class dimension, and a fourth vector subsequence of the convolution feature vector at each feature class in the feature class dimension is composed of the vectors corresponding, at that feature class, to the time points in the time dimension;
the video classification model also comprises pooling modules corresponding to the feature extraction modules; the fusion submodule includes: a feature fusion submodule and a multi-scale feature fusion submodule;
the video classification method further comprises the following steps:
inputting the convolution feature vectors output by each feature extraction module into the corresponding pooling module;
performing, through the pooling module, pooling operation on each fourth vector subsequence in the convolution feature vector to obtain a pooled feature vector corresponding to the convolution feature vector;
the fusing each attention feature vector through the fusion submodule of the video classification model to obtain a fused vector includes:
fusing the attention feature vectors corresponding to the feature extraction modules and the pooled feature vectors to obtain fused sub-vectors through the feature fusion sub-modules corresponding to the feature extraction modules, and outputting the fused sub-vectors to the multi-scale feature fusion sub-modules;
and fusing, through the multi-scale feature fusion sub-module, the fused sub-vectors output by the feature fusion sub-modules to obtain the fused vector.
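A minimal sketch of this fusion path, for illustration only; the pooling operator (max over time), the linear fusion layers and all shapes are assumptions.

    import torch
    import torch.nn as nn

    class MultiScaleFusion(nn.Module):
        # Sketch: each branch pools its convolution feature vector over the time
        # dimension, fuses the pooled vector with that branch's attention feature
        # vector (feature fusion sub-modules), and the per-branch results are
        # fused across scales (multi-scale feature fusion sub-module).
        def __init__(self, channels=128, num_branches=3):
            super().__init__()
            self.branch_fuse = nn.ModuleList(
                [nn.Linear(2 * channels, channels) for _ in range(num_branches)])
            self.scale_fuse = nn.Linear(num_branches * channels, channels)

        def forward(self, conv_feats, attn_feats):
            # conv_feats: list of (batch, time, channels); attn_feats: list of (batch, channels)
            fused_sub = []
            for conv, attn, fuse in zip(conv_feats, attn_feats, self.branch_fuse):
                pooled = conv.max(dim=1).values                             # pooling over the time dimension
                fused_sub.append(fuse(torch.cat([pooled, attn], dim=-1)))   # fused sub-vector
            return self.scale_fuse(torch.cat(fused_sub, dim=-1))            # fused vector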
11. The video classification method according to claim 10, wherein the fusing the attention feature vector and the pooled feature vector corresponding to each feature extraction module by the feature fusion sub-module corresponding to each feature extraction module to obtain a fused sub-vector, and outputting the fused sub-vector to the multi-scale feature fusion sub-module, includes:
splicing the attention feature vectors corresponding to the feature extraction modules and the pooled feature vectors through the feature fusion sub-modules corresponding to the feature extraction modules to obtain spliced vectors;
respectively transforming the corresponding splicing vectors based on the first activation function through each feature fusion submodule to obtain corresponding first transformed vectors;
respectively transforming the corresponding first transformed vectors based on the second activation function through each feature fusion submodule to obtain corresponding first weight vectors;
determining, through each feature fusion sub-module, a second weight vector based on the corresponding first weight vector, performing weighted summation with the first weight vector as the weight vector of the pooled feature vector and the second weight vector as the weight vector of the attention feature vector to obtain a fused sub-vector, and outputting the fused sub-vector to the multi-scale feature fusion sub-module, wherein the vector elements at any same position in the first weight vector and the second weight vector sum to 1.
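An illustrative sketch of this gated fusion, assuming each "transformation based on an activation function" is a linear layer followed by that activation, with tanh as the first activation and sigmoid as the second (both assumptions):

    import torch
    import torch.nn as nn

    class GatedBranchFusion(nn.Module):
        # Sketch: the spliced vector yields a weight vector for the pooled
        # features; its element-wise complement weights the attention features,
        # so the two weights sum to 1 at every position.
        def __init__(self, channels=128):
            super().__init__()
            self.transform = nn.Linear(2 * channels, channels)
            self.gate = nn.Linear(channels, channels)

        def forward(self, pooled, attn):                    # both (batch, channels)
            spliced = torch.cat([pooled, attn], dim=-1)     # spliced vector
            t = torch.tanh(self.transform(spliced))         # first transformed vector
            w1 = torch.sigmoid(self.gate(t))                # first weight vector (for pooled)
            w2 = 1.0 - w1                                   # second weight vector (for attention)
            return w1 * pooled + w2 * attn                  # fused sub-vector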
12. The video classification method according to claim 9, wherein the convolution feature vectors have a time dimension and a feature class dimension, a third vector subsequence of a convolution feature vector at each time point in the time dimension is composed of the vectors corresponding, at that time point, to the feature classes in the feature class dimension, and a fourth vector subsequence of the convolution feature vector at each feature class in the feature class dimension is composed of the vectors corresponding, at that feature class, to the time points in the time dimension;
the video classification model further comprises pooling modules corresponding to the feature extraction modules, and the fusion sub-module comprises a feature recalibration module;
the video classification method further comprises the following steps:
inputting the convolution feature vectors output by each feature extraction module into the corresponding pooling module;
performing, through the pooling module, pooling operation on each fourth vector subsequence in the convolution feature vector to obtain a pooled feature vector corresponding to the convolution feature vector;
the fusing each attention feature vector through the fusion submodule of the video classification model to obtain a fused vector includes:
inputting each pooled feature vector and each attention feature vector into the feature recalibration module as a feature vector to be processed;
determining, through the feature recalibration module, an importance weight of each feature vector to be processed;
weighting, through the feature recalibration module, each feature vector to be processed based on the importance weight to obtain a weighted feature vector;
and performing corresponding vector element addition operation on the weighted feature vector, the pooled feature vector and the attention feature vector through the feature recalibration module to obtain a fused vector.
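One possible reading of the feature recalibration module, sketched for illustration only; the squeeze-and-excitation style bottleneck used to produce the importance weights is an assumption, and pooled and attention feature vectors are assumed to share the same dimensionality.

    import torch
    import torch.nn as nn

    class FeatureRecalibration(nn.Module):
        # Sketch: importance weights are learned for the pooled and attention
        # feature vectors, the vectors are re-weighted, and the re-weighted
        # vectors are added element-wise to the original vectors.
        def __init__(self, dim=128, reduction=4):
            super().__init__()
            self.importance = nn.Sequential(
                nn.Linear(2 * dim, dim // reduction), nn.ReLU(),
                nn.Linear(dim // reduction, 2 * dim), nn.Sigmoid())

        def forward(self, pooled, attn):                    # both (batch, dim)
            to_process = torch.cat([pooled, attn], dim=-1)  # feature vectors to be processed
            weights = self.importance(to_process)           # importance weight per element
            weighted = weights * to_process                 # weighted feature vectors
            w_pooled, w_attn = weighted.chunk(2, dim=-1)
            return w_pooled + w_attn + pooled + attn        # element-wise addition -> fused vector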
13. The video classification method according to claim 9, wherein the classification sub-module comprises a vector filtering fusion layer and a classification layer;
the determining, by the classification submodule of the video classification model, the classification result of the target video according to the fused vector includes:
transforming, through the vector filtering fusion layer, the fused vector based on a third activation function to obtain a third weight vector, and transforming the fused vector based on a fourth activation function to obtain a second transformed vector;
obtaining, through the vector filtering fusion layer, a fourth weight vector based on the third weight vector, wherein the vector elements at any same position in the third weight vector and the fourth weight vector sum to 1;
performing, through the vector filtering fusion layer, weighted summation with the third weight vector as the weight vector of the second transformed vector and the fourth weight vector as the weight vector of the fused vector, to obtain a classification vector;
determining, by the classification layer, a classification result of the target video based on the classification vector.
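For illustration only, a sketch of this vector filtering fusion layer and classification layer, assuming a sigmoid as the third activation, a tanh as the fourth, and a linear-plus-softmax classification layer (all assumptions):

    import torch
    import torch.nn as nn

    class FilteringClassifier(nn.Module):
        # Sketch: a highway-style gate mixes the transformed fused vector with
        # the fused vector itself; a linear + softmax layer then classifies.
        def __init__(self, dim=128, num_classes=2):
            super().__init__()
            self.gate = nn.Linear(dim, dim)
            self.transform = nn.Linear(dim, dim)
            self.classify = nn.Linear(dim, num_classes)

        def forward(self, fused):                            # fused vector, (batch, dim)
            w3 = torch.sigmoid(self.gate(fused))             # third weight vector
            transformed = torch.tanh(self.transform(fused))  # second transformed vector
            w4 = 1.0 - w3                                    # fourth weight vector (sums to 1 with w3)
            cls_vec = w3 * transformed + w4 * fused          # classification vector
            return torch.softmax(self.classify(cls_vec), dim=-1)   # class probabilities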
14. The video classification method according to claim 9, wherein the classification task of the video classification model comprises a video anomaly identification task;
the determining, by the classification submodule of the video classification model, the classification result of the target video according to the fused vector includes:
and determining, through the classification submodule of the video classification model, a probability that the target video is an abnormal video according to the fused vector.
15. A video classification apparatus, comprising:
an image sequence obtaining unit, configured to obtain a target image sequence, where the target image sequence includes N images, where the target image sequence is derived from a target video, and N is a positive integer greater than or equal to 1;
a feature sequence acquisition unit, configured to identify state information of each image in the target image sequence on a preset scene state dimension to obtain a state information subsequence of each image, and obtain a feature sequence of the target image sequence based on the state information subsequence of each image;
a convolution unit, configured to perform convolution operation on the feature sequence through a plurality of feature extraction modules of a video classification model to obtain corresponding convolution feature vectors;
an attention mechanism unit, configured to perform weighted summation on each convolution feature vector according to a corresponding attention weight matrix through an attention mechanism module of the video classification model, to obtain an attention feature vector corresponding to each convolution feature vector;
and a classification unit, configured to determine a classification result of the target video according to the attention feature vectors through a classification module of the video classification model.
CN202011077360.3A 2020-10-10 2020-10-10 Video classification method and device Pending CN112232164A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011077360.3A CN112232164A (en) 2020-10-10 2020-10-10 Video classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011077360.3A CN112232164A (en) 2020-10-10 2020-10-10 Video classification method and device

Publications (1)

Publication Number Publication Date
CN112232164A true CN112232164A (en) 2021-01-15

Family

ID=74111805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011077360.3A Pending CN112232164A (en) 2020-10-10 2020-10-10 Video classification method and device

Country Status (1)

Country Link
CN (1) CN112232164A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220253633A1 (en) * 2021-02-08 2022-08-11 Meta Platforms, Inc. Classifying a video stream using a self-attention-based machine-learning model
CN113204992B (en) * 2021-03-26 2023-10-27 北京达佳互联信息技术有限公司 Video quality determining method and device, storage medium and electronic equipment
CN113204992A (en) * 2021-03-26 2021-08-03 北京达佳互联信息技术有限公司 Video quality determination method and device, storage medium and electronic equipment
CN112800278A (en) * 2021-03-30 2021-05-14 腾讯科技(深圳)有限公司 Video type determination method and device and electronic equipment
CN112800278B (en) * 2021-03-30 2021-07-09 腾讯科技(深圳)有限公司 Video type determination method and device and electronic equipment
CN113177529B (en) * 2021-05-27 2024-04-23 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for identifying screen
CN113177529A (en) * 2021-05-27 2021-07-27 腾讯音乐娱乐科技(深圳)有限公司 Method, device and equipment for identifying screen splash and storage medium
CN113762251A (en) * 2021-08-17 2021-12-07 慧影医疗科技(北京)有限公司 Target classification method and system based on attention mechanism
CN113762251B (en) * 2021-08-17 2024-05-10 慧影医疗科技(北京)股份有限公司 Attention mechanism-based target classification method and system
CN113807222B (en) * 2021-09-07 2023-06-27 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling
CN113807222A (en) * 2021-09-07 2021-12-17 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling
CN114120420B (en) * 2021-12-01 2024-02-13 北京百度网讯科技有限公司 Image detection method and device
CN114120420A (en) * 2021-12-01 2022-03-01 北京百度网讯科技有限公司 Image detection method and device
WO2024001139A1 (en) * 2022-06-30 2024-01-04 海信集团控股股份有限公司 Video classification method and apparatus and electronic device
CN117932072A (en) * 2024-03-20 2024-04-26 华南理工大学 Text classification method based on feature vector sparsity

Similar Documents

Publication Publication Date Title
CN112232164A (en) Video classification method and device
CN112784764B (en) Expression recognition method and system based on local and global attention mechanism
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
CN111260653B (en) Image segmentation method and device, storage medium and electronic equipment
WO2021139307A1 (en) Video content recognition method and apparatus, storage medium, and computer device
CN111241345A (en) Video retrieval method and device, electronic equipment and storage medium
US20220222796A1 (en) Image processing method and apparatus, server, and storage medium
CN111666919B (en) Object identification method and device, computer equipment and storage medium
CN112132197A (en) Model training method, image processing method, device, computer equipment and storage medium
CN110222718B (en) Image processing method and device
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN111539290A (en) Video motion recognition method and device, electronic equipment and storage medium
CN111242019A (en) Video content detection method and device, electronic equipment and storage medium
Dai et al. Video scene segmentation using tensor-train faster-RCNN for multimedia IoT systems
CN110619284A (en) Video scene division method, device, equipment and medium
CN114333049A (en) Pig attack behavior identification method, pig attack behavior identification system, computer equipment and storage medium
CN111282281B (en) Image processing method and device, electronic equipment and computer readable storage medium
Wang et al. Basketball shooting angle calculation and analysis by deeply-learned vision model
CN113657272B (en) Micro video classification method and system based on missing data completion
Lv et al. An inverted residual based lightweight network for object detection in sweeping robots
Chen et al. Video‐based action recognition using spurious‐3D residual attention networks
Ma et al. Cascade transformer decoder based occluded pedestrian detection with dynamic deformable convolution and Gaussian projection channel attention mechanism
CN116189281B (en) End-to-end human behavior classification method and system based on space-time self-adaptive fusion
CN111652073B (en) Video classification method, device, system, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40037775

Country of ref document: HK