CN115223079A - Video classification method and device - Google Patents

Video classification method and device

Info

Publication number
CN115223079A
Authority
CN
China
Prior art keywords
vector
level
video
feature vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210778969.6A
Other languages
Chinese (zh)
Inventor
高雪松
王博
林玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd filed Critical Hisense Group Holding Co Ltd
Priority to CN202210778969.6A priority Critical patent/CN115223079A/en
Publication of CN115223079A publication Critical patent/CN115223079A/en
Priority to PCT/CN2022/143819 priority patent/WO2024001139A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of video processing and discloses a video classification method and device. The method includes: performing feature extraction on a video to be processed through convolutions with different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sorting the target feature vectors based on the sizes of the convolution kernels; for any target feature vector, updating the target feature vector based on the target feature vectors adjacent to it; and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vectors to obtain a classification vector representing the category of the video to be processed. The updated target feature vectors reflect the association among different target feature vectors and contain global view information; performing feature fusion on the feature vector sequences and the updated target feature vectors therefore yields a classification vector that accurately represents the category of the video to be processed.

Description

Video classification method and device
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video classification method and apparatus.
Background
With the rapid popularization of the mobile internet, videos are popular with people due to rich contents and various expression forms. In order to manage the video conveniently, the video needs to be classified, that is, the category to which the video belongs is determined.
In the related art, two parallel convolutional neural networks (a slow channel and a fast channel) are applied to the same video segment for processing: the slow channel analyzes the static content in the video, and the fast channel analyzes the dynamic content in the video.
However, the above process may lose part of the spatio-temporal information; for example, when constructing the slow-channel stream, down-sampling may cause a loss of temporal information, so the accuracy of video classification is reduced.
Disclosure of Invention
The application provides a video classification method and device, which are used for accurately classifying videos.
In a first aspect, an embodiment of the present application provides a video classification method, where the method includes:
extracting features of a video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels;
for any target feature vector, updating the target feature vector based on a target feature vector adjacent to the target feature vector;
and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
In a second aspect, an embodiment of the present application provides a video classification apparatus, including:
the feature extraction module is used for extracting features of the video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the size of the convolution kernels;
the updating module is used for updating the target characteristic vector based on the target characteristic vector adjacent to the target characteristic vector aiming at any target characteristic vector;
and the fusion module is used for performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory;
wherein the memory stores program code which, when executed by the processor, causes the processor to perform the video classification method of any of the first aspects.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the method for classifying videos according to any one of the first aspect is implemented.
In addition, for technical effects brought by any one implementation manner of the second aspect to the fourth aspect, reference may be made to technical effects brought by different implementation manners of the first aspect, and details are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic flow chart of a first video classification method provided in an embodiment of the present application;
FIG. 2 is a diagram of a first system architecture provided in an embodiment of the present application;
FIG. 3 is a diagram of a second system architecture provided by an embodiment of the present application;
fig. 4 is a schematic flow chart of a second video classification method provided in the embodiment of the present application;
fig. 5 is a schematic flowchart of a feature vector sequence and a target feature vector determination method provided in an embodiment of the present application;
fig. 6 is a schematic flow chart of a third video classification method provided in the embodiment of the present application;
fig. 7 is a schematic flowchart of a fourth video classification method according to an embodiment of the present application;
FIG. 8 is a schematic flow chart diagram of a target feature vector updating method according to an embodiment of the present application;
fig. 9 is a schematic flow chart of a fifth video classification method according to an embodiment of the present application;
fig. 10 is a schematic flow chart of an adjustment vector determination method according to an embodiment of the present application;
FIG. 11 is a diagram illustrating a process for determining an adjustment vector according to an embodiment of the present disclosure;
fig. 12 is a schematic flow chart of a first feature fusion method provided in the embodiments of the present application;
fig. 13 is a schematic flowchart of a sixth video classification method according to an embodiment of the present application;
fig. 14 is a schematic flowchart of a second feature fusion method provided in an embodiment of the present application;
FIG. 15 is a diagram illustrating vector quantity modification according to an embodiment of the present application;
fig. 16 is a schematic flow chart of a seventh video classification method according to an embodiment of the present application;
fig. 17 is a schematic flowchart of an eighth video classification method according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of a first video classification apparatus according to an embodiment of the present application;
fig. 19 is a schematic structural diagram of a second video classification apparatus according to an embodiment of the present application;
fig. 20 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In the description of the present application, unless otherwise expressly specified or limited, the term "coupled" is to be construed broadly, e.g., as meaning directly coupled to or indirectly coupled through intervening elements, or as meaning communicating between two devices. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
The terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, the meaning of "a plurality" is two or more unless otherwise specified.
With the rapid popularization of the mobile internet, videos are popular with people due to rich contents and various expression forms. In order to manage the video conveniently, the video needs to be classified, that is, the category to which the video belongs is determined.
In the related art, two parallel convolutional neural networks (one slow channel and one fast channel) are applied to the same video segment for processing: the slow channel analyzes the static content in the video, and the fast channel analyzes the dynamic content in the video.
However, the above process may lose part of the spatio-temporal information; for example, when constructing the slow-channel stream, down-sampling may cause a loss of temporal information, so the accuracy of video classification is reduced.
Referring to fig. 1, in some embodiments, video classification is performed by:
step S101: performing feature extraction on a video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel;
step S102: and performing feature fusion on all the feature vector sequences and the target feature vectors through a cross attention mechanism to obtain classification vectors, and determining the video category based on the classification vectors.
Fig. 2 shows a system architecture corresponding to the above embodiment.
However, the target feature vectors in the above manner do not embody the association relationship between different target feature vectors, and lack global view information; in addition, the feature fusion performed by the cross attention mechanism cannot effectively extract the key information, and thus it is difficult to accurately determine the video type of the video to be processed from the classification vector.
Based on this, the embodiment of the present application provides a video classification method and apparatus, and the method includes: extracting features of a video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels; for any target feature vector, updating the target feature vector based on a target feature vector adjacent to the target feature vector; and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
Fig. 3 shows a system architecture corresponding to the above embodiment.
According to this scheme, after the feature vector sequence and the target feature vector corresponding to each convolution kernel are obtained, the target feature vectors are sorted based on the sizes of the convolution kernels, and after the sorting, each target feature vector is updated based on the other target feature vectors adjacent to it, so that the updated target feature vectors reflect the association among different target feature vectors and contain global view information; feature fusion is then performed on the feature vector sequences and the updated target feature vectors to obtain a classification vector capable of accurately representing the category of the video to be processed, and the video can then be accurately classified based on the classification vector.
The following describes the technical solutions of the present application and how to solve the above technical problems in detail with reference to the accompanying drawings and specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
An embodiment of the present application provides a second video classification method, which, as shown in fig. 4, may include:
step S401: and extracting features of the video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels.
In this embodiment, convolutions with different convolution kernels (e.g., 3D convolutions) are provided: a smaller convolution kernel captures smaller tubelets (video objects) and fine-grained motion, while a larger convolution kernel captures larger, slowly changing scenes. Feature extraction through convolutions with different convolution kernels therefore yields comprehensive feature information.
Step S402: for any target feature vector, updating the target feature vector based on a target feature vector adjacent to the target feature vector.
Based on the above, the target feature vectors are sorted according to the sizes of the convolution kernels, and the association between adjacent target feature vectors is then combined to update each target feature vector. The updated target feature vectors represent the association between different target feature vectors and contain global view information, so a classification vector that more accurately represents the category of the video to be processed can be obtained from the updated target feature vectors.
Step S403: and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
According to this scheme, after the feature vector sequence and the target feature vector corresponding to each convolution kernel are obtained, the target feature vectors are sorted based on the sizes of the convolution kernels, and after the sorting, each target feature vector is updated based on the other target feature vectors adjacent to it, so that the updated target feature vectors reflect the association among different target feature vectors and contain global view information; feature fusion is then performed on the feature vector sequences and the updated target feature vectors to obtain a classification vector capable of accurately representing the category of the video to be processed, and video classification can then be performed accurately based on the classification vector.
In some alternative embodiments, the above feature vector sequence and the target feature vector determination method may be as shown in fig. 5:
step S501: and performing feature extraction on the video to be processed through the convolution of the convolution kernel aiming at any convolution kernel to obtain a plurality of multidimensional matrixes corresponding to the convolution kernel.
Illustratively, the video to be processed is represented as V ∈ R^{T×H×W×C}, where T is the number of image frames in the video, C is the number of channels of each image frame, H is the height, and W is the width. The video to be processed is input into each convolution separately to obtain the N multidimensional matrices output by that convolution; each matrix has dimensionality t×h×w, and the set of multidimensional matrices is expressed as z ∈ R^{N×t×h×w×C}, where N is obtained by dividing the video volume T×H×W into blocks of size t×h×w, i.e., N = (T/t)×(H/h)×(W/w).
step S502: and respectively carrying out linear transformation on the multi-dimensional matrixes to obtain the characteristic vector sequence.
In this embodiment, each multidimensional matrix is linearly transformed into a one-dimensional vector, and these vectors form the feature vector sequence.
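As an illustration of steps S501 and S502, the following PyTorch-style sketch applies 3D convolutions with several kernel sizes to a video clip and flattens each output into a feature vector sequence. It is only a minimal example under assumed choices (the kernel sizes, the embedding dimension and the module name are not specified by this application); a 3D convolution whose stride equals its kernel size performs the block extraction and the linear transformation in one step.

import torch
import torch.nn as nn

class MultiKernelTubeletEmbed(nn.Module):
    # Extracts one token sequence per 3D convolution kernel size from a video clip.
    def __init__(self, in_channels: int = 3, dim: int = 768,
                 kernel_sizes=((2, 8, 8), (4, 16, 16), (8, 32, 32))):
        super().__init__()
        # Non-overlapping 3D convolutions: stride equals kernel size, so each output position is one tubelet.
        self.convs = nn.ModuleList([
            nn.Conv3d(in_channels, dim, kernel_size=k, stride=k) for k in kernel_sizes
        ])

    def forward(self, video: torch.Tensor):
        # video: (B, C, T, H, W); returns one (B, N_i, dim) token sequence per kernel size.
        sequences = []
        for conv in self.convs:
            feat = conv(video)                        # (B, dim, T/t, H/h, W/w)
            tokens = feat.flatten(2).transpose(1, 2)  # N = (T/t)*(H/h)*(W/w) tokens of dimension dim
            sequences.append(tokens)
        return sequences

For example, a clip of shape (1, 3, 16, 224, 224) would yield three token sequences whose lengths depend on the assumed kernel sizes.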
Step S503: and inputting the characteristic vector sequence and a preset vector into an encoder to obtain a target characteristic vector corresponding to the convolution kernel output by the encoder.
In this embodiment, in order to fuse the information in the feature vector sequence more equitably, a learnable preset vector (token_CLS) is added in front of the feature vector sequence (token_1, token_2, …, token_N), and position embeddings are then added.
token_CLS, token_1, token_2, …, token_N are input into an encoder to obtain token_CLS', and token_CLS' is taken as the target feature vector corresponding to the convolution kernel.
Since the self-attention mechanism has quadratic complexity and it is computationally difficult to jointly process all the vector sequences, the encoder may employ a multi-view encoder (Transformer), which is composed of multi-head self-attention (MSA), layer normalization (LN), and a multi-layer perceptron (MLP).
Illustratively, a separate encoder (consisting of L Transformer layers) is used for each set of vectors (the feature vector sequence and the preset vector). In the Transformer for the i-th view, the conversion from the j-th layer to the (j+1)-th layer is as follows:
y_{i,j} = MSA(LN(z_{i,j})) + z_{i,j}
z_{i,j+1} = MLP(LN(y_{i,j})) + y_{i,j}
After the view is processed by the Transformer, the vector corresponding to token_CLS (i.e., token_CLS') is taken as the target feature vector.
Correspondingly, the embodiment of the present application provides a third video classification method, as shown in fig. 6, the method may include:
step S601: and aiming at any convolution kernel, performing feature extraction on the to-be-processed video through the convolution of the convolution kernel to obtain a plurality of multidimensional matrixes corresponding to the convolution kernel.
Step S602: and respectively carrying out linear transformation on the multi-dimensional matrixes to obtain the characteristic vector sequence.
Step S603: and inputting the characteristic vector sequence and a preset vector into an encoder to obtain a target characteristic vector corresponding to the convolution kernel output by the encoder.
Step S604: the target feature vectors are sorted based on the size of the convolution kernel.
Step S605: and for any target feature vector, updating the target feature vector based on the target feature vector adjacent to the target feature vector.
Step S606: and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
The specific implementation manner of steps S601 to S606 can refer to the above embodiments, and is not described herein again.
The embodiment of the present application provides a fourth video classification method, as shown in fig. 7, the method may include:
step S701: and extracting features of the video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels.
The specific implementation manner of step S701 may refer to the above embodiments, and details are not described here.
Step S702: and inputting all target characteristic vectors of the video to be processed into an updating model, and updating the target characteristic vectors based on the target characteristic vectors adjacent to any target characteristic vector through the updating model.
In this embodiment, the update model is obtained by training a model to learn the association between adjacent target feature vectors; each target feature vector is then accurately updated by the update model based on the target feature vectors adjacent to it.
Step S703: and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
The specific implementation manner of step S703 may refer to the above embodiments, and is not described herein again.
According to this scheme, each target feature vector is accurately updated through the update model based on the target feature vectors adjacent to it, so that the updated target feature vectors reflect the association among different target feature vectors.
In some alternative embodiments, the above target feature vector updating method may refer to fig. 8:
step S801: and carrying out average pooling operation on the first characteristic vector of the kth level and the second characteristic vector of the kth level through the updating model to obtain an average vector of the kth level.
Here 1 ≤ k ≤ K, and K is the total number of levels of iterative updating of the update model; the first feature vector of level 1 is any target feature vector, and the second feature vector of level 1 is the adjacent target feature vector.
In this embodiment, the update model is provided with K update layers, that is, it needs to be updated iteratively K times.
Illustratively, suppose there are X target feature vectors in total. For the first target feature vector, the average vector of the k-th level is z_avg^{1,k} = avg(z_rep^{1,k}, z_rep^{2,k});
for the x-th target feature vector (1 < x < X), the average vector of the k-th level is z_avg^{x,k} = avg(z_rep^{x-1,k}, z_rep^{x,k}, z_rep^{x+1,k});
for the X-th target feature vector, the average vector of the k-th level is z_avg^{X,k} = avg(z_rep^{X-1,k}, z_rep^{X,k});
where avg denotes the average pooling operation, and z_rep^{x,k} denotes the x-th first feature vector of the k-th level.
It can be understood that the second feature vectors correspond to the adjacent target feature vectors: for the x-th target feature vector, the second feature vectors at level 1 are the (x+1)-th and/or (x-1)-th target feature vectors, and the second feature vectors at the other levels are the (x+1)-th and/or (x-1)-th first feature vectors calculated as described above.
Step S802: and performing full-connection layer calculation on the average vector of the kth level to obtain an adjustment vector of the kth level.
After the average vector is determined, a full-connection layer calculation is performed through the full-connection layer, and the adjustment vector at this level is determined.
Step S803: determining a sum of the adjustment vector of the k-th level and the first feature vector of the k-th level as a first feature vector of a (k + 1) -th level.
And the first feature vector of the K level is the updated target feature vector.
Illustratively, for the x-th target feature vector, the first feature vector of the (k+1)-th level is z_rep^{x,k+1} = Δz^{x,k} + z_rep^{x,k}, where Δz^{x,k} is the adjustment vector of the x-th target feature vector at the k-th level and z_rep^{x,k} is the x-th first feature vector of the k-th level.
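The per-level update of the target feature vectors can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions (the module name, the number of levels K and the use of a single fully connected layer per level are example choices): each target feature vector is average-pooled with its adjacent vectors, the result is passed through a full-connection layer to produce the adjustment vector, and the adjustment vector is added to the current first feature vector.

import torch
import torch.nn as nn

class CLSUpdateModel(nn.Module):
    # Iteratively updates the X target (CLS) feature vectors over K levels.
    def __init__(self, dim: int, num_levels: int = 3):
        super().__init__()
        self.num_levels = num_levels
        # One fully connected layer per level turns the average vector into an adjustment vector.
        self.fc = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_levels)])

    def forward(self, z_rep: torch.Tensor) -> torch.Tensor:
        # z_rep: (X, D) target feature vectors, sorted by convolution kernel size.
        for k in range(self.num_levels):
            updated = []
            for x in range(z_rep.size(0)):
                lo, hi = max(0, x - 1), min(z_rep.size(0), x + 2)   # the vector itself plus its neighbours
                z_avg = z_rep[lo:hi].mean(dim=0)                    # average vector of the k-th level
                delta = self.fc[k](z_avg)                           # adjustment vector of the k-th level
                updated.append(z_rep[x] + delta)                    # first feature vector of the (k+1)-th level
            z_rep = torch.stack(updated, dim=0)
        return z_rep                                                # updated target feature vectors

For instance, CLSUpdateModel(768)(torch.randn(3, 768)) would return three updated 768-dimensional target feature vectors.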
Correspondingly, a fifth video classification method is provided in an embodiment of the present application, and as shown in fig. 9, the method may include:
step S901: and extracting features of the video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels.
Step S902: and carrying out average pooling operation on the first characteristic vector of the kth level and the second characteristic vector of the kth level through the updating model to obtain an average vector of the kth level.
Step S903: and performing full-connection layer calculation on the average vector of the kth level to obtain an adjustment vector of the kth level.
Step S904: determining the sum of the adjustment vector of the k level and the first feature vector of the k level as a first feature vector of a (k + 1) th level; and the first feature vector of the K level is the updated target feature vector.
Step S905: and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
The specific implementation manner of steps S901 to S905 can refer to the above embodiments, and will not be described herein.
In this scheme, for each update iteration of the update model, the average pooling operation is performed on the first feature vector of the level and the second feature vectors (adjacent feature vectors) of the level, so that the average vector of the target feature vector at this level is obtained; the adjustment vector of the target feature vector at this level is then obtained through a full-connection layer calculation, and the first feature vector of the level is adjusted based on this adjustment vector to obtain the first feature vector of the next level. Each update iteration aggregates more information from the other target feature vectors, and the updated target feature vector is obtained after multiple update iterations.
In some alternative embodiments, the adjustment vector determination method of the k-th level can be seen from fig. 10:
step S1001: performing first full-connected layer calculation on the average vector of the kth level to obtain a first vector of the kth level; and performing second full-connected layer calculation on the average vector of the kth level to obtain a second vector of the kth level, and performing normalization calculation on the second vector of the kth level to obtain weight information of the kth level.
In implementation, the average vector of the kth level is respectively input into two branches, the first branch uses a full connection layer, and the average vector of the kth level is subjected to first full connection layer calculation to obtain a first vector of the kth level; the second branch uses a full connection layer and a normalization (SoftMax) layer, the full connection layer carries out second full connection layer calculation on the average vector of the kth level to obtain a second vector of the kth level, and the normalization layer carries out normalization calculation on the second vector of the kth level to obtain weight information of the kth level.
Step S1002: and obtaining an adjustment vector of the kth level based on the first vector of the kth level and the weight information of the kth level.
Illustratively, the first vector of the k-th level includes Y feature values, and the weight information includes a weight value corresponding to each feature value. Each feature value in the first vector of the k-th level is multiplied by the corresponding weight value to obtain an adjustment value corresponding to that feature value; the Y adjustment values constitute the adjustment vector.
According to the scheme, a first vector is obtained through full-connection calculation; obtaining weight information through full-connection calculation and normalization calculation; and determining an adjustment vector representing the key information of the target feature based on the first vector and the weight information, so that the adjustment of the first feature vector based on the adjustment vector can aggregate more information of other target feature vectors and retain the key information of the target feature vectors.
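The two-branch computation of the adjustment vector can be written as the following minimal PyTorch sketch (the module name and layer dimensions are assumptions; only the structure described above, a fully connected branch combined element-wise with a fully connected plus SoftMax branch, is taken from this application). Used inside an update level, it would replace the single fully connected layer of the earlier sketch.

import torch
import torch.nn as nn

class AdjustmentBranch(nn.Module):
    # Turns the k-th level average vector into the k-th level adjustment vector.
    def __init__(self, dim: int):
        super().__init__()
        self.fc_value = nn.Linear(dim, dim)     # first branch: fully connected layer -> first vector
        self.fc_weight = nn.Linear(dim, dim)    # second branch: fully connected layer -> second vector

    def forward(self, z_avg: torch.Tensor) -> torch.Tensor:
        first_vec = self.fc_value(z_avg)                          # first vector of the k-th level
        weights = torch.softmax(self.fc_weight(z_avg), dim=-1)    # normalization yields the weight information
        return first_vec * weights                                # element-wise product -> adjustment vector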
Referring to fig. 11, a schematic diagram of a process for determining an adjustment vector is shown.
In some alternative embodiments, the feature fusion method described above can be seen in fig. 12:
step S1201: and splicing all the characteristic vector sequences of the video to be processed and the updated target characteristic vector to obtain an initial characteristic matrix.
In this embodiment, S vectors of dimension 1×D (S being the total number of vectors, i.e., all the feature vector sequences of the video to be processed together with the updated target feature vectors) are spliced to obtain an initial feature matrix F, which is represented as F ∈ R^{S×D}.
Step S1202: inputting the initial characteristic matrix into a fusion model, and performing characteristic fusion on the initial characteristic matrix through the fusion model to obtain a classification vector representing the category of the video to be processed.
The fusion model is obtained by training a model to learn the correlation between the initial feature matrix and the classification vector; feature fusion is then performed on the initial feature matrix through the fusion model, and the classification vector is accurately determined.
Correspondingly, a sixth video classification method is provided in an embodiment of the present application, and as shown in fig. 13, the method may include:
step S1301: and extracting features of the video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels.
Step S1302: for any target feature vector, updating the target feature vector based on a target feature vector adjacent to the target feature vector.
Step S1303: and splicing all the characteristic vector sequences of the video to be processed and the updated target characteristic vector to obtain an initial characteristic matrix.
Step S1304: inputting the initial characteristic matrix into a fusion model, and performing characteristic fusion on the initial characteristic matrix through the fusion model to obtain a classification vector representing the category of the video to be processed.
The specific implementation manner of steps S1301 to S1304 may refer to the above embodiments, and details are not described herein.
According to this scheme, feature fusion is performed on the initial feature matrix through the fusion model, which effectively removes redundant information in the feature vectors while retaining their key information, thereby improving the video classification precision.
In some alternative embodiments, the feature fusion method described above can be seen in FIG. 14:
step S1401: and splicing all the characteristic vector sequences of the video to be processed and the updated target characteristic vector to obtain an initial characteristic matrix.
The specific implementation manner of step S1401 may refer to the above embodiments, and is not described herein again.
Step S1402: inputting the initial feature matrix into a fusion model, and determining an update matrix of the m-th level through the fusion model based on the adjacent matrix of the m-th level, the feature matrix of the m-th level and the adjustment parameter of the m-th level.
Wherein 1 ≤ m ≤ M, and M is the total number of levels of iterative fusion of the fusion model; if 2 ≤ m ≤ M, the adjacency matrix of the m-th level is determined based on the update matrix of the (m-1)-th level and the adjacency matrix of the (m-1)-th level, the adjacency matrix of level 1 is a preset matrix, and the feature matrix of level 1 is the initial feature matrix.
In some optional embodiments, the fusion model is a graph convolutional network (GCN).
Illustratively, the initial feature matrix is represented as F ∈ R^{S×D}, and the preset matrix is represented as A ∈ R^{S×S}.
The update matrix of the m-th level is U_m = SoftMax[GCN(A_m, F_m)]; for example, U_m = SoftMax[σ(A_m × F_m × w_m)], where the SoftMax is applied along each row (i indexes the rows and j indexes the columns of the matrix);
where U_m ∈ R^{S_m×S_{m+1}}; σ is an activation function; A_m is the adjacency matrix of the m-th level, A_m ∈ R^{S_m×S_m}, and A_m = U_{m-1}^T × A_{m-1} × U_{m-1}; F_m is the feature matrix of the m-th level, F_m ∈ R^{S_m×D_m}; and w_m is the adjustment parameter of the m-th level, w_m ∈ R^{D_m×S_{m+1}}.
Step S1403: determining a product of an inverse matrix of the update matrix of the mth level and the feature matrix of the mth level as a feature matrix of an m +1 th level; wherein the feature matrix of the M-th level is the classification vector.
Exemplary, F m+1 =U m T ×F m ,F m+1 ∈R Sm+1×Dm
Referring to fig. 15, the number of vectors (nodes) in the feature matrix is continuously decreased by updating the matrix until the number of nodes is 1.
Fig. 15 is only an exemplary illustration of the variation of the number of nodes in the feature matrix, and the embodiment is not limited thereto.
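The graph-convolution-based fusion can be sketched as follows. This PyTorch illustration makes several assumptions not fixed by this application (a fully connected preset adjacency matrix, ReLU as the activation σ, and example per-level node counts); it only mirrors the formulas U_m = SoftMax[σ(A_m × F_m × w_m)], F_{m+1} = U_m^T × F_m and A_{m+1} = U_m^T × A_m × U_m given above.

import torch
import torch.nn as nn

class GCNFusion(nn.Module):
    # Iteratively pools the S stacked vectors down to a single classification vector.
    def __init__(self, dim: int, num_nodes: int, level_sizes=(8, 1)):
        super().__init__()
        sizes = [num_nodes, *level_sizes]                  # e.g. S -> 8 -> 1 nodes
        # w_m in R^{D_m x S_{m+1}}: one adjustment parameter per level.
        self.w = nn.ParameterList([
            nn.Parameter(0.02 * torch.randn(dim, sizes[m + 1])) for m in range(len(sizes) - 1)
        ])
        self.register_buffer("adj0", torch.ones(num_nodes, num_nodes))  # preset matrix A_1

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (S, D) initial feature matrix F_1 obtained by splicing all vectors.
        A, F = self.adj0, feats
        for w_m in self.w:
            U = torch.softmax(torch.relu(A @ F @ w_m), dim=-1)   # U_m = SoftMax[GCN(A_m, F_m)]
            F = U.t() @ F                                        # F_{m+1} = U_m^T x F_m
            A = U.t() @ A @ U                                    # A_{m+1} = U_m^T x A_m x U_m
        return F.squeeze(0)                                      # (D,) classification vector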
Correspondingly, the present application provides a seventh video classification method, as shown in fig. 16, the method may include:
step S1601: and extracting features of the video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels.
Step S1602: and for any target feature vector, updating the target feature vector based on the target feature vector adjacent to the target feature vector.
Step S1603: and splicing all the characteristic vector sequences of the video to be processed and the updated target characteristic vector to obtain an initial characteristic matrix.
Step S1604: inputting the initial feature matrix into a fusion model, and determining an update matrix of the m-th level through the fusion model based on the adjacent matrix of the m-th level, the feature matrix of the m-th level and the adjustment parameter of the m-th level.
Step S1605: determining a product of an inverse matrix of the update matrix of the mth level and the feature matrix of the mth level as a feature matrix of an m +1 th level; wherein the feature matrix of the M-th level is the classification vector.
The specific implementation of steps S1601 to S1605 can refer to the above embodiments, and will not be described herein.
In some optional embodiments, the present application provides an eighth video classification method, as shown in fig. 17, the method may include:
step S1701: and extracting features of the video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels.
Step S1702: and for any target feature vector, updating the target feature vector based on the target feature vector adjacent to the target feature vector.
Step S1703: and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
The specific implementation of steps S1701 to S1703 can refer to the above embodiments, and will not be described herein.
Step S1704: determining a video category corresponding to the classification vector of the video to be processed based on a preset corresponding relation; the preset corresponding relation comprises a corresponding relation between a classification vector of the video and a video category.
The classification vector of the video to be processed represents the category of the video to be processed; by presetting the correspondence between classification vectors of videos and video categories, the video category of the video to be processed can be determined from this correspondence.
According to the scheme, the video category corresponding to the classification vector of the video to be processed (namely the category to which the video to be processed belongs) can be accurately and efficiently determined based on the preset corresponding relation.
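The application does not prescribe how the preset correspondence is stored, so the following lookup is only one possible, assumed realisation: each video category is associated with a stored prototype classification vector, and the category whose prototype is most similar to the classification vector of the video to be processed is returned.

import torch
import torch.nn.functional as nnf

def lookup_category(cls_vec: torch.Tensor, prototypes: torch.Tensor, names: list) -> str:
    # prototypes: (num_categories, D) stored classification vectors, one per video category.
    sims = nnf.cosine_similarity(cls_vec.unsqueeze(0), prototypes, dim=-1)
    return names[int(sims.argmax())]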
As shown in fig. 18, based on the same inventive concept, an embodiment of the present application provides a video classification apparatus 1800, including:
a feature extraction module 1801, configured to perform feature extraction on a video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and rank the target feature vectors based on the size of the convolution kernels;
an updating module 1802, configured to, for any target feature vector, update the target feature vector based on a target feature vector adjacent to the target feature vector;
a fusion module 1803, configured to perform feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector, so as to obtain a classification vector representing the category of the video to be processed.
In some alternative embodiments, the update module 1802 is specifically configured to:
and inputting all target characteristic vectors of the video to be processed into an updating model, and updating the target characteristic vectors based on the target characteristic vectors adjacent to any target characteristic vector through the updating model.
In some optional embodiments, the update module 1802 is specifically configured to:
carrying out average pooling operation on the first characteristic vector of the kth level and the second characteristic vector of the kth level through the updating model to obtain an average vector of the kth level; 1 ≤ k ≤ K, and K is the total number of levels of the iterative updating of the updating model; the first feature vector of level 1 is any target feature vector, and the second feature vector of level 1 is the adjacent target feature vector;
performing full-connection layer calculation on the average vector of the kth level to obtain an adjustment vector of the kth level;
determining the sum of the adjustment vector of the k level and the first feature vector of the k level as a first feature vector of a (k + 1) th level; and the first feature vector of the K level is the updated target feature vector.
In some optional embodiments, the update module 1802 is specifically configured to:
performing first full-connected layer calculation on the average vector of the kth level to obtain a first vector of the kth level; performing second full-link layer calculation on the average vector of the kth level to obtain a second vector of the kth level, and performing normalization calculation on the second vector of the kth level to obtain weight information of the kth level;
and obtaining an adjustment vector of the kth level based on the first vector of the kth level and the weight information of the kth level.
In some optional embodiments, the fusion module 1803 is specifically configured to:
splicing all the characteristic vector sequences of the video to be processed and the updated target characteristic vector to obtain an initial characteristic matrix;
inputting the initial characteristic matrix into a fusion model, and performing characteristic fusion on the initial characteristic matrix through the fusion model to obtain a classification vector representing the category of the video to be processed.
In some optional embodiments, the fusion module 1803 is specifically configured to:
determining an m-th level update matrix based on the m-th level adjacency matrix, the m-th level feature matrix and the m-th level adjustment parameter through the fusion model; wherein 1 ≤ m ≤ M, and M is the total number of levels of the iterative fusion of the fusion model; if 2 ≤ m ≤ M, the m-th level adjacency matrix is determined based on the (m-1)-th level update matrix and the (m-1)-th level adjacency matrix, the level-1 adjacency matrix is a preset matrix, and the level-1 feature matrix is the initial feature matrix;
determining a product of the transpose of the update matrix of the mth level and the feature matrix of the mth level as a feature matrix of the (m+1)-th level; wherein the feature matrix of the M-th level is the classification vector.
In some optional embodiments, the feature extraction module 1801 is specifically configured to:
performing feature extraction on the video to be processed by convolution of the convolution kernel aiming at any convolution kernel to obtain a plurality of multidimensional matrixes corresponding to the convolution kernel;
respectively carrying out linear transformation on the multi-dimensional matrixes to obtain the characteristic vector sequence;
and inputting the characteristic vector sequence and a preset vector into an encoder to obtain a target characteristic vector corresponding to the convolution kernel output by the encoder.
Referring to fig. 19, in some alternative embodiments, this embodiment provides another video classification apparatus 1900, which, on the basis of the video classification apparatus 1800, further includes a classification module 1804 configured to:
after the fusion module 1803 obtains the classification vector representing the category of the video to be processed, based on a preset corresponding relationship, determining the video category corresponding to the classification vector of the video to be processed; the preset corresponding relation comprises a corresponding relation between a classification vector of the video and a video category.
Since the apparatus is the apparatus in the method in the embodiment of the present application, and the principle of the apparatus for solving the problem is similar to that of the method, the implementation of the apparatus may refer to the implementation of the method, and repeated descriptions are omitted.
As shown in fig. 20, based on the same inventive concept, an embodiment of the present application provides an electronic device 2000, including: a processor 2001 and memory 2002;
the memory 2002 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 2002 may also be a non-volatile memory (non-volatile memory), such as a read-only memory (rom), a flash memory (flash memory), a hard disk (HDD) or a solid-state drive (SSD); or the memory 2002 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 2002 may be a combination of the above.
Processor 2001, which may include one or more Central Processing Units (CPUs), graphics Processing Units (GPUs), or digital Processing units, among others.
The specific connection medium between the memory 2002 and the processor 2001 is not limited in this embodiment. In fig. 20, the memory 2002 and the processor 2001 are connected by a bus 2003, the bus 2003 is indicated by a thick line in fig. 20, and the bus 2003 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 20, but this is not intended to represent only one bus or type of bus.
Wherein the memory 2002 stores program code that, when executed by the processor 2001, causes the processor 2001 to perform the following:
extracting features of a video to be processed through convolution of different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and sequencing the target feature vectors based on the sizes of the convolution kernels;
for any target feature vector, updating the target feature vector based on the target feature vector adjacent to the target feature vector;
and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
In some alternative embodiments, the processor 2001 specifically performs:
and inputting all target characteristic vectors of the video to be processed into an updating model, and updating the target characteristic vectors based on the target characteristic vectors adjacent to any target characteristic vector through the updating model.
In some alternative embodiments, the processor 2001 specifically performs:
carrying out average pooling operation on the first characteristic vector of the kth level and the second characteristic vector of the kth level through the updating model to obtain an average vector of the kth level; 1 ≤ k ≤ K, and K is the total number of levels of iterative updating of the updating model; the first feature vector of level 1 is any target feature vector, and the second feature vector of level 1 is the adjacent target feature vector;
performing full-connection layer calculation on the average vector of the kth level to obtain an adjustment vector of the kth level;
determining the sum of the adjustment vector of the k level and the first feature vector of the k level as a first feature vector of a (k + 1) th level; and the first feature vector of the K level is the updated target feature vector.
In some alternative embodiments, the processor 2001 performs in particular:
performing first full-link layer calculation on the average vector of the kth level to obtain a first vector of the kth level; performing second full-link layer calculation on the average vector of the kth level to obtain a second vector of the kth level, and performing normalization calculation on the second vector of the kth level to obtain weight information of the kth level;
and obtaining an adjustment vector of the k level based on the first vector of the k level and the weight information of the k level.
In some alternative embodiments, the processor 2001 specifically performs:
splicing all the feature vector sequences of the video to be processed and the updated target feature vector to obtain an initial feature matrix;
inputting the initial characteristic matrix into a fusion model, and performing characteristic fusion on the initial characteristic matrix through the fusion model to obtain a classification vector representing the category of the video to be processed.
In some alternative embodiments, the processor 2001 specifically performs:
determining an m-th level update matrix based on the m-th level adjacency matrix, the m-th level feature matrix and the m-th level adjustment parameter through the fusion model; wherein 1 ≤ m ≤ M, and M is the total number of levels of the iterative fusion of the fusion model; if 2 ≤ m ≤ M, the m-th level adjacency matrix is determined based on the (m-1)-th level update matrix and the (m-1)-th level adjacency matrix, the level-1 adjacency matrix is a preset matrix, and the level-1 feature matrix is the initial feature matrix;
determining a product of the transpose of the update matrix of the mth level and the feature matrix of the mth level as a feature matrix of the (m+1)-th level; wherein the feature matrix of the M-th level is the classification vector.
In some alternative embodiments, the processor 2001 specifically performs:
performing feature extraction on the video to be processed by convolution of the convolution kernel aiming at any convolution kernel to obtain a plurality of multidimensional matrixes corresponding to the convolution kernel;
respectively carrying out linear transformation on the multi-dimensional matrixes to obtain the characteristic vector sequence;
and inputting the characteristic vector sequence and a preset vector into an encoder to obtain a target characteristic vector corresponding to the convolution kernel output by the encoder.
In some optional embodiments, the processor 2001 further performs, after obtaining the classification vector characterizing the to-be-processed video category:
determining a video category corresponding to the classification vector of the video to be processed based on a preset corresponding relation; the preset corresponding relation comprises a corresponding relation between a video classification vector and a video category.
Since the electronic device is an electronic device for executing the method in the embodiment of the present application, and the principle of the electronic device for solving the problem is similar to that of the method, reference may be made to the implementation of the method for the implementation of the electronic device, and repeated details are not described again.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the video classification method as described above. The readable storage medium may be a nonvolatile readable storage medium, among others.
The present application is described above with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the application. It will be understood that one block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Accordingly, the present application may also be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Furthermore, the present application may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
While the preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for video classification, the method comprising:
extracting features of a video to be processed through convolution with different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and ordering the target feature vectors based on the sizes of the convolution kernels;
for any target feature vector, updating the target feature vector based on a target feature vector adjacent to the target feature vector;
and performing feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
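Read end to end, claim 1 describes the data flow sketched below in Python. The functions are toy stand-ins that only show shapes and ordering; in the claimed method each stage is a learned model (convolutional extraction, an update model in claim 2, a fusion model in claim 5), so none of the arithmetic here reflects the actual computations.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract(video, kernel_size):
    # Toy stand-in: one feature vector sequence and one target feature
    # vector per convolution kernel (shapes are illustrative).
    seq = rng.standard_normal((kernel_size * 4, 16))
    return seq, seq.mean(axis=0)

def update_with_neighbors(targets):
    # Each target feature vector is nudged toward the mean of the target
    # vectors adjacent to it in the kernel-size ordering.
    out = []
    for i, t in enumerate(targets):
        nbrs = [targets[j] for j in (i - 1, i + 1) if 0 <= j < len(targets)]
        out.append(t + 0.5 * (np.mean(nbrs, axis=0) - t))
    return out

def fuse(sequences, targets):
    # Toy fusion: stack everything into an initial feature matrix and pool.
    return np.vstack(sequences + targets).mean(axis=0)

video = None                                   # placeholder input
kernel_sizes = sorted([3, 5, 7])               # ordering by kernel size
pairs = [extract(video, k) for k in kernel_sizes]
sequences = [s for s, _ in pairs]
targets = update_with_neighbors([t for _, t in pairs])
classification_vector = fuse(sequences, [t[None, :] for t in targets])
print(classification_vector.shape)             # (16,)
```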
2. The method of claim 1, wherein, for any target feature vector, updating the target feature vector based on a target feature vector adjacent to the target feature vector comprises:
inputting all target feature vectors of the video to be processed into an update model, and updating, through the update model, each target feature vector based on the target feature vectors adjacent to it.
3. The method of claim 2, wherein updating, through the update model, the target feature vector based on the target feature vectors adjacent to any target feature vector comprises:
performing an average pooling operation on the first feature vector of the k-th level and the second feature vector of the k-th level through the update model to obtain an average vector of the k-th level; wherein k is greater than or equal to 1 and less than or equal to K, and K is the total number of levels of iterative updating of the update model; the first feature vector of the 1st level is any one of the target feature vectors, and the second feature vector of the 1st level is a target feature vector adjacent to that target feature vector;
performing a fully-connected layer calculation on the average vector of the k-th level to obtain an adjustment vector of the k-th level;
and determining the sum of the adjustment vector of the k-th level and the first feature vector of the k-th level as the first feature vector of the (k+1)-th level; wherein the first feature vector of the K-th level is the updated target feature vector.
4. The method of claim 3, wherein performing a fully-connected layer calculation on the average vector of the k-th level to obtain the adjustment vector of the k-th level comprises:
performing a first fully-connected layer calculation on the average vector of the k-th level to obtain a first vector of the k-th level; performing a second fully-connected layer calculation on the average vector of the k-th level to obtain a second vector of the k-th level, and performing a normalization calculation on the second vector of the k-th level to obtain weight information of the k-th level;
and obtaining the adjustment vector of the k-th level based on the first vector of the k-th level and the weight information of the k-th level.
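Claims 3 and 4 together describe a K-level refinement of one target feature vector against an adjacent one. The PyTorch sketch below holds the adjacent (second) feature vector fixed across levels, uses softmax as the normalization, and combines the first vector with the weight information by element-wise multiplication; all three choices are assumptions, since the claims do not fix them.

```python
import torch
import torch.nn as nn

class UpdateModel(nn.Module):
    """Sketch of the K-level update of claims 3 and 4 (assumptions noted above)."""

    def __init__(self, dim=256, num_levels=3):
        super().__init__()
        self.fc1 = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_levels)])
        self.fc2 = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_levels)])

    def forward(self, target, neighbor):        # (B, dim) each
        first = target                          # level-1 first feature vector
        for fc1, fc2 in zip(self.fc1, self.fc2):
            # Average pooling of the first and second feature vectors.
            avg = (first + neighbor) / 2
            # First fully-connected calculation -> first vector of this level.
            v1 = fc1(avg)
            # Second fully-connected calculation + normalization -> weights.
            w = torch.softmax(fc2(avg), dim=-1)
            # Adjustment vector (assumed: element-wise product) added to the
            # first feature vector gives the next level's first feature vector.
            first = first + v1 * w
        return first                            # updated target feature vector

model = UpdateModel()
print(model(torch.randn(4, 256), torch.randn(4, 256)).shape)   # torch.Size([4, 256])
```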
5. The method according to claim 1, wherein performing feature fusion on all feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed comprises:
concatenating all the feature vector sequences of the video to be processed and the updated target feature vector to obtain an initial feature matrix;
and inputting the initial feature matrix into a fusion model, and performing feature fusion on the initial feature matrix through the fusion model to obtain the classification vector representing the category of the video to be processed.
6. The method of claim 5, wherein performing feature fusion on the initial feature matrix through the fusion model comprises:
determining, through the fusion model, an m-th level update matrix based on the m-th level adjacency matrix, the m-th level feature matrix and the m-th level adjustment parameter; wherein m is greater than or equal to 1 and less than or equal to M, and M is the total number of levels of iterative fusion of the fusion model; if m is greater than or equal to 2 and less than or equal to M, the m-th level adjacency matrix is determined based on the (m-1)-th level update matrix and the (m-1)-th level adjacency matrix; the 1st-level adjacency matrix is a preset matrix, and the 1st-level feature matrix is the initial feature matrix;
and determining the product of the inverse of the m-th level update matrix and the m-th level feature matrix as the (m+1)-th level feature matrix; wherein the M-th level feature matrix is the classification vector.
7. The method of claim 1, wherein extracting features of the video to be processed through convolution with different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel comprises:
for any convolution kernel, performing feature extraction on the video to be processed through convolution with that convolution kernel to obtain a plurality of multi-dimensional matrices corresponding to the convolution kernel;
performing a linear transformation on each of the multi-dimensional matrices to obtain the feature vector sequence;
and inputting the feature vector sequence and a preset vector into an encoder to obtain the target feature vector that the encoder outputs for the convolution kernel.
8. The method according to any one of claims 1 to 7, further comprising, after obtaining the classification vector characterizing the category of the video to be processed:
determining the video category corresponding to the classification vector of the video to be processed based on a preset correspondence; the preset correspondence comprises a correspondence between video classification vectors and video categories.
9. A video classification apparatus, characterized in that the apparatus comprises:
the feature extraction module is configured to extract features of the video to be processed through convolution with different convolution kernels to obtain a feature vector sequence and a target feature vector corresponding to each convolution kernel, and to order the target feature vectors based on the sizes of the convolution kernels;
the update module is configured to, for any target feature vector, update the target feature vector based on a target feature vector adjacent to the target feature vector;
and the fusion module is configured to perform feature fusion on all the feature vector sequences of the video to be processed and the updated target feature vector to obtain a classification vector representing the category of the video to be processed.
10. The apparatus of claim 9, wherein the update module is specifically configured to:
input all target feature vectors of the video to be processed into an update model, and update, through the update model, each target feature vector based on the target feature vectors adjacent to it.
CN202210778969.6A 2022-06-30 2022-06-30 Video classification method and device Pending CN115223079A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210778969.6A CN115223079A (en) 2022-06-30 2022-06-30 Video classification method and device
PCT/CN2022/143819 WO2024001139A1 (en) 2022-06-30 2022-12-30 Video classification method and apparatus and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210778969.6A CN115223079A (en) 2022-06-30 2022-06-30 Video classification method and device

Publications (1)

Publication Number Publication Date
CN115223079A true CN115223079A (en) 2022-10-21

Family

ID=83609668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210778969.6A Pending CN115223079A (en) 2022-06-30 2022-06-30 Video classification method and device

Country Status (2)

Country Link
CN (1) CN115223079A (en)
WO (1) WO2024001139A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024001139A1 (en) * 2022-06-30 2024-01-04 海信集团控股股份有限公司 Video classification method and apparatus and electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859023B (en) * 2020-06-11 2024-05-03 中国科学院深圳先进技术研究院 Video classification method, apparatus, device and computer readable storage medium
CN112232164A (en) * 2020-10-10 2021-01-15 腾讯科技(深圳)有限公司 Video classification method and device
US20210110198A1 (en) * 2020-12-22 2021-04-15 Intel Corporation Methods, apparatus, and articles of manufacture for interactive image segmentation
CN115223079A (en) * 2022-06-30 2022-10-21 海信集团控股股份有限公司 Video classification method and device

Also Published As

Publication number Publication date
WO2024001139A1 (en) 2024-01-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination