CN110096617B - Video classification method and device, electronic equipment and computer-readable storage medium - Google Patents

Video classification method and device, electronic equipment and computer-readable storage medium Download PDF

Info

Publication number
CN110096617B
CN110096617B CN201910357559.2A
Authority
CN
China
Prior art keywords
feature
video
sequence
target
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910357559.2A
Other languages
Chinese (zh)
Other versions
CN110096617A (en)
Inventor
龙翔
何栋梁
李甫
迟至真
周志超
赵翔
李鑫
文石磊
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910357559.2A priority Critical patent/CN110096617B/en
Publication of CN110096617A publication Critical patent/CN110096617A/en
Application granted granted Critical
Publication of CN110096617B publication Critical patent/CN110096617B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides a video classification method, a video classification device, electronic equipment and a computer-readable storage medium. The method comprises the following steps: obtaining a first feature sequence of a video to be classified, wherein the features in the first feature sequence are arranged in time order; inputting the first feature sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network; obtaining a target vector according to the first output result; and classifying the video to be classified according to the target vector. Compared with the prior art, the embodiment of the invention can effectively improve video classification efficiency, and the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.

Description

Video classification method and device, electronic equipment and computer-readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of video classification, in particular to a video classification method, a video classification device, electronic equipment and a computer-readable storage medium.
Background
Video classification is one of the most important and basic tasks in computer vision: by analyzing and understanding the relevant information of a video, the video is classified into pre-defined categories. Video classification plays a key role in application scenarios such as video search and video recommendation, and is also an important foundation for video technologies such as video tagging, video monitoring and video title generation.
At present, a common video classification method is to directly input all frames of a video into a device for video classification and obtain the classification result output by the device. In this way, all frames of the video need to be analyzed, so the classification efficiency is very low.
Disclosure of Invention
The embodiment of the invention provides a video classification method, a video classification device, electronic equipment and a computer readable storage medium, and aims to solve the problem of low classification efficiency of the existing video classification mode.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a video classification method, where the method includes:
obtaining a first characteristic sequence of a video to be classified; wherein the features in the first feature sequence are arranged in chronological order;
inputting the first characteristic sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
obtaining a target vector according to the first output result;
and classifying the video to be classified according to the target vector.
In a second aspect, an embodiment of the present invention provides a video classification apparatus, where the apparatus includes:
the first obtaining module is used for obtaining a first characteristic sequence of the video to be classified; wherein the features in the first feature sequence are arranged in chronological order;
a second obtaining module, configured to input the first feature sequence into a target pyramid attention network, and obtain a first output result output by the target pyramid attention network;
a third obtaining module, configured to obtain a target vector according to the first output result;
and the classification module is used for classifying the video to be classified according to the target vector.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the video classification method described above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the video classification method described above.
In the embodiment of the invention, in order to classify a video, the first feature sequence of the video to be classified can be input into the target pyramid attention network, a target vector is then obtained according to the first output result output by the target pyramid attention network, and finally the video to be classified can be classified according to the target vector. Therefore, in the embodiment of the invention, video classification can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified need to be analyzed, this effectively improves classification efficiency; moreover, the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a flowchart of a video classification method provided by an embodiment of the present invention;
FIG. 2 is a first schematic diagram of a video classification method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a usage sequence of a charging plug;
FIG. 4 is a second schematic diagram of a video classification method according to an embodiment of the present invention;
FIG. 5 is a third schematic diagram of a video classification method according to an embodiment of the present invention;
FIG. 6 is a block diagram of a video classification apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, a video classification method provided by an embodiment of the present invention is described below.
It should be noted that the video classification method provided by the embodiment of the present invention is applied to electronic devices. Specifically, the electronic device may be a server; of course, the type of electronic device is not limited to a server, and the electronic device may also be another type of device that can be used for video classification.
Referring to fig. 1, a flowchart of a video classification method according to an embodiment of the present invention is shown. As shown in fig. 1, the method comprises the steps of:
Step 101: obtaining a first feature sequence of a video to be classified; wherein the features in the first feature sequence are arranged in chronological order.
In step 101, the electronic device may extract key features of the video by using a model based on a Convolutional Neural Network (CNN) to obtain the first feature sequence of the video to be classified, where the features in the first feature sequence may be arranged in time order from earliest to latest. It is understood that a CNN is a class of feedforward neural networks (FNN) that involve convolution computations and have a deep structure, and CNNs are among the representative algorithms of deep learning.
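Purely as an illustration of this step, the sketch below builds a first feature sequence from sampled frames with an off-the-shelf CNN backbone; the choice of ResNet-18, the frame count and the helper name extract_feature_sequence are assumptions for illustration and are not prescribed by this embodiment.

# Hypothetical sketch: building a first feature sequence from sampled video frames
# with a generic CNN backbone (ResNet-18 is only an example choice; randomly
# initialized here, whereas a pretrained backbone would be used in practice).
import torch
import torch.nn as nn
from torchvision import models

def extract_feature_sequence(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) tensor of T sampled frames, already normalized.
    Returns a (T, C) feature sequence ordered by time."""
    backbone = models.resnet18()          # any CNN feature extractor
    backbone.fc = nn.Identity()           # drop the classification head
    backbone.eval()
    with torch.no_grad():
        features = backbone(frames)       # (T, 512) for ResNet-18
    return features                       # features[t] corresponds to frame t

# Example: 8 sampled frames at 224x224 -> an 8-step feature sequence
x = torch.randn(8, 3, 224, 224)
first_feature_sequence = extract_feature_sequence(x)
print(first_feature_sequence.shape)       # torch.Size([8, 512])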
Step 102: inputting the first feature sequence into the target pyramid attention network to obtain a first output result output by the target pyramid attention network.
Here, only one type of pyramid attention network may be included in the target pyramid attention network, e.g., only a temporal pyramid attention network or a channel pyramid attention network may be included in the target pyramid attention network; alternatively, at least two types of pyramid attention networks may be included in the target pyramid attention network, for example, both the temporal pyramid attention network and the channel pyramid attention network may be included in the target pyramid attention network.
If the target pyramid attention network includes at least two types of pyramid attention networks, in step 102, the first feature sequence may be respectively input into each type of pyramid attention network to respectively obtain a first output result output by each type of pyramid attention network, and subsequent step 103 may be performed according to the first output result of each type of pyramid attention network.
Step 103: obtaining a target vector according to the first output result.
Here, the target vector is a vector that can represent the features of the entire video to be classified. It should be noted that there are various specific implementations for obtaining the target vector according to the first output result; for clarity of layout, examples are described below.
Step 104: classifying the video to be classified according to the target vector.
It should be noted that the embodiment of the present invention may involve K video categories, denoted in turn as B_1, B_2, ..., B_K, where K is an integer greater than 1. After the video to be classified is classified according to the target vector, the classification result obtained by the electronic device may include K probability values, denoted in turn as G_1, G_2, ..., G_K, where G_1 is the probability that the video to be classified belongs to category B_1, G_2 is the probability that it belongs to category B_2, ..., and G_K is the probability that it belongs to category B_K.
In step 104, if single-label classification is performed on the video to be classified, the sum of the K probability values G_1, G_2, ..., G_K is 1; if multi-label classification is performed, the sum of the K probability values G_1, G_2, ..., G_K may or may not be 1.
In the embodiment of the invention, in order to classify a video, the first feature sequence of the video to be classified can be input into the target pyramid attention network, a target vector is then obtained according to the first output result output by the target pyramid attention network, and finally the video to be classified can be classified according to the target vector. Therefore, in the embodiment of the invention, video classification can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified need to be analyzed, this effectively improves classification efficiency; moreover, the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.
Optionally, classifying the video to be classified according to the target vector, including:
and inputting the target vector into the full-connection network to obtain a classification result of the video to be classified output by the full-connection network.
Here, the full-connection network may be regarded as a pre-trained classification model stored locally in the electronic device; the classification model may be obtained by training with the target vectors of a large number of videos as input and the categories of those videos as output. Specifically, the classification model may be trained by the electronic device itself; alternatively, the classification model may be trained by another device and then delivered to the electronic device.
In this embodiment, the classification result of the video to be classified can be obtained simply by inputting the target vector into the full-connection network, so the operation of obtaining the classification result is very convenient to implement.
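As a minimal sketch only, a fully connected classification head over the target vector might look as follows; the dimensions, the single_label switch, and the use of softmax versus sigmoid to match the single-label and multi-label cases discussed above are illustrative assumptions rather than the prescribed implementation.

# Hypothetical sketch of the full-connection classification head.
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, target_dim: int, num_classes: int, single_label: bool = True):
        super().__init__()
        self.fc = nn.Linear(target_dim, num_classes)    # maps target vector to K logits
        self.single_label = single_label

    def forward(self, target_vector: torch.Tensor) -> torch.Tensor:
        logits = self.fc(target_vector)                  # (batch, K)
        if self.single_label:
            return torch.softmax(logits, dim=-1)         # K probabilities summing to 1
        return torch.sigmoid(logits)                     # K independent probabilities

# Example: a 1024-d target vector classified into K=5 categories
clf = VideoClassifier(target_dim=1024, num_classes=5, single_label=True)
probs = clf(torch.randn(2, 1024))
print(probs.sum(dim=-1))                                 # ~1.0 per video in the single-label case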
Optionally, the target pyramid attention network is a temporal pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequences according to the corresponding time scales, the second feature sequences in each feature sequence set are arranged according to the time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer larger than 1.
Here, the value of M may be 2, 3, 4, 5, 6, or an integer greater than 6, which is not listed here. In addition, the number of second feature sequences in each feature sequence set may be different due to the different time scales of the feature sequence sets.
Suppose the first feature sequence of the video to be classified is X^(1) in FIG. 2, which includes features x_1, x_2, x_3, x_4, x_5, x_6, x_7 and x_8 arranged in time order. After X^(1) is input into the temporal pyramid attention network, the first output result output by the temporal pyramid attention network may include 3 feature sequence sets. That is, M takes the value 3, and the pyramid of the temporal pyramid attention network can be considered to have 3 levels, such as level 1, level 2 and level 3 in FIG. 2, where level 1, level 2 and level 3 respectively correspond to feature sequence sets of different time scales.
Specifically, the feature sequence set corresponding to level 1 may consist of a single second feature sequence, with X^(1) itself serving as this second feature sequence. The feature sequence set corresponding to level 2 may consist of the two second feature sequences X^(2)_1 and X^(2)_2, where X^(2)_1 includes x_1, x_2, x_3 and x_4 arranged in time order, and X^(2)_2 includes x_5, x_6, x_7 and x_8 arranged in time order. The feature sequence set corresponding to level 3 may consist of the four second feature sequences X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4, where X^(3)_1 includes x_1 and x_2, X^(3)_2 includes x_3 and x_4, X^(3)_3 includes x_5 and x_6, and X^(3)_4 includes x_7 and x_8, each arranged in time order.
It is easy to see that the feature sequence set corresponding to level 1 is obtained by keeping X^(1) as a single temporal segment, the feature sequence set corresponding to level 2 is obtained by dividing X^(1) into two equal temporal parts, and the feature sequence set corresponding to level 3 is obtained by dividing X^(1) into four equal temporal parts.
In this way, the electronic device can obtain a first output result including the feature sequence set corresponding to level 1, the feature sequence set corresponding to level 2, and the feature sequence set corresponding to level 3. Then, a target vector can be obtained according to the first output result, and classification of the video to be classified is realized according to the target vector.
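For illustration, the temporal division described above (one segment at level 1, two at level 2, four at level 3) can be sketched as follows; splitting level m into 2^(m-1) equal parts is an assumption that matches the 8-feature example of FIG. 2 rather than a mandated rule.

# Hypothetical sketch: dividing a first feature sequence (T, C) into M temporal levels,
# where level m holds 2**(m-1) equal-length second feature sequences.
import torch

def temporal_pyramid_split(features: torch.Tensor, num_levels: int):
    """features: (T, C) first feature sequence in time order.
    Returns a list of M feature sequence sets; set m is a list of second feature sequences."""
    levels = []
    for m in range(1, num_levels + 1):
        num_segments = 2 ** (m - 1)
        levels.append(list(torch.chunk(features, num_segments, dim=0)))
    return levels

x = torch.randn(8, 512)                     # x_1 ... x_8, as in the FIG. 2 example
pyramid = temporal_pyramid_split(x, num_levels=3)
print([len(level) for level in pyramid])    # [1, 2, 4]
print(pyramid[2][0].shape)                  # torch.Size([2, 512]) -> X^(3)_1 = (x_1, x_2)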
It should be noted that, when classifying videos, if the chronological order of the video is not considered at all, all features are put into an unordered set, all key features are treated equally within one group, and the chronological association between features is ignored completely; this is effective in some scenarios but not in others. For example, as shown in FIG. 3, if all the key features are treated out of order in one group, it is impossible to distinguish whether the user is inserting the charging plug into the outlet or pulling it out.
In view of this, in this embodiment, a temporal pyramid attention network may be used: the first feature sequence of the video to be classified is first divided into a plurality of second feature sequences under a plurality of time scales, yielding feature sequence sets of the plurality of time scales; the second feature sequences in each feature sequence set are arranged in time order, and the features in each second feature sequence are also arranged in time order. In this way, temporal order is introduced into an otherwise unordered attention mechanism, which effectively addresses video classification problems with strong temporal dependency. It can be seen that this embodiment is suitable not only for video classification in weakly time-dependent scenarios but also for video classification in strongly time-dependent scenarios.
Optionally, the larger the value of M, the longer the video duration of the video to be classified.
Specifically, the electronic device may pre-store a correspondence between video duration ranges and values of M; for example, the duration range of 10 to 15 minutes may correspond to a value of 5, the duration range of 5 to 10 minutes to a value of 4, and the duration range of 0 to 5 minutes to a value of 3.
Then, in the case that the video duration of the video to be classified is within the video duration range of 10 minutes to 15 minutes, the first output result may include 5 feature sequence sets with different time scales, and at this time, the temporal pyramid of the temporal pyramid attention network has a level of 5. Under the condition that the video duration of the video to be classified is within the video duration range of 0 minute to 5 minutes, the first output result may include 3 feature sequence sets with different time scales, and at this time, the time pyramid of the time pyramid attention network has a level of 3.
As can be seen, in this embodiment, when the time pyramid attention network is used, the hierarchy of the time pyramid is not completely fixed, and the hierarchy of the time pyramid can be flexibly adjusted according to the duration of the video to be classified, so that the hierarchy of the time pyramid is matched with the duration of the video to be classified, thereby ensuring the classification efficiency and the classification effect.
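As a toy illustration of such a pre-stored correspondence, the lookup below simply restates the example ranges given above; the exact boundaries and the fallback value are assumptions.

# Hypothetical sketch of a pre-stored correspondence between video duration and M
# (the same idea could be reused for N in the channel pyramid).
def pyramid_levels_for_duration(duration_minutes: float) -> int:
    # (lower bound inclusive, upper bound exclusive, value of M) -- illustrative table only
    table = [(0, 5, 3), (5, 10, 4), (10, 15, 5)]
    for low, high, levels in table:
        if low <= duration_minutes < high:
            return levels
    return 5  # assumed fallback for videos longer than the last range

print(pyramid_levels_for_duration(3.0))    # 3
print(pyramid_levels_for_duration(12.0))   # 5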
Optionally, obtaining a target vector according to the first output result includes:
inputting each second feature sequence in the first output result into the channel pyramid attention network respectively to obtain a second output result which is output by the channel pyramid attention network respectively and corresponds to each second feature sequence;
obtaining a target vector according to a second output result corresponding to each second characteristic sequence;
the second output result corresponding to any second feature sequence comprises N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing a second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer larger than 1.
Here, N may be 2, 3, 4, 5, 6, or an integer greater than 6, which is not listed here. In addition, the value of M and the value of N can be the same or different.
Assuming the channel pyramid attention network is denoted CPAtt, as shown in FIG. 2, after obtaining the first output result containing the 7 second feature sequences X^(1), X^(2)_1, X^(2)_2, X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4, the electronic device can input these 7 second feature sequences into CPAtt respectively to obtain the 7 corresponding second output results output by CPAtt.
Assume that one of the 7 second feature sequences is also represented as X^(1) in FIG. 4, and that X^(1) includes features x_1, x_2, ..., x_L arranged in time order. After X^(1) is input into the channel pyramid attention network, the second output result output by the channel pyramid attention network may include 3 sub-feature sequence sets with different feature fine granularities. That is, N takes the value 3, and the pyramid of the channel pyramid attention network can be considered to have 3 levels, such as level 1, level 2 and level 3 in FIG. 4, where level 1, level 2 and level 3 respectively correspond to sub-feature sequence sets of different feature fine granularities.
Specifically, the sub-feature sequence set corresponding to level 1 consists of a single sub-feature sequence, with X^(1) itself serving as this sub-feature sequence. The sub-feature sequence set corresponding to level 2 may consist of the two sub-feature sequences X^(2)_1 and X^(2)_2, where X^(2)_1 includes one of the two sub-features obtained by splitting x_1, one of the two sub-features obtained by splitting x_2, ..., and one of the two sub-features obtained by splitting x_L, and X^(2)_2 includes the other sub-feature obtained by splitting x_1, the other obtained by splitting x_2, ..., and the other obtained by splitting x_L. The sub-feature sequence set corresponding to level 3 may consist of the four sub-feature sequences X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4, where X^(3)_1 includes the first of the four sub-features obtained by splitting x_1, the first of the four obtained by splitting x_2, ..., and the first of the four obtained by splitting x_L; X^(3)_2 includes the second of the four sub-features obtained by splitting x_1, the second obtained by splitting x_2, ..., and the second obtained by splitting x_L; the contents of X^(3)_3 and X^(3)_4 follow by analogy and are not detailed here.
It should be noted that, the content included in the second output result corresponding to the other second feature sequences may refer to the above description, and is not described herein again. And then, obtaining a target vector according to the second output result corresponding to each second characteristic sequence.
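A minimal sketch of this channel-wise division is shown below, assuming each feature vector's channels are split into 2^(n-1) contiguous equal parts at level n; the contiguous-split choice mirrors the FIG. 4 example and is only an assumption.

# Hypothetical sketch: dividing each feature of a second feature sequence (L, C)
# into N feature fine granularities; level n yields 2**(n-1) sub-feature sequences.
import torch

def channel_pyramid_split(second_sequence: torch.Tensor, num_levels: int):
    """second_sequence: (L, C). Returns a list of N sets; set n contains 2**(n-1)
    sub-feature sequences of shape (L, C // 2**(n-1)), each still in time order."""
    levels = []
    for n in range(1, num_levels + 1):
        num_parts = 2 ** (n - 1)
        # split along the channel dimension, keeping the time dimension intact
        levels.append(list(torch.chunk(second_sequence, num_parts, dim=1)))
    return levels

seq = torch.randn(4, 512)                        # L = 4 features, C = 512 channels
sets = channel_pyramid_split(seq, num_levels=3)
print([len(s) for s in sets])                    # [1, 2, 4]
print(sets[1][0].shape, sets[2][0].shape)        # (4, 256) and (4, 128)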
In a specific embodiment, any one of the second output results further includes a weight corresponding to each sub-feature in each sub-feature sequence included in the second output result;
obtaining a target vector according to a second output result corresponding to each second feature sequence, including:
for each sub-feature sequence in each second output result, performing weighted summation according to each sub-feature and corresponding weight in the sub-feature sequence to obtain a corresponding feature vector;
performing splicing operation according to the feature vectors corresponding to all the sub-feature sequences to obtain spliced vectors;
and taking the splicing vector as a target vector.
Specifically, for the sub-feature sequence X^(2)_1 described above, assume that it includes, in order, the sub-features x_11, x_21, ..., x_L1, where x_11, x_21, ..., x_L1 are all in vector form, the weight corresponding to x_11 is z_1, the weight corresponding to x_21 is z_2, ..., and the weight corresponding to x_L1 is z_L. Then the feature vector y corresponding to X^(2)_1 can be calculated by the following formula:
y = x_11 · z_1 + x_21 · z_2 + ... + x_L1 · z_L
Note that the feature vectors corresponding to the other sub-feature sequences are calculated in the same manner as described above for X^(2)_1, which is not repeated here. After the feature vectors corresponding to all the sub-feature sequences have been obtained, a splicing operation may be performed on these feature vectors to obtain a spliced vector used as the target vector.
It should be noted that Att in FIG. 4 can be regarded as the feature vector calculation operation, and the Contact operation in FIG. 2 and FIG. 4 can be regarded as a vector splicing (concatenation) operation. As shown in FIG. 4, a splicing operation can be performed on the feature vector corresponding to X^(2)_1 and the feature vector corresponding to X^(2)_2 to obtain a first spliced vector, e.g. y^(2) in FIG. 4; a splicing operation can be performed on the feature vectors corresponding to X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4 to obtain a second spliced vector, e.g. y^(3) in FIG. 4. Then, a splicing operation is performed on the feature vector corresponding to X^(1) (e.g. y^(1) in FIG. 4), the first spliced vector and the second spliced vector to obtain a third spliced vector, which corresponds to one of the 7 second feature sequences.
Then, in a similar manner to the above procedure, the 6 third spliced vectors corresponding to the other 6 second feature sequences can be obtained; that is, 7 third spliced vectors corresponding to the 7 second feature sequences X^(1), X^(2)_1, X^(2)_2, X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4 are finally obtained. At this point, as shown in FIG. 2, a splicing operation can be performed on the third spliced vector corresponding to X^(2)_1 and the third spliced vector corresponding to X^(2)_2 to obtain a fourth spliced vector, e.g. y^(2) in FIG. 2; a splicing operation can be performed on the third spliced vectors corresponding to X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4 to obtain a fifth spliced vector, e.g. y^(3) in FIG. 2. Then, a splicing operation is performed on the third spliced vector corresponding to X^(1) (e.g. y^(1) in FIG. 2), the fourth spliced vector and the fifth spliced vector to obtain the spliced vector used as the target vector.
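To make the weighting and splicing procedure concrete, the sketch below computes y = x_11·z_1 + ... + x_L1·z_L for each sub-feature sequence and then concatenates the resulting vectors; the way the weights z are produced here (a learned linear score followed by a softmax over time) is an assumption for illustration, since the embodiment only requires that each sub-feature have a corresponding weight.

# Hypothetical sketch: attention-weighted pooling of each sub-feature sequence,
# followed by the splicing (concatenation) that builds the target vector.
import torch
import torch.nn as nn

class SubSequenceAttention(nn.Module):
    """Computes one feature vector y = sum_t z_t * x_t for a sub-feature sequence (L, D)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)                         # assumed way of producing weights z

    def forward(self, sub_sequence: torch.Tensor) -> torch.Tensor:
        z = torch.softmax(self.score(sub_sequence), dim=0)     # (L, 1) weights over time
        return (z * sub_sequence).sum(dim=0)                   # (D,) weighted sum

# Example: weight and splice three sub-feature sequences into one vector
att64, att32 = SubSequenceAttention(64), SubSequenceAttention(32)
subs = [torch.randn(4, 64), torch.randn(4, 32), torch.randn(4, 32)]
vectors = [att64(subs[0]), att32(subs[1]), att32(subs[2])]
target_vector = torch.cat(vectors, dim=0)                      # splicing operation
print(target_vector.shape)                                     # torch.Size([128])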
It should be noted that, when classifying a video, the electronic device could directly compute one weight per feature and use these weights for video classification; however, in many cases only some of the channels in the video to be classified contribute to classification. For example, as shown in FIG. 5, the video to be classified may include two video frames, Frame1 and Frame2, both of which contribute to video classification, but the important channels in the two frames are significantly different: the important channels of Frame1 correspond to the region enclosed by rectangular box 510, and the important channels of Frame2 correspond to the region enclosed by rectangular box 520. On this basis, if a weight is assigned to the entire feature of each of the two video frames, as shown in the lower left corner of FIG. 5, only relatively balanced weights can be provided for the two frames; for example, the weights assigned to Feature1 and Feature2 may both be 0.5, so the weight of the irrelevant noise is also 0.5, and the important channels of both features become weaker after weighted averaging, which may lower the accuracy of video classification.
In view of this, in this embodiment, a channel pyramid attention network may be used to gradually divide each feature, from coarse to fine, into a plurality of sub-features and to assign a corresponding weight to each sub-feature. In this way, as shown in the lower right corner of FIG. 5, the weight of the significant portion of each feature may be set to 1.0 and the weight of the insignificant portion to 0.0; for example, the weight of the sub-feature in the upper half of Feature1 may be set to 1.0 and the weight of the sub-feature in its lower half to 0.0, while the weight of the sub-feature in the upper half of Feature2 may be set to 0.0 and the weight of the sub-feature in its lower half to 1.0. After the subsequent weighting operation, the important channel information is completely preserved, so a more accurate classification result can be obtained. Therefore, in this embodiment, using the channel pyramid attention network can effectively ensure the accuracy of the classification result.
Optionally, the larger the value of N, the longer the video duration of the video to be classified.
Specifically, the electronic device may pre-store a correspondence between video duration ranges and values of N; for example, the duration range of 10 to 15 minutes may correspond to a value of 5, the duration range of 5 to 10 minutes to a value of 4, and the duration range of 0 to 5 minutes to a value of 3.
Then, under the condition that the video duration of the video to be classified is within the video duration range of 10 minutes to 15 minutes, each second output result may include 5 sub-feature sequence sets with different feature fine granularities, and at this time, the level of the channel pyramid attention network is 5 levels. Under the condition that the video duration of the video to be classified is within the video duration range of 0 minute to 5 minutes, each second output result may include 3 sub-feature sequence sets with different feature fine granularities, and at this time, the level of the channel pyramid attention network is 3 levels.
It can be seen that, in this embodiment, when the channel pyramid attention network is used, the level of the channel pyramid is not completely fixed and unchanged, and the level of the channel pyramid can be flexibly adjusted according to the length of the video duration of the video to be classified, so that the level of the channel pyramid is matched with the video duration of the video to be classified, and thus the classification efficiency and the classification effect are ensured.
Optionally, the number of the first feature sequences is at least two, and the feature types corresponding to each first feature sequence are different.
Here, the number of the first feature sequences may be two, three, four or more, and is not listed here.
In one embodiment, the at least two first feature sequences may include a first target feature sequence, a second target feature sequence, and a third target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
In another embodiment, the at least two first feature sequences may include only the first target feature sequence and the second target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is any one of an image feature type, an optical flow feature type and a voice feature type; the feature type corresponding to the second target feature sequence is any one of an image feature type, an optical flow feature type and a voice feature type.
It should be noted that the first feature sequences with different feature types can be regarded as different modal features of the video to be classified, and multi-modal fusion can be realized by classifying the video by using at least two first feature sequences, so that the robustness and the precision of classification are improved.
Therefore, in this embodiment, based on the multi-modal features of the video to be classified, the video may be classified by using two pyramid attention networks, namely the temporal pyramid attention network and the channel pyramid attention network. Specifically, key features such as image features, optical flow features and voice features of the video are first extracted using a convolutional-neural-network-based model; the first feature sequence of each feature type is then passed through the temporal pyramid attention network and the channel pyramid attention network in turn; the features of the various feature types are then concatenated and fused to obtain a target vector representing the features of the entire video to be classified; finally, the video to be classified is classified through the full-connection network to obtain the probability that it belongs to each category, thereby realizing video classification.
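Putting these pieces together, an end-to-end pass might look like the self-contained sketch below; the helper pyramid_chunks, the stand-in attention weights, the dimensions and the two example modalities are all illustrative assumptions, while the order of operations (temporal pyramid, then channel pyramid, then weighted pooling, then concatenation, then the full-connection network) follows the description above.

# Hypothetical end-to-end sketch of the pipeline described above:
# feature sequences per modality -> temporal pyramid -> channel pyramid ->
# attention-weighted pooling -> concatenation -> fully connected classifier.
import torch
import torch.nn as nn

def pyramid_chunks(x: torch.Tensor, levels: int, dim: int):
    """All segments of a pyramid with 2**(m-1) equal chunks at level m along `dim`."""
    out = []
    for m in range(1, levels + 1):
        out.extend(torch.chunk(x, 2 ** (m - 1), dim=dim))
    return out

def classify_video(modal_sequences, num_classes=5, M=3, N=3, single_label=True):
    """modal_sequences: list of (T, C) first feature sequences (one per modality)."""
    pooled = []
    for seq in modal_sequences:
        for second_seq in pyramid_chunks(seq, M, dim=0):          # temporal pyramid
            for sub_seq in pyramid_chunks(second_seq, N, dim=1):  # channel pyramid
                # illustrative stand-in for learned attention weights over time
                z = torch.softmax(sub_seq.mean(dim=1, keepdim=True), dim=0)
                pooled.append((z * sub_seq).sum(dim=0))           # weighted sum -> vector
    target_vector = torch.cat(pooled, dim=0)                      # splicing operation
    fc = nn.Linear(target_vector.numel(), num_classes)            # full-connection network (untrained here)
    logits = fc(target_vector)
    return torch.softmax(logits, dim=-1) if single_label else torch.sigmoid(logits)

# Example with two modalities (e.g. image and optical-flow feature sequences)
probs = classify_video([torch.randn(8, 64), torch.randn(8, 64)])
print(probs.shape, float(probs.sum()))    # torch.Size([5]) ~1.0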
With the above approach, the temporal pyramid attention network overcomes the weakness of the prior art in not considering temporal information, and the channel pyramid attention network improves overall classification accuracy and efficiency. As a result, the video classification method in this embodiment can obtain good results in single-label, multi-label, short-video, long-video, weakly time-dependent and strongly time-dependent classification scenarios. Using this method also reduces the time needed to train and tune models for different classification scenarios, makes the whole process more concise and intelligent, and saves labor cost.
The following describes a video classification apparatus according to an embodiment of the present invention.
Referring to fig. 6, a block diagram of a video classification apparatus 600 according to an embodiment of the present invention is shown. As shown in fig. 6, the video classification apparatus 600 includes:
a first obtaining module 601, configured to obtain a first feature sequence of a video to be classified; wherein the features in the first feature sequence are arranged in time order;
a second obtaining module 602, configured to input the first feature sequence into the target pyramid attention network, and obtain a first output result output by the target pyramid attention network;
a third obtaining module 603, configured to obtain a target vector according to the first output result;
the classification module 604 is configured to classify the video to be classified according to the target vector.
Optionally, the target pyramid attention network is a temporal pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequences according to the corresponding time scales, the second feature sequences in each feature sequence set are arranged according to the time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer larger than 1.
Optionally, the larger the value of M, the longer the video duration of the video to be classified.
Optionally, the third obtaining module 603 includes:
a first obtaining unit, configured to input each second feature sequence in the first output result into the channel pyramid attention network, so as to obtain a second output result that is output by the channel pyramid attention network and corresponds to each second feature sequence;
the second obtaining unit is used for obtaining a target vector according to a second output result corresponding to each second characteristic sequence;
the second output result corresponding to any second feature sequence comprises N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing a second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer larger than 1.
Optionally, the larger the value of N, the longer the video duration of the video to be classified.
Optionally, any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence included in the second output result;
a second obtaining unit including:
the first obtaining subunit is configured to perform weighted summation on each sub-feature sequence in each second output result according to each sub-feature and corresponding weight in the sub-feature sequence, so as to obtain a corresponding feature vector;
the second obtaining subunit is used for performing splicing operation according to the feature vectors corresponding to all the sub-feature sequences to obtain a spliced vector;
and the determining subunit is used for taking the splicing vector as a target vector.
Optionally, the classification module 604 is specifically configured to:
and inputting the target vector into the full-connection network to obtain a classification result of the video to be classified output by the full-connection network.
Optionally, the number of the first feature sequences is at least two, and the feature types corresponding to each first feature sequence are different.
Optionally, the at least two first feature sequences comprise a first target feature sequence, a second target feature sequence and a third target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
In the embodiment of the invention, in order to classify a video, the first feature sequence of the video to be classified can be input into the target pyramid attention network, a target vector is then obtained according to the first output result output by the target pyramid attention network, and finally the video to be classified can be classified according to the target vector. Therefore, in the embodiment of the invention, video classification can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified need to be analyzed, this effectively improves classification efficiency; moreover, the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.
The following describes an electronic device provided in an embodiment of the present invention.
Referring to fig. 7, a schematic structural diagram of an electronic device 700 according to an embodiment of the present invention is shown. As shown in fig. 7, the electronic device 700 includes: a processor 701, a memory 703, a user interface 704, and a bus interface.
The processor 701 is configured to read the program in the memory 703 and execute the following processes:
obtaining a first characteristic sequence of a video to be classified; wherein the features in the first feature sequence are arranged in time order;
inputting the first characteristic sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
obtaining a target vector according to the first output result;
and classifying the video to be classified according to the target vector.
In FIG. 7, the bus architecture may include any number of interconnected buses and bridges, linking together one or more processors represented by the processor 701 and various circuits of memory represented by the memory 703. The bus architecture may also link together various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further here. The bus interface provides an interface. For different user devices, the user interface 704 may also be an interface capable of connecting to required devices, including but not limited to a keypad, a display, a speaker, a microphone, a joystick, and the like.
The processor 701 is responsible for managing the bus architecture and general processing, and the memory 703 may store data used by the processor 701 in performing operations.
Optionally, the target pyramid attention network is a temporal pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequences according to the corresponding time scales, the second feature sequences in each feature sequence set are arranged according to the time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer larger than 1.
Optionally, the larger the value of M, the longer the video duration of the video to be classified.
Optionally, the processor 701 is specifically configured to:
inputting each second feature sequence in the first output result into the channel pyramid attention network respectively to obtain a second output result which is output by the channel pyramid attention network respectively and corresponds to each second feature sequence;
obtaining a target vector according to a second output result corresponding to each second characteristic sequence;
the second output result corresponding to any second feature sequence comprises N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing a second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer larger than 1.
Optionally, the larger the value of N, the longer the video duration of the video to be classified.
Optionally, any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence included in the second output result;
the processor 701 is specifically configured to:
for each sub-feature sequence in each second output result, performing weighted summation according to each sub-feature and corresponding weight in the sub-feature sequence to obtain a corresponding feature vector;
performing splicing operation according to the feature vectors corresponding to all the sub-feature sequences to obtain spliced vectors;
and taking the splicing vector as a target vector.
Optionally, the processor 701 is specifically configured to:
and inputting the target vector into the full-connection network to obtain a classification result of the video to be classified output by the full-connection network.
Optionally, the number of the first feature sequences is at least two, and the feature types corresponding to each first feature sequence are different.
Optionally, the at least two first feature sequences comprise a first target feature sequence, a second target feature sequence and a third target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
In the embodiment of the invention, in order to classify a video, the first feature sequence of the video to be classified can be input into the target pyramid attention network, a target vector is then obtained according to the first output result output by the target pyramid attention network, and finally the video to be classified can be classified according to the target vector. Therefore, in the embodiment of the invention, video classification can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified need to be analyzed, this effectively improves classification efficiency; moreover, the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.
Preferably, an embodiment of the present invention further provides an electronic device, which includes a processor 701, a memory 703, and a computer program stored in the memory 703 and capable of running on the processor 701, where the computer program, when executed by the processor 701, implements each process of the above-mentioned video classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the video classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (16)

1. A method for video classification, the method comprising:
obtaining a first characteristic sequence of a video to be classified; wherein the features in the first feature sequence are arranged in chronological order;
inputting the first characteristic sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
obtaining a target vector according to the first output result;
classifying the video to be classified according to the target vector;
the target pyramid attention network is a time pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequence according to the corresponding time scale, the second feature sequences in each feature sequence set are arranged according to a time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer greater than 1;
the obtaining a target vector according to the first output result includes:
inputting each second feature sequence in the first output result into a channel pyramid attention network respectively to obtain a second output result which is output by the channel pyramid attention network respectively and corresponds to each second feature sequence;
obtaining a target vector according to a second output result corresponding to each second characteristic sequence;
the second output result corresponding to any one of the second feature sequences includes N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing one second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer greater than 1.
2. The method according to claim 1, wherein the larger the value of M, the longer the video duration of the video to be classified.
3. The method according to claim 1, wherein the larger the value of N, the longer the video duration of the video to be classified.
4. The method according to claim 1, wherein any one of the second output results further includes a weight corresponding to each of the sub-features included in each of the sub-feature sequences;
the obtaining a target vector according to a second output result corresponding to each second feature sequence includes:
for each sub-feature sequence in each second output result, performing weighted summation according to each sub-feature and corresponding weight in the sub-feature sequence to obtain a corresponding feature vector;
performing splicing operation according to the feature vectors corresponding to all the sub-feature sequences to obtain spliced vectors;
and taking the splicing vector as a target vector.
5. The method according to claim 1, wherein the classifying the video to be classified according to the target vector comprises:
and inputting the target vector into a full-connection network to obtain a classification result of the video to be classified output by the full-connection network.
6. The method according to claim 1, wherein the number of the first feature sequences is at least two, and the feature types corresponding to each of the first feature sequences are different.
7. The method of claim 6, wherein at least two of the first feature sequences comprise a first target feature sequence, a second target feature sequence, and a third target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
8. An apparatus for video classification, the apparatus comprising:
the first obtaining module is used for obtaining a first characteristic sequence of the video to be classified; wherein the features in the first feature sequence are arranged in chronological order;
a second obtaining module, configured to input the first feature sequence into a target pyramid attention network, and obtain a first output result output by the target pyramid attention network;
a third obtaining module, configured to obtain a target vector according to the first output result;
the classification module is used for classifying the video to be classified according to the target vector;
the target pyramid attention network is a time pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequence according to the corresponding time scale, the second feature sequences in each feature sequence set are arranged according to a time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer greater than 1;
the third obtaining module includes:
a first obtaining unit, configured to input each second feature sequence in the first output result into a channel pyramid attention network, so as to obtain a second output result, which is output by the channel pyramid attention network and corresponds to each second feature sequence;
a second obtaining unit, configured to obtain a target vector according to a second output result corresponding to each second feature sequence;
the second output result corresponding to any one of the second feature sequences includes N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing one second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer greater than 1.
9. The apparatus of claim 8, wherein the value of M is positively correlated with the video duration of the video to be classified.
10. The apparatus of claim 8, wherein the value of N is positively correlated with the video duration of the video to be classified.
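Claims 9 and 10 only require that larger M and N go with longer videos; a hypothetical mapping from duration to pyramid depths might look like the following, where every threshold is invented for illustration.

    def choose_pyramid_depths(duration_seconds):
        # Claims 9 and 10 only require M and N to grow with the video duration;
        # the cut-off values below are invented for illustration.
        if duration_seconds < 60:
            return 2, 2   # M, N for short videos
        if duration_seconds < 600:
            return 3, 3
        return 4, 4

    print(choose_pyramid_depths(30))    # (2, 2)
    print(choose_pyramid_depths(1200))  # (4, 4)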
11. The apparatus according to claim 8, wherein any one of the second output results further includes a weight corresponding to each of the sub-features included in each of the sub-feature sequences;
the second obtaining unit includes:
a first obtaining subunit, configured to perform, for each sub-feature sequence in each second output result, a weighted summation of the sub-features in the sub-feature sequence with their corresponding weights to obtain a corresponding feature vector;
a second obtaining subunit, configured to perform a splicing operation on the feature vectors corresponding to all the sub-feature sequences to obtain a spliced vector;
and a determining subunit, configured to take the spliced vector as the target vector.
12. The apparatus of claim 8, wherein the classification module is specifically configured to:
and inputting the target vector into a fully-connected network to obtain a classification result of the video to be classified output by the fully-connected network.
13. The apparatus according to claim 8, wherein the number of first feature sequences is at least two, and the feature types corresponding to the first feature sequences are different from one another.
14. The apparatus of claim 13, wherein the at least two first feature sequences comprise a first target feature sequence, a second target feature sequence, and a third target feature sequence; wherein
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
15. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the video classification method according to any one of claims 1 to 7.
16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the video classification method according to any one of claims 1 to 7.
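Taken together, claims 1 to 7 outline a pipeline from a first feature sequence to a class prediction. The sketch below strings the pieces together in NumPy, with uniform weights standing in for the learned attention of the time and channel pyramid attention networks and random weights for the fully-connected layer; all dimensions, the even splitting scheme, and the uniform weighting are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def time_pyramid(first_seq, M):
        # M feature sequence sets at assumed even time scales.
        return [np.array_split(first_seq, 2 ** m, axis=0) for m in range(M)]

    def channel_pyramid(second_seq, N):
        # N sub-feature sequence sets at assumed even feature fine granularities.
        return [np.array_split(second_seq, 2 ** n, axis=1) for n in range(N)]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def classify_video(first_seq, M=2, N=2, num_classes=5):
        feature_vectors = []
        for second_seqs in time_pyramid(first_seq, M):           # time pyramid stage
            for second_seq in second_seqs:
                for sub_seqs in channel_pyramid(second_seq, N):  # channel pyramid stage
                    for sub_seq in sub_seqs:
                        # Uniform weights stand in for the learned attention weights.
                        w = np.full(sub_seq.shape[0], 1.0 / sub_seq.shape[0])
                        feature_vectors.append(w @ sub_seq)
        target_vec = np.concatenate(feature_vectors)             # splicing operation
        weight = rng.standard_normal((num_classes, target_vec.shape[0]))
        return softmax(weight @ target_vec)                      # fully-connected classification

    probs = classify_video(rng.random((16, 8)))
    print(int(probs.argmax()))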
CN201910357559.2A 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium Active CN110096617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910357559.2A CN110096617B (en) 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium


Publications (2)

Publication Number Publication Date
CN110096617A CN110096617A (en) 2019-08-06
CN110096617B true CN110096617B (en) 2021-08-10

Family

ID=67446566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910357559.2A Active CN110096617B (en) 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN110096617B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291643B (en) * 2020-01-20 2023-08-22 北京百度网讯科技有限公司 Video multi-label classification method, device, electronic equipment and storage medium
CN111246256B (en) * 2020-02-21 2021-05-25 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN111491187B (en) * 2020-04-15 2023-10-31 腾讯科技(深圳)有限公司 Video recommendation method, device, equipment and storage medium
CN111797800B (en) * 2020-07-14 2024-03-05 中国传媒大学 Video classification method based on content mining
CN112507920B (en) * 2020-12-16 2023-01-24 重庆交通大学 Examination abnormal behavior identification method based on time displacement and attention mechanism

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830212A (en) * 2018-06-12 2018-11-16 北京大学深圳研究生院 A kind of video behavior time shaft detection method
CN109359592A (en) * 2018-10-16 2019-02-19 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of video frame
CN109670453A (en) * 2018-12-20 2019-04-23 杭州东信北邮信息技术有限公司 A method of extracting short video subject

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016054779A1 (en) * 2014-10-09 2016-04-14 Microsoft Technology Licensing, Llc Spatial pyramid pooling networks for image processing
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN108416795B (en) * 2018-03-04 2022-03-18 南京理工大学 Video action identification method based on sorting pooling fusion space characteristics
CN109389055B (en) * 2018-09-21 2021-07-20 西安电子科技大学 Video classification method based on mixed convolution and attention mechanism

Similar Documents

Publication Publication Date Title
CN110096617B (en) Video classification method and device, electronic equipment and computer-readable storage medium
JP7127120B2 (en) Video classification method, information processing method and server, and computer readable storage medium and computer program
CN108334910B (en) Event detection model training method and event detection method
CN110096938B (en) Method and device for processing action behaviors in video
CN111062964B (en) Image segmentation method and related device
CN113095346A (en) Data labeling method and data labeling device
KR102042168B1 (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
CN111783712A (en) Video processing method, device, equipment and medium
CN108052862A (en) Age predictor method and device
US20230004608A1 (en) Method for content recommendation and device
CN112016450A (en) Training method and device of machine learning model and electronic equipment
CN113761359B (en) Data packet recommendation method, device, electronic equipment and storage medium
CN111061867A (en) Text generation method, equipment, storage medium and device based on quality perception
CN115082752A (en) Target detection model training method, device, equipment and medium based on weak supervision
CN113806501B (en) Training method of intention recognition model, intention recognition method and equipment
CN111144567A (en) Training method and device of neural network model
CN112100509B (en) Information recommendation method, device, server and storage medium
CN108665455B (en) Method and device for evaluating image significance prediction result
CN114170484B (en) Picture attribute prediction method and device, electronic equipment and storage medium
CN115063858A (en) Video facial expression recognition model training method, device, equipment and storage medium
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
CN114281942A (en) Question and answer processing method, related equipment and readable storage medium
CN108985456B (en) Number-of-layers-increasing deep learning neural network training method, system, medium, and device
CN115858911A (en) Information recommendation method and device, electronic equipment and computer-readable storage medium
CN113177603A (en) Training method of classification model, video classification method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant