CN110096617B - Video classification method and device, electronic equipment and computer-readable storage medium - Google Patents

Video classification method and device, electronic equipment and computer-readable storage medium Download PDF

Info

Publication number
CN110096617B
CN110096617B CN201910357559.2A
Authority
CN
China
Prior art keywords
feature
video
sequence
target
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910357559.2A
Other languages
Chinese (zh)
Other versions
CN110096617A (en)
Inventor
龙翔
何栋梁
李甫
迟至真
周志超
赵翔
李鑫
文石磊
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910357559.2A priority Critical patent/CN110096617B/en
Publication of CN110096617A publication Critical patent/CN110096617A/en
Application granted granted Critical
Publication of CN110096617B publication Critical patent/CN110096617B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides a video classification method, a video classification device, electronic equipment and a computer-readable storage medium. The method comprises the following steps: obtaining a first feature sequence of a video to be classified, wherein the features in the first feature sequence are arranged in time order; inputting the first feature sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network; obtaining a target vector according to the first output result; and classifying the video to be classified according to the target vector. Compared with the prior art, the embodiment of the invention can effectively improve video classification efficiency, and the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.

Description

Video classification method and device, electronic equipment and computer-readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of video classification, in particular to a video classification method, a video classification device, electronic equipment and a computer-readable storage medium.
Background
Video classification is one of the most important and basic tasks in computer vision: by analyzing and understanding the relevant information of a video, the video is classified into pre-defined categories. Video classification plays a key role in application scenarios such as video search and video recommendation, and is also an important foundation for video technologies such as video tagging, video monitoring and video title generation.
At present, a common video classification method is to directly input all frames of a video into a device for video classification and obtain the classification result output by the device. In this way, all frames of the video need to be analyzed, so the classification efficiency is very low.
Disclosure of Invention
The embodiment of the invention provides a video classification method, a video classification device, electronic equipment and a computer readable storage medium, and aims to solve the problem of low classification efficiency of the existing video classification mode.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a video classification method, where the method includes:
obtaining a first characteristic sequence of a video to be classified; wherein the features in the first feature sequence are arranged in chronological order;
inputting the first characteristic sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
obtaining a target vector according to the first output result;
and classifying the video to be classified according to the target vector.
In a second aspect, an embodiment of the present invention provides a video classification apparatus, where the apparatus includes:
the first obtaining module is used for obtaining a first characteristic sequence of the video to be classified; wherein the features in the first feature sequence are arranged in chronological order;
a second obtaining module, configured to input the first feature sequence into a target pyramid attention network, and obtain a first output result output by the target pyramid attention network;
a third obtaining module, configured to obtain a target vector according to the first output result;
and the classification module is used for classifying the video to be classified according to the target vector.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the video classification method described above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the video classification method described above.
In the embodiment of the invention, in order to classify a video, the first feature sequence of the video to be classified can be input into the target pyramid attention network, a target vector is then obtained according to the first output result output by the target pyramid attention network, and finally the video to be classified can be classified according to the target vector. Therefore, in the embodiment of the invention, video classification can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified need to be analyzed, this effectively improves classification efficiency; moreover, the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a flowchart of a video classification method provided by an embodiment of the present invention;
FIG. 2 is a first schematic diagram of a video classification method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a usage sequence of a charging plug;
FIG. 4 is a second schematic diagram of a video classification method according to an embodiment of the present invention;
FIG. 5 is a third schematic diagram of a video classification method according to an embodiment of the present invention;
FIG. 6 is a block diagram of a video classification apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, a video classification method provided by an embodiment of the present invention is described below.
It should be noted that the video classification method provided by the embodiment of the present invention is applied to electronic devices. Specifically, the electronic device may be a server; of course, the type of electronic device is not limited to a server, and the electronic device may also be another type of device that can be used for video classification.
Referring to fig. 1, a flowchart of a video classification method according to an embodiment of the present invention is shown. As shown in fig. 1, the method comprises the steps of:
Step 101: obtaining a first feature sequence of a video to be classified; wherein the features in the first feature sequence are arranged in chronological order.
In step 101, the electronic device may extract key features of the video by using a model based on a Convolutional Neural Network (CNN) to obtain the first feature sequence of the video to be classified, where the features in the first feature sequence may be arranged in time order from earliest to latest. It is understood that a CNN is a class of feedforward neural networks (FNN) that involve convolution computations and have a deep structure, and CNNs are among the representative algorithms of deep learning.
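Purely as an illustration of this step, the sketch below builds a first feature sequence from sampled frames with an off-the-shelf CNN backbone; the choice of ResNet-18, the frame count and the helper name extract_feature_sequence are assumptions for illustration and are not prescribed by this embodiment.

# Hypothetical sketch: building a first feature sequence from sampled video frames
# with a generic CNN backbone (ResNet-18 is only an example choice; randomly
# initialized here, whereas a pretrained backbone would be used in practice).
import torch
import torch.nn as nn
from torchvision import models

def extract_feature_sequence(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) tensor of T sampled frames, already normalized.
    Returns a (T, C) feature sequence ordered by time."""
    backbone = models.resnet18()          # any CNN feature extractor
    backbone.fc = nn.Identity()           # drop the classification head
    backbone.eval()
    with torch.no_grad():
        features = backbone(frames)       # (T, 512) for ResNet-18
    return features                       # features[t] corresponds to frame t

# Example: 8 sampled frames at 224x224 -> an 8-step feature sequence
x = torch.randn(8, 3, 224, 224)
first_feature_sequence = extract_feature_sequence(x)
print(first_feature_sequence.shape)       # torch.Size([8, 512])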
Step 102: inputting the first feature sequence into the target pyramid attention network to obtain a first output result output by the target pyramid attention network.
Here, only one type of pyramid attention network may be included in the target pyramid attention network, e.g., only a temporal pyramid attention network or a channel pyramid attention network may be included in the target pyramid attention network; alternatively, at least two types of pyramid attention networks may be included in the target pyramid attention network, for example, both the temporal pyramid attention network and the channel pyramid attention network may be included in the target pyramid attention network.
If the target pyramid attention network includes at least two types of pyramid attention networks, in step 102, the first feature sequence may be respectively input into each type of pyramid attention network to respectively obtain a first output result output by each type of pyramid attention network, and subsequent step 103 may be performed according to the first output result of each type of pyramid attention network.
Step 103: obtaining a target vector according to the first output result.
Here, the target vector is a vector that can represent the features of the entire video to be classified. It should be noted that there are various specific implementations for obtaining the target vector according to the first output result; for clarity of layout, examples are described below.
Step 104: classifying the video to be classified according to the target vector.
It should be noted that the embodiment of the present invention may involve K video categories, denoted in turn as B_1, B_2, ..., B_K, where K is an integer greater than 1. After the video to be classified is classified according to the target vector, the classification result obtained by the electronic device may include K probability values, denoted in turn as G_1, G_2, ..., G_K, where G_1 is the probability that the video to be classified belongs to category B_1, G_2 is the probability that it belongs to category B_2, ..., and G_K is the probability that it belongs to category B_K.
In step 104, if single-label classification is performed on the video to be classified, the sum of the K probability values G_1, G_2, ..., G_K is 1; if multi-label classification is performed, the sum of the K probability values G_1, G_2, ..., G_K may or may not be 1.
In the embodiment of the invention, in order to classify a video, the first feature sequence of the video to be classified can be input into the target pyramid attention network, a target vector is then obtained according to the first output result output by the target pyramid attention network, and finally the video to be classified can be classified according to the target vector. Therefore, in the embodiment of the invention, video classification can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified need to be analyzed, this effectively improves classification efficiency; moreover, the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.
Optionally, classifying the video to be classified according to the target vector, including:
and inputting the target vector into the full-connection network to obtain a classification result of the video to be classified output by the full-connection network.
Here, the full-connection network may be regarded as a pre-trained classification model stored locally in the electronic device; the classification model may be obtained by training with the target vectors of a large number of videos as input and the categories of those videos as output. Specifically, the classification model may be trained by the electronic device itself; alternatively, the classification model may be trained by another device and then delivered to the electronic device.
In this embodiment, the classification result of the video to be classified can be obtained simply by inputting the target vector into the full-connection network, so the operation of obtaining the classification result is very convenient to implement.
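As a minimal sketch only, a fully connected classification head over the target vector might look as follows; the dimensions, the single_label switch, and the use of softmax versus sigmoid to match the single-label and multi-label cases discussed above are illustrative assumptions rather than the prescribed implementation.

# Hypothetical sketch of the full-connection classification head.
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, target_dim: int, num_classes: int, single_label: bool = True):
        super().__init__()
        self.fc = nn.Linear(target_dim, num_classes)    # maps target vector to K logits
        self.single_label = single_label

    def forward(self, target_vector: torch.Tensor) -> torch.Tensor:
        logits = self.fc(target_vector)                  # (batch, K)
        if self.single_label:
            return torch.softmax(logits, dim=-1)         # K probabilities summing to 1
        return torch.sigmoid(logits)                     # K independent probabilities

# Example: a 1024-d target vector classified into K=5 categories
clf = VideoClassifier(target_dim=1024, num_classes=5, single_label=True)
probs = clf(torch.randn(2, 1024))
print(probs.sum(dim=-1))                                 # ~1.0 per video in the single-label case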
Optionally, the target pyramid attention network is a temporal pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequences according to the corresponding time scales, the second feature sequences in each feature sequence set are arranged according to the time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer larger than 1.
Here, the value of M may be 2, 3, 4, 5, 6, or an integer greater than 6, which is not listed here. In addition, the number of second feature sequences in each feature sequence set may be different due to the different time scales of the feature sequence sets.
Suppose the first feature sequence of the video to be classified is X^(1) in FIG. 2, which includes features x_1, x_2, x_3, x_4, x_5, x_6, x_7 and x_8 arranged in time order. After X^(1) is input into the temporal pyramid attention network, the first output result output by the temporal pyramid attention network may include 3 feature sequence sets. That is, M takes the value 3, and the pyramid of the temporal pyramid attention network can be considered to have 3 levels, such as level 1, level 2 and level 3 in FIG. 2, where level 1, level 2 and level 3 respectively correspond to feature sequence sets of different time scales.
Specifically, the feature sequence set corresponding to level 1 may consist of a single second feature sequence, with X^(1) itself serving as this second feature sequence. The feature sequence set corresponding to level 2 may consist of the two second feature sequences X^(2)_1 and X^(2)_2, where X^(2)_1 includes x_1, x_2, x_3 and x_4 arranged in time order, and X^(2)_2 includes x_5, x_6, x_7 and x_8 arranged in time order. The feature sequence set corresponding to level 3 may consist of the four second feature sequences X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4, where X^(3)_1 includes x_1 and x_2, X^(3)_2 includes x_3 and x_4, X^(3)_3 includes x_5 and x_6, and X^(3)_4 includes x_7 and x_8, each arranged in time order.
It is easy to see that the feature sequence set corresponding to level 1 is obtained by keeping X^(1) as a single temporal segment, the feature sequence set corresponding to level 2 is obtained by dividing X^(1) into two equal temporal parts, and the feature sequence set corresponding to level 3 is obtained by dividing X^(1) into four equal temporal parts.
In this way, the electronic device can obtain a first output result including the feature sequence set corresponding to level 1, the feature sequence set corresponding to level 2, and the feature sequence set corresponding to level 3. Then, a target vector can be obtained according to the first output result, and classification of the video to be classified is realized according to the target vector.
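For illustration, the temporal division described above (one segment at level 1, two at level 2, four at level 3) can be sketched as follows; splitting level m into 2^(m-1) equal parts is an assumption that matches the 8-feature example of FIG. 2 rather than a mandated rule.

# Hypothetical sketch: dividing a first feature sequence (T, C) into M temporal levels,
# where level m holds 2**(m-1) equal-length second feature sequences.
import torch

def temporal_pyramid_split(features: torch.Tensor, num_levels: int):
    """features: (T, C) first feature sequence in time order.
    Returns a list of M feature sequence sets; set m is a list of second feature sequences."""
    levels = []
    for m in range(1, num_levels + 1):
        num_segments = 2 ** (m - 1)
        levels.append(list(torch.chunk(features, num_segments, dim=0)))
    return levels

x = torch.randn(8, 512)                     # x_1 ... x_8, as in the FIG. 2 example
pyramid = temporal_pyramid_split(x, num_levels=3)
print([len(level) for level in pyramid])    # [1, 2, 4]
print(pyramid[2][0].shape)                  # torch.Size([2, 512]) -> X^(3)_1 = (x_1, x_2)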
It should be noted that, when classifying videos, if the chronological order of the video is not considered at all, all features are put into an unordered set, all key features are treated equally within one group, and the chronological association between features is ignored completely; this is effective in some scenarios but not in others. For example, as shown in FIG. 3, if all the key features are treated out of order in one group, it is impossible to distinguish whether the user is inserting the charging plug into the outlet or pulling it out.
In view of this, in this embodiment, a temporal pyramid attention network may be used: the first feature sequence of the video to be classified is first divided into a plurality of second feature sequences under a plurality of time scales, yielding feature sequence sets of the plurality of time scales; the second feature sequences in each feature sequence set are arranged in time order, and the features in each second feature sequence are also arranged in time order. In this way, temporal order is introduced into an otherwise unordered attention mechanism, which effectively addresses video classification problems with strong temporal dependency. It can be seen that this embodiment is suitable not only for video classification in weakly time-dependent scenarios but also for video classification in strongly time-dependent scenarios.
Optionally, the larger the value of M, the longer the video duration of the video to be classified.
Specifically, the electronic device may pre-store a correspondence between video duration ranges and values of M; for example, the duration range of 10 to 15 minutes may correspond to a value of 5, the duration range of 5 to 10 minutes to a value of 4, and the duration range of 0 to 5 minutes to a value of 3.
Then, in the case that the video duration of the video to be classified is within the video duration range of 10 minutes to 15 minutes, the first output result may include 5 feature sequence sets with different time scales, and at this time, the temporal pyramid of the temporal pyramid attention network has a level of 5. Under the condition that the video duration of the video to be classified is within the video duration range of 0 minute to 5 minutes, the first output result may include 3 feature sequence sets with different time scales, and at this time, the time pyramid of the time pyramid attention network has a level of 3.
As can be seen, in this embodiment, when the time pyramid attention network is used, the hierarchy of the time pyramid is not completely fixed, and the hierarchy of the time pyramid can be flexibly adjusted according to the duration of the video to be classified, so that the hierarchy of the time pyramid is matched with the duration of the video to be classified, thereby ensuring the classification efficiency and the classification effect.
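As a toy illustration of such a pre-stored correspondence, the lookup below simply restates the example ranges given above; the exact boundaries and the fallback value are assumptions.

# Hypothetical sketch of a pre-stored correspondence between video duration and M
# (the same idea could be reused for N in the channel pyramid).
def pyramid_levels_for_duration(duration_minutes: float) -> int:
    # (lower bound inclusive, upper bound exclusive, value of M) -- illustrative table only
    table = [(0, 5, 3), (5, 10, 4), (10, 15, 5)]
    for low, high, levels in table:
        if low <= duration_minutes < high:
            return levels
    return 5  # assumed fallback for videos longer than the last range

print(pyramid_levels_for_duration(3.0))    # 3
print(pyramid_levels_for_duration(12.0))   # 5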
Optionally, obtaining a target vector according to the first output result includes:
inputting each second feature sequence in the first output result into the channel pyramid attention network respectively to obtain a second output result which is output by the channel pyramid attention network respectively and corresponds to each second feature sequence;
obtaining a target vector according to a second output result corresponding to each second characteristic sequence;
the second output result corresponding to any second feature sequence comprises N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing a second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer larger than 1.
Here, N may be 2, 3, 4, 5, 6, or an integer greater than 6, which is not listed here. In addition, the value of M and the value of N can be the same or different.
Assuming the channel pyramid attention network is denoted CPAtt, as shown in FIG. 2, after obtaining the first output result containing the 7 second feature sequences X^(1), X^(2)_1, X^(2)_2, X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4, the electronic device can input these 7 second feature sequences into CPAtt respectively to obtain the 7 corresponding second output results output by CPAtt.
Assume that one of the 7 second feature sequences is also represented as X^(1) in FIG. 4, and that X^(1) includes features x_1, x_2, ..., x_L arranged in time order. After X^(1) is input into the channel pyramid attention network, the second output result output by the channel pyramid attention network may include 3 sub-feature sequence sets with different feature fine granularities. That is, N takes the value 3, and the pyramid of the channel pyramid attention network can be considered to have 3 levels, such as level 1, level 2 and level 3 in FIG. 4, where level 1, level 2 and level 3 respectively correspond to sub-feature sequence sets of different feature fine granularities.
Specifically, the sub-feature sequence set corresponding to level 1 consists of a single sub-feature sequence, with X^(1) itself serving as this sub-feature sequence. The sub-feature sequence set corresponding to level 2 may consist of the two sub-feature sequences X^(2)_1 and X^(2)_2, where X^(2)_1 includes one of the two sub-features obtained by splitting x_1, one of the two sub-features obtained by splitting x_2, ..., and one of the two sub-features obtained by splitting x_L, and X^(2)_2 includes the other sub-feature obtained by splitting x_1, the other obtained by splitting x_2, ..., and the other obtained by splitting x_L. The sub-feature sequence set corresponding to level 3 may consist of the four sub-feature sequences X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4, where X^(3)_1 includes the first of the four sub-features obtained by splitting x_1, the first of the four obtained by splitting x_2, ..., and the first of the four obtained by splitting x_L; X^(3)_2 includes the second of the four sub-features obtained by splitting x_1, the second obtained by splitting x_2, ..., and the second obtained by splitting x_L; the contents of X^(3)_3 and X^(3)_4 follow by analogy and are not detailed here.
It should be noted that, the content included in the second output result corresponding to the other second feature sequences may refer to the above description, and is not described herein again. And then, obtaining a target vector according to the second output result corresponding to each second characteristic sequence.
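A minimal sketch of this channel-wise division is shown below, assuming each feature vector's channels are split into 2^(n-1) contiguous equal parts at level n; the contiguous-split choice mirrors the FIG. 4 example and is only an assumption.

# Hypothetical sketch: dividing each feature of a second feature sequence (L, C)
# into N feature fine granularities; level n yields 2**(n-1) sub-feature sequences.
import torch

def channel_pyramid_split(second_sequence: torch.Tensor, num_levels: int):
    """second_sequence: (L, C). Returns a list of N sets; set n contains 2**(n-1)
    sub-feature sequences of shape (L, C // 2**(n-1)), each still in time order."""
    levels = []
    for n in range(1, num_levels + 1):
        num_parts = 2 ** (n - 1)
        # split along the channel dimension, keeping the time dimension intact
        levels.append(list(torch.chunk(second_sequence, num_parts, dim=1)))
    return levels

seq = torch.randn(4, 512)                        # L = 4 features, C = 512 channels
sets = channel_pyramid_split(seq, num_levels=3)
print([len(s) for s in sets])                    # [1, 2, 4]
print(sets[1][0].shape, sets[2][0].shape)        # (4, 256) and (4, 128)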
In a specific embodiment, any one of the second output results further includes a weight corresponding to each sub-feature in each sub-feature sequence included in the second output result;
obtaining a target vector according to a second output result corresponding to each second feature sequence, including:
for each sub-feature sequence in each second output result, performing weighted summation according to each sub-feature and corresponding weight in the sub-feature sequence to obtain a corresponding feature vector;
performing splicing operation according to the feature vectors corresponding to all the sub-feature sequences to obtain spliced vectors;
and taking the splicing vector as a target vector.
Specifically, for the sub-feature sequence X^(2)_1 described above, assume that it includes, in order, the sub-features x_11, x_21, ..., x_L1, where x_11, x_21, ..., x_L1 are all in vector form, the weight corresponding to x_11 is z_1, the weight corresponding to x_21 is z_2, ..., and the weight corresponding to x_L1 is z_L. Then the feature vector y corresponding to X^(2)_1 can be calculated by the following formula:
y = x_11 · z_1 + x_21 · z_2 + ... + x_L1 · z_L
Note that the feature vectors corresponding to the other sub-feature sequences are calculated in the same manner as described above for X^(2)_1, which is not repeated here. After the feature vectors corresponding to all the sub-feature sequences have been obtained, a splicing operation may be performed on these feature vectors to obtain a spliced vector used as the target vector.
It should be noted that Att in FIG. 4 can be regarded as the feature vector calculation operation, and the Contact operation in FIG. 2 and FIG. 4 can be regarded as a vector splicing (concatenation) operation. As shown in FIG. 4, a splicing operation can be performed on the feature vector corresponding to X^(2)_1 and the feature vector corresponding to X^(2)_2 to obtain a first spliced vector, e.g. y^(2) in FIG. 4; a splicing operation can be performed on the feature vectors corresponding to X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4 to obtain a second spliced vector, e.g. y^(3) in FIG. 4. Then, a splicing operation is performed on the feature vector corresponding to X^(1) (e.g. y^(1) in FIG. 4), the first spliced vector and the second spliced vector to obtain a third spliced vector, which corresponds to one of the 7 second feature sequences.
Then, in a similar manner to the above procedure, the 6 third spliced vectors corresponding to the other 6 second feature sequences can be obtained; that is, 7 third spliced vectors corresponding to the 7 second feature sequences X^(1), X^(2)_1, X^(2)_2, X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4 are finally obtained. At this point, as shown in FIG. 2, a splicing operation can be performed on the third spliced vector corresponding to X^(2)_1 and the third spliced vector corresponding to X^(2)_2 to obtain a fourth spliced vector, e.g. y^(2) in FIG. 2; a splicing operation can be performed on the third spliced vectors corresponding to X^(3)_1, X^(3)_2, X^(3)_3 and X^(3)_4 to obtain a fifth spliced vector, e.g. y^(3) in FIG. 2. Then, a splicing operation is performed on the third spliced vector corresponding to X^(1) (e.g. y^(1) in FIG. 2), the fourth spliced vector and the fifth spliced vector to obtain the spliced vector used as the target vector.
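To make the weighting and splicing procedure concrete, the sketch below computes y = x_11·z_1 + ... + x_L1·z_L for each sub-feature sequence and then concatenates the resulting vectors; the way the weights z are produced here (a learned linear score followed by a softmax over time) is an assumption for illustration, since the embodiment only requires that each sub-feature have a corresponding weight.

# Hypothetical sketch: attention-weighted pooling of each sub-feature sequence,
# followed by the splicing (concatenation) that builds the target vector.
import torch
import torch.nn as nn

class SubSequenceAttention(nn.Module):
    """Computes one feature vector y = sum_t z_t * x_t for a sub-feature sequence (L, D)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)                         # assumed way of producing weights z

    def forward(self, sub_sequence: torch.Tensor) -> torch.Tensor:
        z = torch.softmax(self.score(sub_sequence), dim=0)     # (L, 1) weights over time
        return (z * sub_sequence).sum(dim=0)                   # (D,) weighted sum

# Example: weight and splice three sub-feature sequences into one vector
att64, att32 = SubSequenceAttention(64), SubSequenceAttention(32)
subs = [torch.randn(4, 64), torch.randn(4, 32), torch.randn(4, 32)]
vectors = [att64(subs[0]), att32(subs[1]), att32(subs[2])]
target_vector = torch.cat(vectors, dim=0)                      # splicing operation
print(target_vector.shape)                                     # torch.Size([128])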
It should be noted that, when classifying a video, the electronic device could directly compute one weight per feature and use these weights for video classification; however, in many cases only some of the channels in the video to be classified contribute to classification. For example, as shown in FIG. 5, the video to be classified may include two video frames, Frame1 and Frame2, both of which contribute to video classification, but the important channels in the two frames are significantly different: the important channels of Frame1 correspond to the region enclosed by rectangular box 510, and the important channels of Frame2 correspond to the region enclosed by rectangular box 520. On this basis, if a weight is assigned to the entire feature of each of the two video frames, as shown in the lower left corner of FIG. 5, only relatively balanced weights can be provided for the two frames; for example, the weights assigned to Feature1 and Feature2 may both be 0.5, so the weight of the irrelevant noise is also 0.5, and the important channels of both features become weaker after weighted averaging, which may lower the accuracy of video classification.
In view of this, in this embodiment, a channel pyramid attention network may be used to gradually divide each feature, from coarse to fine, into a plurality of sub-features and to assign a corresponding weight to each sub-feature. In this way, as shown in the lower right corner of FIG. 5, the weight of the significant portion of each feature may be set to 1.0 and the weight of the insignificant portion to 0.0; for example, the weight of the sub-feature in the upper half of Feature1 may be set to 1.0 and the weight of the sub-feature in its lower half to 0.0, while the weight of the sub-feature in the upper half of Feature2 may be set to 0.0 and the weight of the sub-feature in its lower half to 1.0. After the subsequent weighting operation, the important channel information is completely preserved, so a more accurate classification result can be obtained. Therefore, in this embodiment, using the channel pyramid attention network can effectively ensure the accuracy of the classification result.
Optionally, the larger the value of N, the longer the video duration of the video to be classified.
Specifically, the electronic device may pre-store a correspondence between video duration ranges and values of N; for example, the duration range of 10 to 15 minutes may correspond to a value of 5, the duration range of 5 to 10 minutes to a value of 4, and the duration range of 0 to 5 minutes to a value of 3.
Then, under the condition that the video duration of the video to be classified is within the video duration range of 10 minutes to 15 minutes, each second output result may include 5 sub-feature sequence sets with different feature fine granularities, and at this time, the level of the channel pyramid attention network is 5 levels. Under the condition that the video duration of the video to be classified is within the video duration range of 0 minute to 5 minutes, each second output result may include 3 sub-feature sequence sets with different feature fine granularities, and at this time, the level of the channel pyramid attention network is 3 levels.
It can be seen that, in this embodiment, when the channel pyramid attention network is used, the level of the channel pyramid is not completely fixed and unchanged, and the level of the channel pyramid can be flexibly adjusted according to the length of the video duration of the video to be classified, so that the level of the channel pyramid is matched with the video duration of the video to be classified, and thus the classification efficiency and the classification effect are ensured.
Optionally, the number of the first feature sequences is at least two, and the feature types corresponding to each first feature sequence are different.
Here, the number of the first feature sequences may be two, three, four or more, and is not listed here.
In one embodiment, the at least two first feature sequences may include a first target feature sequence, a second target feature sequence, and a third target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
In another embodiment, the at least two first feature sequences may include only the first target feature sequence and the second target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is any one of an image feature type, an optical flow feature type and a voice feature type; the feature type corresponding to the second target feature sequence is any one of an image feature type, an optical flow feature type and a voice feature type.
It should be noted that the first feature sequences with different feature types can be regarded as different modal features of the video to be classified, and multi-modal fusion can be realized by classifying the video by using at least two first feature sequences, so that the robustness and the precision of classification are improved.
Therefore, in this embodiment, based on the multi-modal features of the video to be classified, the video may be classified by using two pyramid attention networks, namely the temporal pyramid attention network and the channel pyramid attention network. Specifically, key features such as image features, optical flow features and voice features of the video are first extracted using a convolutional-neural-network-based model; the first feature sequence of each feature type is then passed through the temporal pyramid attention network and the channel pyramid attention network in turn; the features of the various feature types are then concatenated and fused to obtain a target vector representing the features of the entire video to be classified; finally, the video to be classified is classified through the full-connection network to obtain the probability that it belongs to each category, thereby realizing video classification.
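Putting these pieces together, an end-to-end pass might look like the self-contained sketch below; the helper pyramid_chunks, the stand-in attention weights, the dimensions and the two example modalities are all illustrative assumptions, while the order of operations (temporal pyramid, then channel pyramid, then weighted pooling, then concatenation, then the full-connection network) follows the description above.

# Hypothetical end-to-end sketch of the pipeline described above:
# feature sequences per modality -> temporal pyramid -> channel pyramid ->
# attention-weighted pooling -> concatenation -> fully connected classifier.
import torch
import torch.nn as nn

def pyramid_chunks(x: torch.Tensor, levels: int, dim: int):
    """All segments of a pyramid with 2**(m-1) equal chunks at level m along `dim`."""
    out = []
    for m in range(1, levels + 1):
        out.extend(torch.chunk(x, 2 ** (m - 1), dim=dim))
    return out

def classify_video(modal_sequences, num_classes=5, M=3, N=3, single_label=True):
    """modal_sequences: list of (T, C) first feature sequences (one per modality)."""
    pooled = []
    for seq in modal_sequences:
        for second_seq in pyramid_chunks(seq, M, dim=0):          # temporal pyramid
            for sub_seq in pyramid_chunks(second_seq, N, dim=1):  # channel pyramid
                # illustrative stand-in for learned attention weights over time
                z = torch.softmax(sub_seq.mean(dim=1, keepdim=True), dim=0)
                pooled.append((z * sub_seq).sum(dim=0))           # weighted sum -> vector
    target_vector = torch.cat(pooled, dim=0)                      # splicing operation
    fc = nn.Linear(target_vector.numel(), num_classes)            # full-connection network (untrained here)
    logits = fc(target_vector)
    return torch.softmax(logits, dim=-1) if single_label else torch.sigmoid(logits)

# Example with two modalities (e.g. image and optical-flow feature sequences)
probs = classify_video([torch.randn(8, 64), torch.randn(8, 64)])
print(probs.shape, float(probs.sum()))    # torch.Size([5]) ~1.0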
With the above approach, the temporal pyramid attention network overcomes the weakness of the prior art in not considering temporal information, and the channel pyramid attention network improves overall classification accuracy and efficiency. As a result, the video classification method in this embodiment can obtain good results in single-label, multi-label, short-video, long-video, weakly time-dependent and strongly time-dependent classification scenarios. Using this method also reduces the time needed to train and tune models for different classification scenarios, makes the whole process more concise and intelligent, and saves labor cost.
The following describes a video classification apparatus according to an embodiment of the present invention.
Referring to fig. 6, a block diagram of a video classification apparatus 600 according to an embodiment of the present invention is shown. As shown in fig. 6, the video classification apparatus 600 includes:
a first obtaining module 601, configured to obtain a first feature sequence of a video to be classified; wherein the features in the first feature sequence are arranged in time order;
a second obtaining module 602, configured to input the first feature sequence into the target pyramid attention network, and obtain a first output result output by the target pyramid attention network;
a third obtaining module 603, configured to obtain a target vector according to the first output result;
the classification module 604 is configured to classify the video to be classified according to the target vector.
Optionally, the target pyramid attention network is a temporal pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequences according to the corresponding time scales, the second feature sequences in each feature sequence set are arranged according to the time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer larger than 1.
Optionally, the larger the value of M, the longer the video duration of the video to be classified.
Optionally, the third obtaining module 603 includes:
a first obtaining unit, configured to input each second feature sequence in the first output result into the channel pyramid attention network, so as to obtain a second output result that is output by the channel pyramid attention network and corresponds to each second feature sequence;
the second obtaining unit is used for obtaining a target vector according to a second output result corresponding to each second characteristic sequence;
the second output result corresponding to any second feature sequence comprises N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing a second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer larger than 1.
Optionally, the larger the value of N, the longer the video duration of the video to be classified.
Optionally, any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence included in the second output result;
a second obtaining unit including:
the first obtaining subunit is configured to perform weighted summation on each sub-feature sequence in each second output result according to each sub-feature and corresponding weight in the sub-feature sequence, so as to obtain a corresponding feature vector;
the second obtaining subunit is used for performing splicing operation according to the feature vectors corresponding to all the sub-feature sequences to obtain a spliced vector;
and the determining subunit is used for taking the splicing vector as a target vector.
Optionally, the classification module 604 is specifically configured to:
and inputting the target vector into the full-connection network to obtain a classification result of the video to be classified output by the full-connection network.
Optionally, the number of the first feature sequences is at least two, and the feature types corresponding to each first feature sequence are different.
Optionally, the at least two first feature sequences comprise a first target feature sequence, a second target feature sequence and a third target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
In the embodiment of the invention, in order to classify a video, the first feature sequence of the video to be classified can be input into the target pyramid attention network, a target vector is then obtained according to the first output result output by the target pyramid attention network, and finally the video to be classified can be classified according to the target vector. Therefore, in the embodiment of the invention, video classification can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified need to be analyzed, this effectively improves classification efficiency; moreover, the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.
The following describes an electronic device provided in an embodiment of the present invention.
Referring to fig. 7, a schematic structural diagram of an electronic device 700 according to an embodiment of the present invention is shown. As shown in fig. 7, the electronic device 700 includes: a processor 701, a memory 703, a user interface 704, and a bus interface.
The processor 701 is configured to read the program in the memory 703 and execute the following processes:
obtaining a first characteristic sequence of a video to be classified; wherein the features in the first feature sequence are arranged in time order;
inputting the first characteristic sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
obtaining a target vector according to the first output result;
and classifying the video to be classified according to the target vector.
In FIG. 7, the bus architecture may include any number of interconnected buses and bridges, linking together one or more processors represented by the processor 701 and various circuits of memory represented by the memory 703. The bus architecture may also link together various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further here. The bus interface provides an interface. For different user devices, the user interface 704 may also be an interface capable of connecting to required devices, including but not limited to a keypad, a display, a speaker, a microphone, a joystick, and the like.
The processor 701 is responsible for managing the bus architecture and general processing, and the memory 703 may store data used by the processor 701 in performing operations.
Optionally, the target pyramid attention network is a temporal pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequences according to the corresponding time scales, the second feature sequences in each feature sequence set are arranged according to the time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer larger than 1.
Optionally, the larger the value of M, the longer the video duration of the video to be classified.
Optionally, the processor 701 is specifically configured to:
inputting each second feature sequence in the first output result into the channel pyramid attention network respectively to obtain a second output result which is output by the channel pyramid attention network respectively and corresponds to each second feature sequence;
obtaining a target vector according to a second output result corresponding to each second characteristic sequence;
the second output result corresponding to any second feature sequence comprises N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing a second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer larger than 1.
Optionally, the larger the value of N, the longer the video duration of the video to be classified.
Optionally, any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence included in the second output result;
the processor 701 is specifically configured to:
for each sub-feature sequence in each second output result, performing weighted summation according to each sub-feature and corresponding weight in the sub-feature sequence to obtain a corresponding feature vector;
performing splicing operation according to the feature vectors corresponding to all the sub-feature sequences to obtain spliced vectors;
and taking the splicing vector as a target vector.
Optionally, the processor 701 is specifically configured to:
and inputting the target vector into the full-connection network to obtain a classification result of the video to be classified output by the full-connection network.
Optionally, the number of the first feature sequences is at least two, and the feature types corresponding to each first feature sequence are different.
Optionally, the at least two first feature sequences comprise a first target feature sequence, a second target feature sequence and a third target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
In the embodiment of the invention, in order to classify a video, the first feature sequence of the video to be classified can be input into the target pyramid attention network, a target vector is then obtained according to the first output result output by the target pyramid attention network, and finally the video to be classified can be classified according to the target vector. Therefore, in the embodiment of the invention, video classification can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified need to be analyzed, this effectively improves classification efficiency; moreover, the target pyramid attention network adopts an attention-based method that can extract and fuse the most effective features of the video for video classification, so the accuracy of the classification result can be better ensured.
Preferably, an embodiment of the present invention further provides an electronic device, which includes a processor 701, a memory 703, and a computer program stored in the memory 703 and capable of running on the processor 701, where the computer program, when executed by the processor 701, implements each process of the above-mentioned video classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the video classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (16)

1. A method for video classification, the method comprising:
obtaining a first characteristic sequence of a video to be classified; wherein the features in the first feature sequence are arranged in chronological order;
inputting the first characteristic sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
obtaining a target vector according to the first output result;
classifying the video to be classified according to the target vector;
the target pyramid attention network is a time pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequence according to the corresponding time scale, the second feature sequences in each feature sequence set are arranged according to a time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer greater than 1;
the obtaining a target vector according to the first output result includes:
inputting each second feature sequence in the first output result into a channel pyramid attention network respectively to obtain a second output result which is output by the channel pyramid attention network respectively and corresponds to each second feature sequence;
obtaining a target vector according to a second output result corresponding to each second characteristic sequence;
the second output result corresponding to any one of the second feature sequences includes N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing one second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer greater than 1.
2. The method according to claim 1, wherein the larger the value of M, the longer the video duration of the video to be classified.
3. The method according to claim 1, wherein the larger the value of N, the longer the video duration of the video to be classified.
4. The method according to claim 1, wherein any one of the second output results further includes a weight corresponding to each of the sub-features included in each of the sub-feature sequences;
the obtaining a target vector according to a second output result corresponding to each second feature sequence includes:
for each sub-feature sequence in each second output result, performing weighted summation according to each sub-feature and corresponding weight in the sub-feature sequence to obtain a corresponding feature vector;
performing splicing operation according to the feature vectors corresponding to all the sub-feature sequences to obtain spliced vectors;
and taking the splicing vector as a target vector.
5. The method according to claim 1, wherein the classifying the video to be classified according to the target vector comprises:
and inputting the target vector into a full-connection network to obtain a classification result of the video to be classified output by the full-connection network.
6. The method according to claim 1, wherein the number of the first feature sequences is at least two, and the feature types corresponding to each of the first feature sequences are different.
7. The method of claim 6, wherein at least two of the first feature sequences comprise a first target feature sequence, a second target feature sequence, and a third target feature sequence; wherein:
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
8. An apparatus for video classification, the apparatus comprising:
the first obtaining module is used for obtaining a first characteristic sequence of the video to be classified; wherein the features in the first feature sequence are arranged in chronological order;
a second obtaining module, configured to input the first feature sequence into a target pyramid attention network, and obtain a first output result output by the target pyramid attention network;
a third obtaining module, configured to obtain a target vector according to the first output result;
the classification module is used for classifying the video to be classified according to the target vector;
the target pyramid attention network is a time pyramid attention network;
the first output result comprises M feature sequence sets with different time scales, each feature sequence set is composed of second feature sequences obtained by dividing the first feature sequence according to the corresponding time scale, the second feature sequences in each feature sequence set are arranged according to a time sequence, the features in each second feature sequence are arranged according to the time sequence, and M is an integer greater than 1;
the third obtaining module includes:
a first obtaining unit, configured to input each second feature sequence in the first output result into a channel pyramid attention network, so as to obtain a second output result, which is output by the channel pyramid attention network and corresponds to each second feature sequence;
a second obtaining unit, configured to obtain a target vector according to a second output result corresponding to each second feature sequence;
the second output result corresponding to any one of the second feature sequences includes N sub-feature sequence sets with different feature fine granularities, each sub-feature sequence set is composed of sub-feature sequences obtained by dividing one second feature sequence according to the corresponding feature fine granularity, sub-features in each sub-feature sequence are arranged according to a time sequence, and N is an integer greater than 1.
9. The apparatus of claim 8, wherein the value of M is positively correlated with the video duration of the video to be classified.
10. The apparatus of claim 8, wherein the value of N is positively correlated with the video duration of the video to be classified.
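Claims 9 and 10 only require that larger M and N go with longer videos; a hypothetical mapping from duration to pyramid depths might look like the following, where every threshold is invented for illustration.

    def choose_pyramid_depths(duration_seconds):
        # Claims 9 and 10 only require M and N to grow with the video duration;
        # the cut-off values below are invented for illustration.
        if duration_seconds < 60:
            return 2, 2   # M, N for short videos
        if duration_seconds < 600:
            return 3, 3
        return 4, 4

    print(choose_pyramid_depths(30))    # (2, 2)
    print(choose_pyramid_depths(1200))  # (4, 4)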
11. The apparatus according to claim 8, wherein any one of the second output results further includes a weight corresponding to each of the sub-features included in each of the sub-feature sequences;
the second obtaining unit includes:
a first obtaining subunit, configured to perform, for each sub-feature sequence in each second output result, a weighted summation of the sub-features in the sub-feature sequence with their corresponding weights to obtain a corresponding feature vector;
a second obtaining subunit, configured to perform a splicing operation on the feature vectors corresponding to all the sub-feature sequences to obtain a spliced vector;
and a determining subunit, configured to take the spliced vector as the target vector.
12. The apparatus of claim 8, wherein the classification module is specifically configured to:
and inputting the target vector into a fully-connected network to obtain a classification result of the video to be classified output by the fully-connected network.
13. The apparatus according to claim 8, wherein the number of first feature sequences is at least two, and the feature types corresponding to the first feature sequences are different from one another.
14. The apparatus of claim 13, wherein the at least two first feature sequences comprise a first target feature sequence, a second target feature sequence, and a third target feature sequence; wherein
the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is a voice feature type.
15. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the video classification method according to any one of claims 1 to 7.
16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the video classification method according to any one of claims 1 to 7.
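Taken together, claims 1 to 7 outline a pipeline from a first feature sequence to a class prediction. The sketch below strings the pieces together in NumPy, with uniform weights standing in for the learned attention of the time and channel pyramid attention networks and random weights for the fully-connected layer; all dimensions, the even splitting scheme, and the uniform weighting are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def time_pyramid(first_seq, M):
        # M feature sequence sets at assumed even time scales.
        return [np.array_split(first_seq, 2 ** m, axis=0) for m in range(M)]

    def channel_pyramid(second_seq, N):
        # N sub-feature sequence sets at assumed even feature fine granularities.
        return [np.array_split(second_seq, 2 ** n, axis=1) for n in range(N)]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def classify_video(first_seq, M=2, N=2, num_classes=5):
        feature_vectors = []
        for second_seqs in time_pyramid(first_seq, M):           # time pyramid stage
            for second_seq in second_seqs:
                for sub_seqs in channel_pyramid(second_seq, N):  # channel pyramid stage
                    for sub_seq in sub_seqs:
                        # Uniform weights stand in for the learned attention weights.
                        w = np.full(sub_seq.shape[0], 1.0 / sub_seq.shape[0])
                        feature_vectors.append(w @ sub_seq)
        target_vec = np.concatenate(feature_vectors)             # splicing operation
        weight = rng.standard_normal((num_classes, target_vec.shape[0]))
        return softmax(weight @ target_vec)                      # fully-connected classification

    probs = classify_video(rng.random((16, 8)))
    print(int(probs.argmax()))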
CN201910357559.2A 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium Active CN110096617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910357559.2A CN110096617B (en) 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium


Publications (2)

Publication Number Publication Date
CN110096617A CN110096617A (en) 2019-08-06
CN110096617B true CN110096617B (en) 2021-08-10

Family

ID=67446566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910357559.2A Active CN110096617B (en) 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN110096617B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291643B (en) * 2020-01-20 2023-08-22 北京百度网讯科技有限公司 Video multi-label classification method, device, electronic equipment and storage medium
CN111246256B (en) * 2020-02-21 2021-05-25 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN111491187B (en) * 2020-04-15 2023-10-31 腾讯科技(深圳)有限公司 Video recommendation method, device, equipment and storage medium
CN111797800B (en) * 2020-07-14 2024-03-05 中国传媒大学 Video classification method based on content mining
CN112507920B (en) * 2020-12-16 2023-01-24 重庆交通大学 Examination abnormal behavior identification method based on time displacement and attention mechanism

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830212A (en) * 2018-06-12 2018-11-16 北京大学深圳研究生院 A kind of video behavior time shaft detection method
CN109359592A (en) * 2018-10-16 2019-02-19 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of video frame
CN109670453A (en) * 2018-12-20 2019-04-23 杭州东信北邮信息技术有限公司 A method of extracting short video subject

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016054779A1 (en) * 2014-10-09 2016-04-14 Microsoft Technology Licensing, Llc Spatial pyramid pooling networks for image processing
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN108416795B (en) * 2018-03-04 2022-03-18 南京理工大学 Video action identification method based on sorting pooling fusion space characteristics
CN109389055B (en) * 2018-09-21 2021-07-20 西安电子科技大学 Video classification method based on mixed convolution and attention mechanism

Similar Documents

Publication Publication Date Title
CN110096617B (en) Video classification method and device, electronic equipment and computer-readable storage medium
JP7127120B2 (en) Video classification method, information processing method and server, and computer readable storage medium and computer program
CN108334910B (en) Event detection model training method and event detection method
CN110096938B (en) Method and device for processing action behaviors in video
CN111062964B (en) Image segmentation method and related device
CN113095346A (en) Data labeling method and data labeling device
KR102042168B1 (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
CN111783712A (en) Video processing method, device, equipment and medium
CN108052862A (en) Age predictor method and device
US20230004608A1 (en) Method for content recommendation and device
CN112016450A (en) Training method and device of machine learning model and electronic equipment
CN113761359B (en) Data packet recommendation method, device, electronic equipment and storage medium
CN111061867A (en) Text generation method, equipment, storage medium and device based on quality perception
CN115082752A (en) Target detection model training method, device, equipment and medium based on weak supervision
CN113806501B (en) Training method of intention recognition model, intention recognition method and equipment
CN111144567A (en) Training method and device of neural network model
CN112100509B (en) Information recommendation method, device, server and storage medium
CN108665455B (en) Method and device for evaluating image significance prediction result
CN114170484B (en) Picture attribute prediction method and device, electronic equipment and storage medium
CN115063858A (en) Video facial expression recognition model training method, device, equipment and storage medium
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
CN114281942A (en) Question and answer processing method, related equipment and readable storage medium
CN108985456B (en) Number-of-layers-increasing deep learning neural network training method, system, medium, and device
CN115858911A (en) Information recommendation method and device, electronic equipment and computer-readable storage medium
CN113177603A (en) Training method of classification model, video classification method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant