CN110096617A - Video classification methods, device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN110096617A
CN110096617A (application CN201910357559.2A)
Authority
CN
China
Prior art keywords
sequence
feature
video
sub-feature
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910357559.2A
Other languages
Chinese (zh)
Other versions
CN110096617B (en)
Inventor
龙翔
何栋梁
李甫
迟至真
周志超
赵翔
李鑫
文石磊
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910357559.2A priority Critical patent/CN110096617B/en
Publication of CN110096617A publication Critical patent/CN110096617A/en
Application granted granted Critical
Publication of CN110096617B publication Critical patent/CN110096617B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 — Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 — Clustering; Classification
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present invention provides a video classification method, apparatus, electronic device, and computer-readable storage medium. The method comprises: obtaining a first feature sequence of a video to be classified, wherein the features in the first feature sequence are arranged in chronological order; inputting the first feature sequence into a target pyramid attention network, and obtaining a first output result of the target pyramid attention network; obtaining a target vector according to the first output result; and classifying the video to be classified according to the target vector. Compared with the prior art, embodiments of the present invention can effectively improve the efficiency of video classification. Moreover, because the target pyramid attention network uses an attention-based approach, it can extract and fuse the most informative features of the video for classification, which helps guarantee the accuracy of the classification results.

Description

Video classification method, apparatus, electronic device, and computer-readable storage medium
Technical field
Embodiments of the present invention relate to the technical field of video classification, and in particular to a video classification method, apparatus, electronic device, and computer-readable storage medium.
Background
Video classification is one of the most important and fundamental tasks in computer vision. It refers to assigning a video to a predefined category by analyzing and understanding information related to the video. Video classification plays a key role in application scenarios such as video search and video recommendation, and video technologies such as video tagging, video moderation, and video title generation rely heavily on it.
At present, a common video classification approach is to directly input all frames of a video into a device used for video classification and obtain the classification result output by the device. With this approach, all frames of the video have to be analyzed, so the classification efficiency is very low.
Summary of the invention
Embodiments of the present invention provide a video classification method, apparatus, electronic device, and computer-readable storage medium, to solve the problem of the low efficiency of existing video classification approaches.
To solve the above technical problem, the present invention is implemented as follows:
In a first aspect, an embodiment of the present invention provides a video classification method, the method comprising:
obtaining a first feature sequence of a video to be classified, wherein the features in the first feature sequence are arranged in chronological order;
inputting the first feature sequence into a target pyramid attention network, and obtaining a first output result of the target pyramid attention network;
obtaining a target vector according to the first output result;
classifying the video to be classified according to the target vector.
In a second aspect, an embodiment of the present invention provides a video classification apparatus, the apparatus comprising:
a first obtaining module, configured to obtain a first feature sequence of a video to be classified, wherein the features in the first feature sequence are arranged in chronological order;
a second obtaining module, configured to input the first feature sequence into a target pyramid attention network and obtain a first output result of the target pyramid attention network;
a third obtaining module, configured to obtain a target vector according to the first output result;
a classification module, configured to classify the video to be classified according to the target vector.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the above video classification method.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above video classification method.
In embodiments of the present invention, to classify a video, the first feature sequence of the video to be classified can first be input into a target pyramid attention network; a target vector is then obtained according to the first output result of the target pyramid attention network; finally, the video to be classified is classified according to the target vector. It can be seen that, in embodiments of the present invention, the classification of a video can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified must be analyzed, embodiments of the present invention can effectively improve the efficiency of video classification. Moreover, because the target pyramid attention network uses an attention-based approach, it can extract and fuse the most informative features of the video for classification, which helps guarantee the accuracy of the classification results.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a video classification method provided by an embodiment of the present invention;
Fig. 2 is the first schematic diagram of a video classification method provided by an embodiment of the present invention;
Fig. 3 is a sequence diagram of the use of a charging plug;
Fig. 4 is the second schematic diagram of a video classification method provided by an embodiment of the present invention;
Fig. 5 is the third schematic diagram of a video classification method provided by an embodiment of the present invention;
Fig. 6 is a structural block diagram of a video classification apparatus provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The video classification method provided by embodiments of the present invention is described first below.
It should be noted that the video classification method provided by embodiments of the present invention is applied to an electronic device. Specifically, the electronic device may be a server; of course, the type of the electronic device is not limited to a server, and it may be any other type of device capable of performing video classification. Embodiments of the present invention place no restriction on the type of the electronic device.
Referring to Fig. 1, a flowchart of the video classification method provided by an embodiment of the present invention is shown. As shown in Fig. 1, the method includes the following steps:
Step 101: obtain a first feature sequence of a video to be classified, wherein the features in the first feature sequence are arranged in chronological order.
In step 101, the electronic device can use a model based on convolutional neural networks (CNN) to extract key features of the video, thereby obtaining the first feature sequence of the video to be classified, in which the features can be arranged in chronological order from earliest to latest. It can be understood that a CNN is a feedforward neural network (FNN) that involves convolution computation and has a deep structure; CNNs are among the representative algorithms of deep learning.
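Step 101 can be sketched as follows. This is a minimal illustration, not the patent's actual model: the pretrained CNN backbone is replaced by a fixed random projection, and the frame sizes and feature dimension are assumptions chosen for the example.

```python
import numpy as np

def extract_feature_sequence(frames, feature_dim=2048, seed=0):
    """Turn a list of video frames into a time-ordered feature sequence.

    Stand-in for a pretrained CNN backbone: each frame is flattened and
    passed through a fixed random projection. In practice the projection
    would be replaced by a per-frame CNN forward pass.
    """
    rng = np.random.default_rng(seed)
    flat_dim = frames[0].size
    projection = rng.standard_normal((flat_dim, feature_dim)) / np.sqrt(flat_dim)
    # Features are stacked in temporal order: row t corresponds to frame t.
    return np.stack([f.reshape(-1) @ projection for f in frames])

# Eight 32x32 grayscale frames -> a first feature sequence of shape (8, 2048).
frames = [np.random.rand(32, 32) for _ in range(8)]
seq = extract_feature_sequence(frames)
print(seq.shape)  # (8, 2048)
```

The resulting array plays the role of the first feature sequence: one feature vector per sampled frame, in chronological order.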
Step 102: input the first feature sequence into a target pyramid attention network, and obtain a first output result of the target pyramid attention network.
Here, the target pyramid attention network may include only one type of pyramid attention network; for example, it may include only a time pyramid attention network, or only a channel pyramid attention network. Alternatively, the target pyramid attention network may include at least two types of pyramid attention networks; for example, it may include both a time pyramid attention network and a channel pyramid attention network.
If the target pyramid attention network includes at least two types of pyramid attention networks, then in step 102 the first feature sequence can be input into each type of pyramid attention network separately, to obtain the first output result of each type of pyramid attention network, and the subsequent step 103 is performed according to the first output results of each type of pyramid attention network.
Step 103: obtain a target vector according to the first output result.
Here, the target vector is a vector that can represent the features of the entire video to be classified. It should be noted that there are various specific ways to obtain the target vector according to the first output result; for clarity of presentation, examples are given later.
Step 104: classify the video to be classified according to the target vector.
It should be noted that there may be K video categories involved in embodiments of the present invention, denoted B_1, B_2, ..., B_K, where K is an integer greater than 1. After the video to be classified is classified according to the target vector, the classification result obtained by the electronic device may include K probability values, denoted G_1, G_2, ..., G_K, where G_1 is the probability that the video to be classified belongs to category B_1, G_2 is the probability that it belongs to category B_2, and so on, and G_K is the probability that it belongs to category B_K.
In step 104, if single-label classification is performed on the video to be classified, the K probability values G_1, G_2, ..., G_K sum to 1; if multi-label classification is performed, the sum of G_1, G_2, ..., G_K may or may not be 1.
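The single-label versus multi-label distinction above corresponds to the usual softmax-versus-sigmoid output heads. The sketch below is an assumption about how such heads are commonly implemented (the patent does not name these functions): softmax forces the K probabilities to sum to 1, while independent sigmoids leave the sum unconstrained.

```python
import numpy as np

def single_label_probs(logits):
    # Softmax: the K probabilities G_1..G_K are forced to sum to 1.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def multi_label_probs(logits):
    # Independent sigmoids: each G_k lies in (0, 1) but the sum is unconstrained.
    return 1.0 / (1.0 + np.exp(-logits))

logits = np.array([2.0, -1.0, 0.5])  # K = 3 categories
g_single = single_label_probs(logits)
g_multi = multi_label_probs(logits)
print(round(g_single.sum(), 6))  # 1.0
```

For these example logits the sigmoid outputs sum to roughly 1.77, illustrating that multi-label probabilities need not sum to 1.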
In embodiments of the present invention, to classify a video, the first feature sequence of the video to be classified can first be input into a target pyramid attention network; a target vector is then obtained according to the first output result of the target pyramid attention network; finally, the video to be classified is classified according to the target vector. It can be seen that, in embodiments of the present invention, the classification of a video can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified must be analyzed, embodiments of the present invention can effectively improve the efficiency of video classification. Moreover, because the target pyramid attention network uses an attention-based approach, it can extract and fuse the most informative features of the video for classification, which helps guarantee the accuracy of the classification results.
Optionally, classifying the video to be classified according to the target vector comprises:
inputting the target vector into a fully connected network, to obtain the classification result of the video to be classified output by the fully connected network.
Here, the fully connected network can be regarded as a pre-trained classification model stored locally on the electronic device, where the classification model can be trained using the target vectors of a large number of videos as input and the categories of those videos as output. Specifically, the classification model may be trained by the electronic device itself, or it may be trained by another device and then distributed to the electronic device.
In this embodiment, the classification result of the video to be classified can be obtained simply by inputting the target vector into the fully connected network; the operation of obtaining the classification result is therefore very convenient to implement.
Optionally, the target pyramid attention network is a time pyramid attention network;
the first output result includes M feature sequence sets of different time scales, each feature sequence set consisting of the second feature sequences obtained by dividing the first feature sequence according to the corresponding time scale, the second feature sequences in each feature sequence set being arranged in chronological order, and the features in each second feature sequence being arranged in chronological order, where M is an integer greater than 1.
Here, the value of M can be 2, 3, 4, 5, 6, or any integer greater than 6; the possibilities are not enumerated one by one here. In addition, since the time scales of the feature sequence sets differ, the numbers of second feature sequences in different feature sequence sets can be different.
Suppose the first feature sequence of the video to be classified is X^(1) in Fig. 2, which includes the features x_1, x_2, x_3, x_4, x_5, x_6, x_7, and x_8 arranged in chronological order. After X^(1) is input into the time pyramid attention network, the first output result of the time pyramid attention network may include 3 feature sequence sets. That is, the value of M is 3, and the pyramid of the time pyramid attention network can be considered to have 3 levels, for example level 1, level 2, and level 3 in Fig. 2, which correspond to feature sequence sets of different time scales.
Specifically, the feature sequence set corresponding to level 1 can consist of a single second feature sequence, namely X^(1) itself. The feature sequence set corresponding to level 2 can consist of the two second feature sequences X^(2)_1 and X^(2)_2, where X^(2)_1 includes x_1, x_2, x_3, and x_4 arranged in chronological order, and X^(2)_2 includes x_5, x_6, x_7, and x_8 arranged in chronological order. The feature sequence set corresponding to level 3 can consist of the four second feature sequences X^(3)_1, X^(3)_2, X^(3)_3, and X^(3)_4, where X^(3)_1 includes x_1 and x_2, X^(3)_2 includes x_3 and x_4, X^(3)_3 includes x_5 and x_6, and X^(3)_4 includes x_7 and x_8, each arranged in chronological order.
It is easy to see that the feature sequence set corresponding to level 1 is obtained by keeping X^(1) as one part in time, the feature sequence set corresponding to level 2 is obtained by dividing X^(1) into two equal parts in time, and the feature sequence set corresponding to level 3 is obtained by dividing X^(1) into four equal parts in time.
In this way, the electronic device can obtain a first output result that includes the feature sequence set corresponding to level 1, the feature sequence set corresponding to level 2, and the feature sequence set corresponding to level 3. Next, the target vector can be obtained according to the first output result, and the video to be classified is classified according to the target vector.
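The division into 1, 2, and 4 equal parts in time can be sketched as follows. This is an illustrative reconstruction of the splitting step only (the attention computation itself is omitted), under the assumption that the sequence length is divisible at every level.

```python
def time_pyramid(sequence, levels=3):
    """Split a time-ordered feature sequence into M sets of sub-sequences.

    Level l (1-based) cuts the sequence into 2**(l-1) equal contiguous
    parts: level 1 keeps the whole sequence, level 2 halves it, level 3
    quarters it. Assumes len(sequence) is divisible by 2**(levels-1).
    """
    pyramid = []
    for level in range(levels):
        parts = 2 ** level
        size = len(sequence) // parts
        # Contiguous slices preserve the chronological order of features.
        pyramid.append([sequence[i * size:(i + 1) * size] for i in range(parts)])
    return pyramid

seq = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
levels = time_pyramid(seq)
print([len(level) for level in levels])  # [1, 2, 4]
print(levels[2][0])  # ['x1', 'x2']
```

Level 1 here corresponds to X^(1), level 2 to {X^(2)_1, X^(2)_2}, and level 3 to {X^(3)_1, ..., X^(3)_4} in the example above.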
It should be noted that if the temporal order of the video is ignored when classifying the video, all features are placed in one unordered set and treated equally, completely ignoring the temporal relationships between features. This is effective in some scenarios but ineffective in others. For example, as shown in Fig. 3, if all key features are placed unordered in one set, it is impossible to distinguish whether the user's action is inserting a charging plug into a socket or pulling the charging plug out of the socket.
In view of this, in this embodiment a time pyramid attention network can be used: the first feature sequence of the video to be classified is first divided into several second feature sequences at a given time scale, and the feature sequence set for that time scale is then obtained, in which the second feature sequences are arranged in chronological order and the features within each second feature sequence are arranged in chronological order. In this way, temporal order can be introduced into an otherwise unordered attention mechanism, effectively solving video classification problems with strong temporal dependence. It can be seen that this embodiment is applicable not only to video classification in weak-temporal-dependence scenarios but also to video classification in strong-temporal-dependence scenarios.
Optionally, the larger the value of M, the longer the duration of the video to be classified.
Specifically, a correspondence between video duration ranges and values of M can be pre-stored in the electronic device; for example, the duration range of 10 to 15 minutes can correspond to the value 5, the duration range of 5 to 10 minutes can correspond to the value 4, and the duration range of 0 to 5 minutes can correspond to the value 3.
Then, when the duration of the video to be classified falls in the 10-to-15-minute range, the first output result may include 5 feature sequence sets of different time scales; that is, the time pyramid of the time pyramid attention network has 5 levels. When the duration falls in the 0-to-5-minute range, the first output result may include 3 feature sequence sets of different time scales; that is, the time pyramid has 3 levels.
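The duration-to-M correspondence amounts to a simple lookup. The sketch below hardcodes the example ranges from the text; the boundary handling and the behavior beyond 15 minutes are assumptions made for the illustration.

```python
def pyramid_levels_for_duration(minutes):
    """Map video duration (in minutes) to the number of pyramid levels M.

    Mirrors the example correspondence in the text: longer videos get
    more time-scale levels. Values beyond 15 minutes are capped at the
    largest configured level (an assumption, not stated in the text).
    """
    if minutes <= 5:
        return 3
    if minutes <= 10:
        return 4
    return 5

print(pyramid_levels_for_duration(3))   # 3
print(pyramid_levels_for_duration(12))  # 5
```

The same pattern applies later to the channel pyramid parameter N, which uses an analogous duration lookup.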
As it can be seen that in the present embodiment, when in use between pyramid attention network when, time pyramidal level is not complete Changeless, time pyramidal level can neatly be adjusted according to the length of the video length of video to be sorted, So that time pyramidal level matches with the video length of video to be sorted, to guarantee classification effectiveness and classification effect Fruit.
Optionally, obtaining the target vector according to the first output result comprises:
inputting each second feature sequence in the first output result into a channel pyramid attention network, to obtain a second output result corresponding to each second feature sequence output by the channel pyramid attention network;
obtaining the target vector according to the second output results corresponding to the second feature sequences;
wherein the second output result corresponding to any second feature sequence includes N sub-feature sequence sets of different feature granularities, each sub-feature sequence set consisting of the sub-feature sequences obtained by splitting the second feature sequence according to the corresponding feature granularity, the sub-features in each sub-feature sequence being arranged in chronological order, where N is an integer greater than 1.
Here, the value of N can be 2, 3, 4, 5, 6, or any integer greater than 6; the possibilities are not enumerated one by one here. In addition, the values of M and N may be the same or different.
Suppose the channel pyramid attention network is denoted CPAtt. As shown in Fig. 2, after obtaining the first output result including the 7 second feature sequences X^(1), X^(2)_1, X^(2)_2, X^(3)_1, X^(3)_2, X^(3)_3, and X^(3)_4, the electronic device can input each of these 7 second feature sequences into CPAtt, to obtain the 7 second output results corresponding to the 7 second feature sequences output by CPAtt.
Suppose a certain one of the above 7 second feature sequences can also be denoted X^(1) in Fig. 4, and this X^(1) includes the features x_1, x_2, ..., x_L arranged in chronological order. After this X^(1) is input into the channel pyramid attention network, the second output result of the channel pyramid attention network may include 3 sub-feature sequence sets of different feature granularities. That is, the value of N is 3, and the pyramid of the channel pyramid attention network can be considered to have 3 levels, for example level 1, level 2, and level 3 in Fig. 4, which correspond to sub-feature sequence sets of different feature granularities.
Specifically, the sub-feature sequence set corresponding to level 1 consists of one sub-feature sequence, namely X^(1) itself. The sub-feature sequence set corresponding to level 2 can consist of the two sub-feature sequences X^(2)_1 and X^(2)_2, where X^(2)_1 includes one of the two sub-features obtained by splitting x_1, one of the two sub-features obtained by splitting x_2, ..., and one of the two sub-features obtained by splitting x_L, and X^(2)_2 includes the other of the two sub-features obtained by splitting each of x_1, x_2, ..., x_L. The sub-feature sequence set corresponding to level 3 can consist of the four sub-feature sequences X^(3)_1, X^(3)_2, X^(3)_3, and X^(3)_4, where X^(3)_1 includes the first of the four sub-features obtained by splitting each of x_1, x_2, ..., x_L, X^(3)_2 includes the second of the four sub-features obtained by splitting each of x_1, x_2, ..., x_L, and the contents of X^(3)_3 and X^(3)_4 follow by analogy, which will not be repeated here.
It should be noted that the contents of the second output results corresponding to the other second feature sequences can be understood with reference to the above description and will not be repeated here. Afterwards, the target vector can be obtained according to the second output results corresponding to the second feature sequences.
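The coarse-to-fine channel splitting can be sketched in the same way as the time pyramid, but along the channel dimension of each feature rather than along time. The function below is an illustrative reconstruction, under the assumption that each feature vector is split into contiguous, equal-width channel chunks.

```python
import numpy as np

def channel_pyramid(features, levels=3):
    """Split each feature vector's channels into progressively finer parts.

    features: array of shape (T, C), one C-channel feature per time step.
    Level l splits the C channels into 2**(l-1) contiguous chunks, giving
    sub-feature sequences that keep the original temporal order. Assumes
    C is divisible by 2**(levels-1).
    """
    T, C = features.shape
    pyramid = []
    for level in range(levels):
        parts = 2 ** level
        width = C // parts
        # Each sub-feature sequence has shape (T, width).
        pyramid.append([features[:, i * width:(i + 1) * width] for i in range(parts)])
    return pyramid

feats = np.arange(8 * 4, dtype=float).reshape(8, 4)  # L=8 features, 4 channels each
pyr = channel_pyramid(feats)
print([len(lvl) for lvl in pyr])  # [1, 2, 4]
print(pyr[1][0].shape)            # (8, 2)
```

Level 1 keeps each feature whole (X^(1)), level 2 halves the channels of every feature (X^(2)_1, X^(2)_2), and level 3 quarters them (X^(3)_1, ..., X^(3)_4); concatenating the finest chunks channel-wise recovers the original features.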
In a specific embodiment, any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence it contains;
obtaining the target vector according to the second output results corresponding to the second feature sequences comprises:
for each sub-feature sequence in each second output result, computing a weighted sum of its sub-features with their corresponding weights, to obtain a corresponding feature vector;
performing a concatenation operation on the feature vectors corresponding to all the sub-feature sequences, to obtain a concatenated vector;
using the concatenated vector as the target vector.
Specifically, for the above sub-feature sequence X^(2)_1, suppose the sub-features it includes are, in order, x_{1,1}, x_{2,1}, ..., x_{L,1}, where each x_{i,1} is in vector form, and the weight corresponding to x_{1,1} is z_1, the weight corresponding to x_{2,1} is z_2, ..., and the weight corresponding to x_{L,1} is z_L. Then the feature vector y corresponding to X^(2)_1 can be computed using the following formula:
y = x_{1,1}·z_1 + x_{2,1}·z_2 + ... + x_{L,1}·z_L
It should be noted that the feature vectors corresponding to the other sub-feature sequences are computed in the same way as described above for X^(2)_1, which will not be repeated here. After the feature vectors corresponding to all the sub-feature sequences are obtained, these feature vectors can be concatenated to obtain the concatenated vector used as the target vector.
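The weighted-sum formula and the subsequent concatenation can be sketched as follows. The weights z_1, ..., z_L are taken as given (in the patent they are produced by the attention network); the numeric values here are arbitrary examples.

```python
import numpy as np

def attend(sub_sequence, weights):
    """Weighted sum y = x_1*z_1 + ... + x_L*z_L over one sub-feature sequence.

    sub_sequence: array of shape (L, D); weights: array of shape (L,).
    Returns a single pooled vector of shape (D,).
    """
    return np.tensordot(weights, sub_sequence, axes=1)

def pool_and_concat(sub_sequences, weight_sets):
    """Attention-pool each sub-feature sequence, then concatenate the results."""
    pooled = [attend(s, z) for s, z in zip(sub_sequences, weight_sets)]
    return np.concatenate(pooled)

subseq = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # L=3 sub-features, dim 2
z = np.array([0.5, 0.3, 0.2])
y = attend(subseq, z)
print(y)  # [0.7 0.5]

# Pooling two sub-feature sequences and concatenating gives a longer vector.
v = pool_and_concat([subseq, subseq], [z, np.array([1.0, 0.0, 0.0])])
print(v.shape)  # (4,)
```

Repeating this pool-then-concatenate pattern up the levels of Fig. 4 and then Fig. 2 yields the final concatenated vector used as the target vector.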
It should be noted that Att in Fig. 4 can be regarded as the computation of a feature vector, and Concat in Fig. 2 and Fig. 4 can be regarded as the concatenation of vectors. As shown in Fig. 4, the feature vector corresponding to X^(2)_1 and the feature vector corresponding to X^(2)_2 can be concatenated to obtain a first concatenated vector, for example y^(2) in Fig. 4; the feature vectors corresponding to X^(3)_1, X^(3)_2, X^(3)_3, and X^(3)_4 can be concatenated to obtain a second concatenated vector, for example y^(3) in Fig. 4. Next, the feature vector corresponding to X^(1) (for example y^(1) in Fig. 4), the first concatenated vector, and the second concatenated vector are concatenated to obtain a third concatenated vector, which corresponds to one of the above 7 second feature sequences.
Afterwards, the third concatenated vectors corresponding to the other 6 second feature sequences can be obtained in a similar way; that is, the 7 third concatenated vectors corresponding to X^(1), X^(2)_1, X^(2)_2, X^(3)_1, X^(3)_2, X^(3)_3, and X^(3)_4 are finally obtained. At this point, as shown in Fig. 2, the third concatenated vectors corresponding to X^(2)_1 and X^(2)_2 can be concatenated to obtain a fourth concatenated vector, for example y^(2) in Fig. 2; the third concatenated vectors corresponding to X^(3)_1, X^(3)_2, X^(3)_3, and X^(3)_4 can be concatenated to obtain a fifth concatenated vector, for example y^(3) in Fig. 2. Next, the third concatenated vector corresponding to X^(1) (for example y^(1) in Fig. 2), the fourth concatenated vector, and the fifth concatenated vector are concatenated to obtain the concatenated vector used as the target vector.
It should be noted that when classifying a video, if the electronic device directly computes one weight for each feature, the weight of each feature can be used directly for video classification; however, in many cases only some of the channels in the video to be classified are helpful for classification. For example, as shown in Fig. 5, the video to be classified may include the two video frames Frame1 and Frame2, both of which contribute to the classification; however, the important channels in these two frames are clearly different: the important channels of Frame1 correspond to the region enclosed by rectangular box 510, and the important channels of Frame2 correspond to the region enclosed by rectangular box 520. On the basis of Fig. 5, if one weight is assigned to the entire feature of each frame, as shown in the lower-left corner of Fig. 5, only relatively balanced weights can be given to the two frames; for example, the weight assigned to each of the two features Feature1 and Feature2 may be 0.5. In that case, the weight of the irrelevant noise is also 0.5, and after weighted averaging the important channels of the two features are weakened, which lowers the accuracy of video classification.
In view of this, in this embodiment a channel pyramid attention network can be used to progressively split each feature, from coarse to fine, into several sub-features and assign a corresponding weight to each sub-feature. In this way, as shown in the lower-right corner of Fig. 5, the weight of the important part of each feature can be set to 1.0 and the weight of the unimportant part to 0.0; for example, the weight of the sub-feature in the upper half of Feature1 can be set to 1.0 and the weight of the sub-feature in its lower half to 0.0, while the weight of the sub-feature in the upper half of Feature2 can be set to 0.0 and the weight of the sub-feature in its lower half to 1.0. After the subsequent weighting, the important channel information is fully retained, which helps obtain more accurate classification results. It can be seen that in this embodiment, using a channel pyramid attention network can effectively guarantee the accuracy of the classification results.
Optionally, the longer the duration of the video to be classified, the larger the value of N.
Specifically, a correspondence between video duration ranges and values of N can be stored in the electronic device in advance; for example, the duration range of 10 to 15 minutes may correspond to the value 5, the duration range of 5 to 10 minutes may correspond to the value 4, and the duration range of 0 to 5 minutes may correspond to the value 3.
Then, in the case where the duration of the video to be classified falls within the 10-to-15-minute range, each second output result may include 5 sub-feature sequence sets of different feature granularities, and the channel pyramid of the channel pyramid attention network has 5 levels. In the case where the duration falls within the 0-to-5-minute range, each second output result may include 3 sub-feature sequence sets, and the channel pyramid has 3 levels.
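The duration-to-level lookup described above can be sketched as a simple table lookup. The range boundaries (0–5, 5–10, and 10–15 minutes mapping to 3, 4, and 5 levels) follow the example values in the text; the clamping behavior for videos longer than 15 minutes is an assumption, since the text does not specify it.

```python
def pyramid_levels(duration_minutes):
    """Return the channel-pyramid level count N for a video duration."""
    # (upper bound of the duration range in minutes, corresponding N)
    ranges = [(5, 3), (10, 4), (15, 5)]
    for upper, n in ranges:
        if duration_minutes <= upper:
            return n
    return 5  # assumed: clamp longer videos to the deepest pyramid

print(pyramid_levels(3))   # short video -> 3 levels
print(pyramid_levels(12))  # long video  -> 5 levels
```

A stored table of this kind lets the electronic device pick the pyramid depth per video without retraining, which is how the embodiment matches pyramid depth to video duration.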
It can be seen that, in the present embodiment, the number of levels of the channel pyramid is not completely fixed when the channel pyramid attention network is used; it can be flexibly adjusted according to the duration of the video to be classified, so that the number of pyramid levels matches that duration, thereby guaranteeing both classification efficiency and classification quality.
Optionally, there are at least two first feature sequences, and the feature types corresponding to the first feature sequences differ from one another.
Here, the number of first feature sequences may be two, three, four, or more; the possibilities are not enumerated one by one.
In one specific implementation, the at least two first feature sequences may include a first target feature sequence, a second target feature sequence, and a third target feature sequence, wherein the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is an audio feature type.
In another specific implementation, the at least two first feature sequences may include only a first target feature sequence and a second target feature sequence, wherein the feature type corresponding to the first target feature sequence is any one of the image feature type, the optical flow feature type, and the audio feature type, and the feature type corresponding to the second target feature sequence is any one of the image feature type, the optical flow feature type, and the audio feature type.
It should be noted that first feature sequences of different feature types can be regarded as features of different modalities of the video to be classified. Performing video classification with at least two first feature sequences achieves multimodal fusion, which improves the robustness and precision of the classification.
It can be seen that, in the present embodiment, the video can be classified on the basis of the multimodal features of the video to be classified, using both the time pyramid attention network and the channel pyramid attention network. Specifically, key features of the video, such as image features, optical flow features, and audio features, can first be extracted with models based on convolutional neural networks; the first feature sequences of the various feature types are then passed in turn through the time pyramid attention network and the channel pyramid attention network; afterwards, the features of the various types are concatenated and fused to obtain a target vector representing the entire video to be classified; finally, classification is performed by a fully connected network, yielding the probability of the video to be classified belonging to each class, thus achieving the classification of the video.
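The overall flow just described (per-modality feature sequences, attention pooling, concatenation and fusion, fully connected classification) can be sketched as follows. This is a schematic sketch only: the attention weights are uniform placeholders rather than learned ones, the two pyramid attention stages are collapsed into a single pooling step, and all names and values are invented.

```python
import math

def attention_pool(vectors):
    """Weighted sum of frame vectors with placeholder uniform attention."""
    w = 1.0 / len(vectors)
    return [w * sum(v[i] for v in vectors) for i in range(len(vectors[0]))]

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def classify(modal_sequences, fc_weights):
    # One feature sequence per modality (e.g. image, optical flow, audio);
    # here the pyramid attention stages are stood in for by one pooling.
    pooled = [attention_pool(seq) for seq in modal_sequences]
    target = [c for vec in pooled for c in vec]      # concatenation/fusion
    logits = [sum(w * t for w, t in zip(row, target)) for row in fc_weights]
    return softmax(logits)                           # per-class probabilities

# Two modalities, 3 frames each, 2 channels per frame (invented values).
image_seq = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]]
audio_seq = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
fc = [[1.0, 0.0, 0.0, 1.0],   # class 0 weights (hypothetical)
      [0.0, 1.0, 1.0, 0.0]]   # class 1 weights (hypothetical)
probs = classify([image_seq, audio_seq], fc)
print(probs)  # probabilities over the two classes, summing to 1
```

The sketch shows only the data flow of the embodiment; in the patented method the pooling weights come from the time and channel pyramid attention networks and the fully connected layer is trained.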
By the above means, using the time pyramid attention network overcomes the weakness of the prior art in not considering temporal information, and using the channel pyramid attention network improves overall classification accuracy and efficiency. As a result, the video classification method of this embodiment performs well in single-label, multi-label, short-video, long-video, weak-temporal-dependency, and strong-temporal-dependency classification scenarios. Moreover, applying this method reduces the training and tuning time needed for different classification scenarios; the overall pipeline is simpler and saves labor costs.
The video classification apparatus provided by an embodiment of the present invention is described below.
Referring to Fig. 6, a structural block diagram of a video classification apparatus 600 provided by an embodiment of the present invention is shown. As shown in Fig. 6, the video classification apparatus 600 includes:
a first obtaining module 601, configured to obtain a first feature sequence of a video to be classified, wherein the features in the first feature sequence are arranged in chronological order;
a second obtaining module 602, configured to input the first feature sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
a third obtaining module 603, configured to obtain a target vector according to the first output result; and
a classification module 604, configured to classify the video to be classified according to the target vector.
Optionally, the target pyramid attention network is a time pyramid attention network;
the first output result includes M feature sequence sets of different time scales, each feature sequence set being composed of second feature sequences into which the first feature sequence is divided according to the corresponding time scale; the second feature sequences in each feature sequence set are arranged in chronological order, the features in each second feature sequence are arranged in chronological order, and M is an integer greater than 1.
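The time-scale division just described can be sketched as follows. The choice of k chronological segments at scale k (for k = 1 … M) is an illustrative assumption; the text fixes only that the M sets use different time scales and that order within each set is chronological.

```python
def split_into(seq, k):
    """Split seq into k contiguous, chronological, near-equal segments."""
    out, start = [], 0
    for i in range(k):
        end = start + (len(seq) - start) // (k - i)
        out.append(seq[start:end])
        start = end
    return out

def time_pyramid(seq, m):
    """M feature sequence sets, one per time scale (k segments at scale k)."""
    return [split_into(seq, k) for k in range(1, m + 1)]

frames = list(range(8))  # stand-in for 8 per-frame feature vectors
for level in time_pyramid(frames, 3):
    print(level)
```

Each set covers the whole sequence, so coarse scales capture global context while fine scales isolate short temporal spans, which is the intent of a time pyramid.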
Optionally, the longer the duration of the video to be classified, the larger the value of M.
Optionally, the third obtaining module 603 includes:
a first obtaining unit, configured to input each second feature sequence in the first output result into a channel pyramid attention network, respectively, to obtain a second output result corresponding to each second feature sequence, output by the channel pyramid attention network; and
a second obtaining unit, configured to obtain the target vector according to the second output result corresponding to each second feature sequence;
wherein the second output result corresponding to any second feature sequence includes N sub-feature sequence sets of different feature granularities, each sub-feature sequence set being composed of sub-feature sequences into which one second feature sequence is split according to the corresponding feature granularity; the sub-features in each sub-feature sequence are arranged in chronological order, and N is an integer greater than 1.
Optionally, the longer the duration of the video to be classified, the larger the value of N.
Optionally, any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence included therein;
the second obtaining unit includes:
a first obtaining sub-unit, configured to perform, for each sub-feature sequence in each second output result, a weighted summation according to the sub-features therein and the corresponding weights, to obtain a corresponding feature vector;
a second obtaining sub-unit, configured to perform a concatenation operation according to the feature vectors corresponding to all the sub-feature sequences, to obtain a concatenated vector; and
a determining sub-unit, configured to use the concatenated vector as the target vector.
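The two steps implemented by the sub-units above (weighted summation per sub-feature sequence, then concatenation into the target vector) can be sketched as follows; the sequences and weights are invented example values, not learned ones.

```python
def fuse_subfeature_sequence(subfeatures, weights):
    """Weighted sum of a sequence of equally sized sub-feature vectors."""
    return [sum(w * v[i] for v, w in zip(subfeatures, weights))
            for i in range(len(subfeatures[0]))]

def target_vector(subfeature_sequences, weight_lists):
    # One feature vector per sub-feature sequence, then concatenation.
    fused = [fuse_subfeature_sequence(s, w)
             for s, w in zip(subfeature_sequences, weight_lists)]
    return [c for vec in fused for c in vec]

seqs = [[[1.0, 2.0], [3.0, 4.0]],   # sub-feature sequence 1 (2 sub-features)
        [[5.0, 6.0], [7.0, 8.0]]]   # sub-feature sequence 2 (2 sub-features)
wts = [[1.0, 0.0],                  # keep only the first sub-feature
       [0.5, 0.5]]                  # average the two sub-features
print(target_vector(seqs, wts))     # -> [1.0, 2.0, 6.0, 7.0]
```

The concatenated result is what the determining sub-unit passes on as the target vector for the fully connected classifier.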
Optionally, the classification module 604 is specifically configured to: input the target vector into a fully connected network to obtain a classification result of the video to be classified, output by the fully connected network.
Optionally, there are at least two first feature sequences, and the feature types corresponding to the first feature sequences differ from one another.
Optionally, the at least two first feature sequences include a first target feature sequence, a second target feature sequence, and a third target feature sequence, wherein the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is an audio feature type.
In the embodiment of the present invention, to classify a video, the first feature sequence of the video to be classified can first be input into a target pyramid attention network; a target vector is then obtained according to the first output result output by the target pyramid attention network; finally, the video to be classified is classified according to the target vector. It can be seen that, in the embodiment of the present invention, the classification of the video can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified must be analyzed, the embodiment of the present invention effectively improves classification efficiency; moreover, the target pyramid attention network can extract and fuse, in an attention-based manner, the features most effective for the video for use in classification, which better guarantees the accuracy of the classification results.
The electronic device provided by an embodiment of the present invention is described below.
Referring to Fig. 7, a structural schematic diagram of an electronic device 700 provided by an embodiment of the present invention is shown. As shown in Fig. 7, the electronic device 700 includes: a processor 701, a memory 703, a user interface 704, and a bus interface.
The processor 701 is configured to read a program in the memory 703 and execute the following process:
obtaining a first feature sequence of a video to be classified, wherein the features in the first feature sequence are arranged in chronological order;
inputting the first feature sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
obtaining a target vector according to the first output result; and
classifying the video to be classified according to the target vector.
In Fig. 7, the bus architecture may include any number of interconnected buses and bridges, linking together one or more processors represented by the processor 701 and various circuits of the memory represented by the memory 703. The bus architecture may also link together various other circuits, such as peripheral devices, voltage regulators, and power management circuits; all of these are well known in the art and are therefore not further described herein. The bus interface provides an interface. For different user devices, the user interface 704 may also be an interface for externally or internally connecting needed devices, the connected devices including but not limited to a keypad, a display, a loudspeaker, a microphone, a joystick, and the like.
The processor 701 is responsible for managing the bus architecture and for general processing, and the memory 703 may store data used by the processor 701 when performing operations.
Optionally, the target pyramid attention network is a time pyramid attention network;
the first output result includes M feature sequence sets of different time scales, each feature sequence set being composed of second feature sequences into which the first feature sequence is divided according to the corresponding time scale; the second feature sequences in each feature sequence set are arranged in chronological order, the features in each second feature sequence are arranged in chronological order, and M is an integer greater than 1.
Optionally, the longer the duration of the video to be classified, the larger the value of M.
Optionally, the processor 701 is specifically configured to:
input each second feature sequence in the first output result into a channel pyramid attention network, respectively, to obtain a second output result corresponding to each second feature sequence, output by the channel pyramid attention network; and
obtain the target vector according to the second output result corresponding to each second feature sequence;
wherein the second output result corresponding to any second feature sequence includes N sub-feature sequence sets of different feature granularities, each sub-feature sequence set being composed of sub-feature sequences into which one second feature sequence is split according to the corresponding feature granularity; the sub-features in each sub-feature sequence are arranged in chronological order, and N is an integer greater than 1.
Optionally, the longer the duration of the video to be classified, the larger the value of N.
Optionally, any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence included therein;
the processor 701 is specifically configured to:
for each sub-feature sequence in each second output result, perform a weighted summation according to the sub-features therein and the corresponding weights, to obtain a corresponding feature vector;
perform a concatenation operation according to the feature vectors corresponding to all the sub-feature sequences, to obtain a concatenated vector; and
use the concatenated vector as the target vector.
Optionally, the processor 701 is specifically configured to: input the target vector into a fully connected network to obtain a classification result of the video to be classified, output by the fully connected network.
Optionally, there are at least two first feature sequences, and the feature types corresponding to the first feature sequences differ from one another.
Optionally, the at least two first feature sequences include a first target feature sequence, a second target feature sequence, and a third target feature sequence, wherein the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is an audio feature type.
In the embodiment of the present invention, to classify a video, the first feature sequence of the video to be classified can first be input into a target pyramid attention network; a target vector is then obtained according to the first output result output by the target pyramid attention network; finally, the video to be classified is classified according to the target vector. It can be seen that, in the embodiment of the present invention, the classification of the video can be achieved using the first feature sequence of the video to be classified and the target pyramid attention network. Compared with the prior art, in which all frames of the video to be classified must be analyzed, the embodiment of the present invention effectively improves classification efficiency; moreover, the target pyramid attention network can extract and fuse, in an attention-based manner, the features most effective for the video for use in classification, which better guarantees the accuracy of the classification results.
Preferably, an embodiment of the present invention also provides an electronic device, including a processor 701, a memory 703, and a computer program stored in the memory 703 and executable on the processor 701. When the computer program is executed by the processor 701, the processes of the above video classification method embodiment are implemented and the same technical effects can be achieved; to avoid repetition, they are not described again here.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the processes of the above video classification method embodiment are implemented and the same technical effects can be achieved; to avoid repetition, they are not described again here. The computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the invention is not limited to the above specific embodiments, which are merely illustrative rather than restrictive. Under the inspiration of the present invention, those skilled in the art can devise many further forms without departing from the scope protected by the purpose of the present invention and the claims, all of which fall within the protection of the invention.

Claims (20)

1. A video classification method, characterized in that the method comprises:
obtaining a first feature sequence of a video to be classified, wherein features in the first feature sequence are arranged in chronological order;
inputting the first feature sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
obtaining a target vector according to the first output result; and
classifying the video to be classified according to the target vector.
2. The method according to claim 1, characterized in that:
the target pyramid attention network is a time pyramid attention network; and
the first output result includes M feature sequence sets of different time scales, each feature sequence set being composed of second feature sequences into which the first feature sequence is divided according to the corresponding time scale, wherein the second feature sequences in each feature sequence set are arranged in chronological order, the features in each second feature sequence are arranged in chronological order, and M is an integer greater than 1.
3. The method according to claim 2, characterized in that the longer the duration of the video to be classified, the larger the value of M.
4. The method according to claim 2, characterized in that obtaining the target vector according to the first output result comprises:
inputting each second feature sequence in the first output result into a channel pyramid attention network, respectively, to obtain a second output result corresponding to each second feature sequence, output by the channel pyramid attention network; and
obtaining the target vector according to the second output result corresponding to each second feature sequence;
wherein the second output result corresponding to any second feature sequence includes N sub-feature sequence sets of different feature granularities, each sub-feature sequence set being composed of sub-feature sequences into which one second feature sequence is split according to the corresponding feature granularity, the sub-features in each sub-feature sequence are arranged in chronological order, and N is an integer greater than 1.
5. The method according to claim 4, characterized in that the longer the duration of the video to be classified, the larger the value of N.
6. The method according to claim 4, characterized in that any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence included therein; and
obtaining the target vector according to the second output result corresponding to each second feature sequence comprises:
for each sub-feature sequence in each second output result, performing a weighted summation according to the sub-features therein and the corresponding weights, to obtain a corresponding feature vector;
performing a concatenation operation according to the feature vectors corresponding to all the sub-feature sequences, to obtain a concatenated vector; and
using the concatenated vector as the target vector.
7. The method according to claim 1, characterized in that classifying the video to be classified according to the target vector comprises:
inputting the target vector into a fully connected network to obtain a classification result of the video to be classified, output by the fully connected network.
8. The method according to claim 1, characterized in that there are at least two first feature sequences, and the feature types corresponding to the first feature sequences differ from one another.
9. The method according to claim 8, characterized in that the at least two first feature sequences include a first target feature sequence, a second target feature sequence, and a third target feature sequence, wherein the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is an audio feature type.
10. A video classification apparatus, characterized in that the apparatus comprises:
a first obtaining module, configured to obtain a first feature sequence of a video to be classified, wherein features in the first feature sequence are arranged in chronological order;
a second obtaining module, configured to input the first feature sequence into a target pyramid attention network to obtain a first output result output by the target pyramid attention network;
a third obtaining module, configured to obtain a target vector according to the first output result; and
a classification module, configured to classify the video to be classified according to the target vector.
11. The apparatus according to claim 10, characterized in that:
the target pyramid attention network is a time pyramid attention network; and
the first output result includes M feature sequence sets of different time scales, each feature sequence set being composed of second feature sequences into which the first feature sequence is divided according to the corresponding time scale, wherein the second feature sequences in each feature sequence set are arranged in chronological order, the features in each second feature sequence are arranged in chronological order, and M is an integer greater than 1.
12. The apparatus according to claim 11, characterized in that the longer the duration of the video to be classified, the larger the value of M.
13. The apparatus according to claim 11, characterized in that the third obtaining module comprises:
a first obtaining unit, configured to input each second feature sequence in the first output result into a channel pyramid attention network, respectively, to obtain a second output result corresponding to each second feature sequence, output by the channel pyramid attention network; and
a second obtaining unit, configured to obtain the target vector according to the second output result corresponding to each second feature sequence;
wherein the second output result corresponding to any second feature sequence includes N sub-feature sequence sets of different feature granularities, each sub-feature sequence set being composed of sub-feature sequences into which one second feature sequence is split according to the corresponding feature granularity, the sub-features in each sub-feature sequence are arranged in chronological order, and N is an integer greater than 1.
14. The apparatus according to claim 13, characterized in that the longer the duration of the video to be classified, the larger the value of N.
15. The apparatus according to claim 13, characterized in that any second output result further includes a weight corresponding to each sub-feature in each sub-feature sequence included therein; and
the second obtaining unit comprises:
a first obtaining sub-unit, configured to perform, for each sub-feature sequence in each second output result, a weighted summation according to the sub-features therein and the corresponding weights, to obtain a corresponding feature vector;
a second obtaining sub-unit, configured to perform a concatenation operation according to the feature vectors corresponding to all the sub-feature sequences, to obtain a concatenated vector; and
a determining sub-unit, configured to use the concatenated vector as the target vector.
16. The apparatus according to claim 10, characterized in that the classification module is specifically configured to:
input the target vector into a fully connected network to obtain a classification result of the video to be classified, output by the fully connected network.
17. The apparatus according to claim 10, characterized in that there are at least two first feature sequences, and the feature types corresponding to the first feature sequences differ from one another.
18. The apparatus according to claim 17, characterized in that the at least two first feature sequences include a first target feature sequence, a second target feature sequence, and a third target feature sequence, wherein the feature type corresponding to the first target feature sequence is an image feature type, the feature type corresponding to the second target feature sequence is an optical flow feature type, and the feature type corresponding to the third target feature sequence is an audio feature type.
19. An electronic device, characterized by comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein when the computer program is executed by the processor, the steps of the video classification method according to any one of claims 1 to 9 are implemented.
20. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the video classification method according to any one of claims 1 to 9 are implemented.
CN201910357559.2A 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium Active CN110096617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910357559.2A CN110096617B (en) 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910357559.2A CN110096617B (en) 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN110096617A true CN110096617A (en) 2019-08-06
CN110096617B CN110096617B (en) 2021-08-10

Family

ID=67446566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910357559.2A Active CN110096617B (en) 2019-04-29 2019-04-29 Video classification method and device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN110096617B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN111291643A (en) * 2020-01-20 2020-06-16 北京百度网讯科技有限公司 Video multi-label classification method and device, electronic equipment and storage medium
CN111491187A (en) * 2020-04-15 2020-08-04 腾讯科技(深圳)有限公司 Video recommendation method, device, equipment and storage medium
CN111797800A (en) * 2020-07-14 2020-10-20 中国传媒大学 Video classification method based on content mining
CN112507920A (en) * 2020-12-16 2021-03-16 重庆交通大学 Examination abnormal behavior identification method based on time displacement and attention mechanism

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105917354A (en) * 2014-10-09 2016-08-31 微软技术许可有限责任公司 Spatial pyramid pooling networks for image processing
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN108416795A (en) * 2018-03-04 2018-08-17 南京理工大学 The video actions recognition methods of space characteristics is merged based on sequence pondization
CN108830212A (en) * 2018-06-12 2018-11-16 北京大学深圳研究生院 A kind of video behavior time shaft detection method
CN109359592A (en) * 2018-10-16 2019-02-19 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of video frame
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109670453A (en) * 2018-12-20 2019-04-23 杭州东信北邮信息技术有限公司 A method of extracting short video subject


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU Chunhai: "Visual saliency-driven video segmentation algorithm for moving fish bodies", Journal of Yanshan University *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291643A (en) * 2020-01-20 2020-06-16 Beijing Baidu Netcom Science and Technology Co., Ltd. Video multi-label classification method and device, electronic device, and storage medium
CN111291643B (en) * 2020-01-20 2023-08-22 Beijing Baidu Netcom Science and Technology Co., Ltd. Video multi-label classification method and device, electronic device, and storage medium
CN111246256A (en) * 2020-02-21 2020-06-05 South China University of Technology Video recommendation method based on multi-modal video content and multi-task learning
CN111491187A (en) * 2020-04-15 2020-08-04 Tencent Technology (Shenzhen) Co., Ltd. Video recommendation method, device, equipment, and storage medium
CN111491187B (en) * 2020-04-15 2023-10-31 Tencent Technology (Shenzhen) Co., Ltd. Video recommendation method, device, equipment, and storage medium
CN111797800A (en) * 2020-07-14 2020-10-20 Communication University of China Video classification method based on content mining
CN111797800B (en) * 2020-07-14 2024-03-05 Communication University of China Video classification method based on content mining
CN112507920A (en) * 2020-12-16 2021-03-16 Chongqing Jiaotong University Method for identifying abnormal examination behavior based on temporal shift and attention mechanism
CN112507920B (en) * 2020-12-16 2023-01-24 Chongqing Jiaotong University Method for identifying abnormal examination behavior based on temporal shift and attention mechanism

Also Published As

Publication number Publication date
CN110096617B (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN110096617A (en) Video classification methods, device, electronic equipment and computer readable storage medium
Wang et al. SaliencyGAN: Deep learning semisupervised salient object detection in the fog of IoT
Wang et al. A deep network solution for attention and aesthetics aware photo cropping
Li et al. CNNPruner: Pruning convolutional neural networks with visual analytics
CN110147711A (en) Video scene recognition method, device, storage medium, and electronic device
CN110348387A (en) An image processing method, device, and computer-readable storage medium
CN104933428B (en) A face recognition method and device based on tensor description
CN109145784A (en) Method and apparatus for processing video
CN110353675A (en) EEG signal emotion recognition method and device based on image generation
CN105893478A (en) Tag extraction method and device
CN110503076A (en) Video classification method, device, equipment, and medium based on artificial intelligence
CN110378348A (en) Video instance segmentation method, device, and computer-readable storage medium
CN109325516A (en) An ensemble learning method and device for image classification
CN106529996A (en) Deep learning-based advertisement display method and device
CN105989067A (en) Method for generating a text summary from an image, user equipment, and training server
CN110472050A (en) A group clustering method and device
CN114913303A (en) Virtual image generation method and related device, electronic device, and storage medium
CN112508048A (en) Image description generation method and device
CN114360018B (en) Method and device for rendering three-dimensional facial expressions, storage medium, and electronic device
CN111368707A (en) Face detection method, system, device, and medium based on feature pyramid and dense blocks
CN109409305A (en) A facial image sharpness evaluation method and device
CN116701706B (en) Data processing method, device, equipment, and medium based on artificial intelligence
CN111046213B (en) Knowledge base construction method based on image recognition
CN110287761A (en) A face age estimation method based on convolutional neural networks and latent variable analysis
CN109658369A (en) Intelligent video generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant