CN115861901A - Video classification method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115861901A
Authority
CN
China
Prior art keywords
feature
channel
sub
video
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211721670.3A
Other languages
Chinese (zh)
Other versions
CN115861901B (en)
Inventor
骆剑平
杨玉琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202211721670.3A priority Critical patent/CN115861901B/en
Publication of CN115861901A publication Critical patent/CN115861901A/en
Application granted granted Critical
Publication of CN115861901B publication Critical patent/CN115861901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the disclosure provides a video classification method, device, equipment and storage medium. The method comprises the following steps: acquiring a video to be classified, the content of which comprises behavior actions of at least one target object; and inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified. The target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, which are in cascade connection. By means of the two-way excitation channel grouping layer, this embodiment avoids the heavy time consumption and storage occupation of optical flow calculation as well as the difficulty of training the branches of a multi-stream network separately; the amount of computation is greatly reduced, while the inference speed and the classification accuracy are further improved.

Description

Video classification method, device, equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the field of artificial intelligence, in particular to a video classification method, a video classification device, video classification equipment and a storage medium.
Background
One of the goals of artificial intelligence is to build machines that can accurately understand human behavior and intent so as to better serve humans. Building a model that can understand human behavior is precisely the problem that behavior recognition studies and discusses.
When human behavior recognition is performed on a video, factors such as the richness and complexity of human behavior, view occlusion and background clutter make recognizing human behavior in a video more difficult and challenging than recognizing it in a single image. Deep learning is one of the mainstream techniques for human behavior recognition. At present, mainstream human behavior recognition technologies based on deep learning can be divided into two types: one independently learns spatial features, continuous optical-flow features and the like through a two-stream network and performs feature fusion at a later stage; the other models the time dimension with high-dimensional convolution to extract contextual relationship information between adjacent frames of the video.
However, in a multi-stream network each branch extracts features independently and feature fusion is performed only afterwards, so the training is not end-to-end and is difficult; moreover, calculating inter-frame optical flow information is time-consuming, and the extracted optical flow features must be stored on disk, which imposes high storage and calculation costs. High-dimensional convolution, such as 3-dimensional convolution, itself involves a large number of parameters and calculations and can only learn local information of the video. In practical application, directly extracting behavior features through a 3-dimensional convolutional neural network easily causes problems such as gradient vanishing, gradient explosion and overfitting.
Disclosure of Invention
The embodiment of the disclosure provides a video classification method, a video classification device, video classification equipment and a storage medium, which can improve the speed and the precision of video classification.
In a first aspect, an embodiment of the present disclosure provides a video classification method, including: acquiring a video to be classified; the content of the video to be classified comprises behavior actions of at least one target object; inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are in cascade connection.
In a second aspect, an embodiment of the present disclosure further provides a video classification apparatus, including: the video to be classified acquisition module is used for acquiring videos to be classified; the content of the video to be classified comprises behavior actions of at least one target object; the action classification result obtaining module is used for inputting the first video frame corresponding to the video to be classified into a target video classification model and obtaining an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are in cascade connection.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video classification method as described in the embodiments of the disclosure.
In a fourth aspect, the disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are used to perform a video classification method according to the disclosed embodiments.
According to the technical scheme of the embodiment of the disclosure, a video to be classified is acquired, the content of which comprises behavior actions of at least one target object; a first video frame corresponding to the video to be classified is input into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, which are in cascade connection. Through the two-way excitation channel grouping layer, the embodiment of the disclosure not only utilizes the key motion information between video frames, the time dependency between channels and the long-distance video spatio-temporal information, but also realizes end-to-end efficient video classification with a small number of input frames. By means of the two-way excitation channel grouping layer, the heavy time consumption and storage occupation of optical flow calculation as well as the difficulty of training the branches of a multi-stream network separately are avoided; the amount of computation is greatly reduced, while the inference speed and the classification accuracy are further improved.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a video classification method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a video classification method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a network structure of a bottleneck unit according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a video classification apparatus according to an embodiment of the disclosure;
fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that references to "a", "an" and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.
It will be appreciated that the data involved in the subject technology, including but not limited to the data itself, the acquisition or use of the data, should comply with the requirements of the corresponding laws and regulations and related regulations.
Fig. 1 is a flowchart illustrating a video classification method provided by an embodiment of the present disclosure, where the embodiment of the present disclosure is applicable to a video classification situation, for example, classifying behaviors of a target object in a video, and the method may be executed by a video classification apparatus, where the apparatus may be implemented in a form of software and/or hardware, and optionally, implemented by an electronic device, where the electronic device may be a mobile terminal, a PC terminal, or a server.
As shown in fig. 1, the method includes:
and S110, acquiring the video to be classified.
The content of the video to be classified comprises behavior actions of at least one target object. The target object may be a person, an animal, or the like. Taking a person as the target object as an example, the behavior action of the target object may be a "door opening" action, a "door closing" action, or the like. This embodiment limits neither the number of behavior actions nor the type of behavior actions.
S120, inputting the first video frame corresponding to the video to be classified into the target video classification model, and obtaining an action classification result corresponding to the video to be classified.
The target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are in cascade connection.
In this embodiment, the duration of the video to be classified may be arbitrary, which is not limited in this embodiment, and the number of first video frames may also be arbitrary, which is likewise not limited. In this embodiment, frames may be extracted from the video to be classified by a script to obtain the first video frame. For example, if the duration of the video to be classified is 3 seconds, the first video frame may be 80 frames. After the first video frame is obtained, the first video frame is input into the target video classification model to obtain the action classification result corresponding to the video to be classified.
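As an illustration of this frame-extraction step, a script along the following lines could be used; this OpenCV-based sketch, the file name and the helper name are assumptions chosen for illustration and are not part of the disclosure:

```python
import cv2

def extract_frames(video_path):
    """Decode every frame of a video file into a list of images (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

# A 3-second clip might yield on the order of 80 frames, which would then serve
# as the first video frame described above (file name is hypothetical).
first_video_frames = extract_frames("video_to_classify.mp4")
```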
The target video classification model may be a model based on a Temporal Segment Network (TSN), whose input is only video frames (images); the backbone network of the TSN may be a ResNet50 network. It should be noted that the two-way excitation channel grouping layer can be understood as an improvement on the ResNet50 network.
Optionally, the step of inputting the first video frame corresponding to the video to be classified into the target video classification model to obtain the action classification result corresponding to the video to be classified includes: the sparse sampling layer randomly samples the first video frame to obtain a second video frame, and performs data enhancement processing on the second video frame to obtain an enhanced second video frame; the two-way excitation channel grouping layer carries out deep feature extraction based on the enhanced second video frame to obtain deep features; the segmentation consensus layer calculates the average score of each video frame corresponding to the video to be classified in the same category according to the deep features; converting the average score into a probability value based on a set function; and based on the probability values of the video to be classified on all the categories, taking the action category corresponding to the maximum probability value as an action classification result, and outputting the action classification result.
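The overall inference flow described in the preceding paragraph can be summarized by the following sketch; it is only a schematic outline, and the attribute names sparse_sampling and backbone are placeholders assumed for illustration rather than names defined by the disclosure:

```python
import torch

def classify_video(first_video_frames, model, num_segments=8):
    """Schematic inference flow of the target video classification model (sketch)."""
    # sparse sampling layer: one randomly sampled, enhanced frame per segment
    clips = model.sparse_sampling(first_video_frames, num_segments)
    # two-way excitation channel grouping layer: per-frame category scores (assumed output shape)
    per_frame_scores = model.backbone(clips)        # [num_segments, num_classes]
    # segmentation consensus layer: average score of the frames on each category
    avg_scores = per_frame_scores.mean(dim=0)
    # set function (softmax) converts the average scores into probability values
    probs = torch.softmax(avg_scores, dim=-1)
    # the action category with the maximum probability is output as the classification result
    return int(probs.argmax())
```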
Specifically, the sparse sampling layer divides the first video frame into several segments by a sparse sampling strategy and randomly samples one frame (picture) from each segment. Illustratively, if the first video frame is 80 frames, the 80 frames (pictures) are divided into 8 segments by the sparse sampling strategy, each segment comprising 10 frames (pictures), and one frame is randomly sampled from each segment, giving 8 frames (pictures). These 8 frames (pictures) may be the second video frame. After the second video frame is obtained, the second video frame is processed by a data enhancement strategy, and the height and width of the video frames are adjusted to a uniform size, for example 224 × 224, so as to obtain the enhanced second video frame. The data enhancement includes random flipping and/or angle cropping operations. The video frames include time information; for example, if the second video frame is 8 frames, this can be understood as 8 pieces of time information.
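A minimal sketch of this sparse sampling and data enhancement step is given below, assuming the frames are already stacked in a tensor of shape [T_total, C, H, W]; the bilinear resizing and the 50% flip probability are illustrative assumptions, and angle cropping is omitted for brevity:

```python
import random
import torch
import torch.nn.functional as F

def sparse_sample(frames, num_segments=8, size=224, train=True):
    """frames: tensor [T_total, C, H, W] -> enhanced second video frame [num_segments, C, size, size]."""
    seg_len = frames.shape[0] // num_segments                 # e.g. 80 frames -> 8 segments of 10
    picks = [random.randrange(i * seg_len, (i + 1) * seg_len) for i in range(num_segments)]
    clip = frames[picks].float()                              # one randomly sampled frame per segment
    clip = F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
    if train and random.random() < 0.5:                       # data enhancement: random horizontal flip
        clip = torch.flip(clip, dims=[-1])
    return clip
```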
In this embodiment, after the deep features are obtained by the two-way excitation channel grouping layer, the deep features are mapped by a fully connected layer to a specific sample space (whose dimension is the total number of categories of the data set) to obtain the fully connected layer features. The segmentation consensus layer then calculates, from the fully connected layer features and through a segment consensus function, the average score of the video frames corresponding to the video to be classified on each category. The segment consensus function is an average pooling (mean) function.
Illustratively, assume there are 2 categories and the second video frame is 3 frames: the prediction scores of the first frame picture are 0.5 on category A and 0.3 on category B; the prediction scores of the second frame picture are 0.4 on category A and 0.5 on category B; and the prediction scores of the third frame picture are 0.6 on category A and 0.4 on category B. The average score of the 3 video frames on category A is then (0.5+0.4+0.6)/3 = 0.5, and the average score of the 3 video frames on category B is (0.3+0.5+0.4)/3 = 0.4.
In this embodiment, after the average score of each video frame in the same category is calculated, the average score is converted into a probability value by a softmax function (normalization function), so as to obtain the probability value of the video to be classified in each category, and the category corresponding to the maximum probability value is used as the action classification result, and the action classification result is output.
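The segment consensus, softmax conversion and final decision can be reproduced numerically with the two-category, three-frame example above (the scores are hard-coded purely for illustration):

```python
import torch

# rows = frames, columns = (category A, category B), taken from the example above
frame_scores = torch.tensor([[0.5, 0.3],
                             [0.4, 0.5],
                             [0.6, 0.4]])

avg_scores = frame_scores.mean(dim=0)      # segment consensus (average pooling): tensor([0.5000, 0.4000])
probs = torch.softmax(avg_scores, dim=0)   # set function: convert average scores to probability values
action = probs.argmax().item()             # 0 -> category A is output as the action classification result
```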
Optionally, the two-way excitation channel grouping layer includes at least four two-way excitation channel grouping modules, and for adjacent two-way excitation channel grouping modules the input of the latter module is the output of the former module. Optionally, the two-way excitation channel grouping layer performing deep feature extraction based on the enhanced second video frame to obtain deep features includes: the two-way excitation channel grouping modules perform deep feature extraction based on the enhanced second video frame to obtain deep sub-features.
It should be noted that the number of two-way excitation channel grouping modules may be set according to the backbone network of the TSN; for example, if the backbone network of the TSN is a ResNet50 network, the two-way excitation channel grouping layer may include 4 two-way excitation channel grouping modules. As for the inputs of the two-way excitation channel grouping modules, except that the input of the first two-way excitation channel grouping module is the enhanced second video frame, the input of the latter of two adjacent modules is the output of the former. Fig. 2 is a schematic diagram of a video classification method according to an embodiment of the present disclosure. As shown in fig. 2, the two-way excitation channel grouping layer comprises 4 two-way excitation channel grouping modules, and the category distribution represents the distribution of the scores of the video to be classified over the categories.
Optionally, the two-way excitation channel grouping module includes a plurality of bottleneck units, the bottleneck units are connected in cascade, and for adjacent bottleneck units the input of the latter bottleneck unit is the output of the former bottleneck unit; the bottleneck unit comprises a first two-dimensional convolution subunit, a motion excitation subunit, a channel excitation subunit, a channel grouping subunit and a second two-dimensional convolution subunit; the input of the motion excitation subunit and the input of the channel excitation subunit are both the output of the first two-dimensional convolution subunit, the output of the motion excitation subunit and the output of the channel excitation subunit are added, the added result is used as the input of the channel grouping subunit, and the output of the channel grouping subunit is the input of the second two-dimensional convolution subunit; the two-way excitation channel grouping module performing deep feature extraction based on the enhanced second video frame to obtain deep sub-features comprises the following steps: if the bottleneck unit to which the first two-dimensional convolution subunit belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on the enhanced second video frame to obtain a first convolution feature; otherwise, the first two-dimensional convolution subunit performs feature extraction based on the output of the previous bottleneck unit to obtain the first convolution feature; the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature; the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature; the channel grouping subunit performs feature extraction based on the feature obtained by adding the motion feature and the channel feature to obtain a long-distance spatio-temporal feature; and the second two-dimensional convolution subunit performs feature extraction based on the long-distance spatio-temporal feature to obtain a second convolution feature.
It should be noted that the number of bottleneck units may be set according to the backbone network of the TSN; for example, if the backbone network of the TSN is a ResNet50 network, the numbers of bottleneck units of the 4 two-way excitation channel grouping modules are respectively 3, 4, 6 and 3, and the bottleneck units are connected in cascade, with the input of the latter of two adjacent bottleneck units being the output of the former.
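Under the ResNet50-style layout just described, the four two-way excitation channel grouping modules could be assembled as in the sketch below; block stands for the bottleneck unit sketched after Fig. 3, and the channel widths (64/128/256/512 with a 4x expansion) as well as the omission of spatial down-sampling between stages are assumptions made for brevity, not details taken from the disclosure:

```python
import torch.nn as nn

def make_stage(block, in_ch, mid_ch, num_units, num_segments):
    """One two-way excitation channel grouping module: a cascade of bottleneck units."""
    units = [block(in_ch, mid_ch, num_segments)]
    units += [block(mid_ch * 4, mid_ch, num_segments) for _ in range(num_units - 1)]
    return nn.Sequential(*units)

def make_backbone(block, num_segments=8):
    """Four cascaded modules with 3, 4, 6 and 3 bottleneck units, mirroring ResNet50 (assumed widths)."""
    return nn.Sequential(
        make_stage(block, 64,   64,  3, num_segments),
        make_stage(block, 256,  128, 4, num_segments),
        make_stage(block, 512,  256, 6, num_segments),
        make_stage(block, 1024, 512, 3, num_segments),
    )
```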
In this embodiment, the bottleneck unit includes a first two-dimensional convolution subunit, a motion excitation subunit, a channel excitation subunit, a channel grouping subunit and a second two-dimensional convolution subunit; the first two-dimensional convolution subunit and the second two-dimensional convolution subunit may both be two-dimensional convolutions with a convolution kernel size of 1 × 1.
Specifically, the enhanced second video frame is first preprocessed by a quadruple operation, then the preprocessed features are input into the first two-way excitation channel grouping module for feature extraction, the output of the first module is fed into the second two-way excitation channel grouping module, and so on until the last two-way excitation channel grouping module completes its feature extraction. The quadruple operation consists of Conv-BN-ReLU-MaxPool, where Conv may be a convolution with kernel size 7 × 7 and stride 2, BN is Batch Normalization (BN) processing, ReLU is the rectified linear unit activation function, and MaxPool is a pooling operation with kernel size 3 × 3 and stride 2. The feature sequence shape of the second video frame may be [N, T, C, H, W], where N is the batch size, e.g., 128, T and C represent the temporal dimension and the channel dimension, respectively, and H and W are the height and width of the spatial shape.
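The quadruple Conv-BN-ReLU-MaxPool preprocessing and the [N, T, C, H, W] bookkeeping might look like the sketch below; folding the time dimension into the batch dimension so that 2D operators work per frame, as well as the padding values, are implementation assumptions (the batch size is reduced from 128 purely for illustration):

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # Conv, 7x7, stride 2
    nn.BatchNorm2d(64),                                                 # BN (Batch Normalization)
    nn.ReLU(inplace=True),                                              # ReLU activation
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                   # MaxPool, 3x3, stride 2
)

x = torch.randn(2, 8, 3, 224, 224)        # [N, T, C, H, W]: 8 enhanced second video frames per sample
n, t, c, h, w = x.shape
y = stem(x.view(n * t, c, h, w))          # fold time into the batch so 2D convolution is applied per frame
y = y.view(n, t, *y.shape[1:])            # back to [N, T, 64, 56, 56]
```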
Specifically, for the first two-dimensional convolution subunit in a bottleneck unit of a two-way excitation channel grouping module: if the bottleneck unit to which it belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on the features preprocessed by the quadruple operation to obtain the first convolution feature; otherwise, the first two-dimensional convolution subunit performs feature extraction based on the output of the previous bottleneck unit to obtain the first convolution feature. After the first convolution feature is obtained, the motion excitation subunit performs feature extraction based on the first convolution feature to obtain the motion feature; the channel excitation subunit performs feature extraction based on the first convolution feature to obtain the channel feature; the channel grouping subunit performs feature extraction based on the feature obtained by adding the motion feature and the channel feature to obtain the long-distance spatio-temporal feature; and the second two-dimensional convolution subunit performs feature extraction based on the long-distance spatio-temporal feature to obtain the second convolution feature. Fig. 3 is a schematic diagram of the network structure of a bottleneck unit according to an embodiment of the present disclosure. In fig. 3, "+" indicates that the output of the motion excitation subunit (the motion feature) and the output of the channel excitation subunit (the channel feature) are added, and the channel grouping subunit performs feature extraction based on the feature obtained by this addition.
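Putting the sub-units of Fig. 3 together, one bottleneck unit could be organized as in the sketch below; MotionExcitation, ChannelExcitation and ChannelGrouping refer to the sketches given in the following subsections, and the 4x channel expansion and the shortcut handling are assumptions borrowed from the usual ResNet bottleneck rather than details stated in the disclosure:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a bottleneck unit: 1x1 conv -> (motion excitation + channel excitation)
    -> channel grouping -> 1x1 conv, with an assumed residual shortcut."""
    def __init__(self, in_ch, mid_ch, num_segments):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)       # first 2D convolution subunit
        self.me = MotionExcitation(mid_ch, num_segments)                        # motion excitation subunit
        self.ce = ChannelExcitation(mid_ch, num_segments)                       # channel excitation subunit
        self.cg = ChannelGrouping(mid_ch, num_segments)                         # channel grouping subunit
        self.conv2 = nn.Conv2d(mid_ch, mid_ch * 4, kernel_size=1, bias=False)   # second 2D convolution subunit
        self.shortcut = (nn.Identity() if in_ch == mid_ch * 4
                         else nn.Conv2d(in_ch, mid_ch * 4, kernel_size=1, bias=False))

    def forward(self, x):                   # x: [N*T, C, H, W]
        out = self.conv1(x)                 # first convolution feature
        out = self.me(out) + self.ce(out)   # "+" in Fig. 3: add motion feature and channel feature
        out = self.cg(out)                  # long-distance spatio-temporal feature
        out = self.conv2(out)               # second convolution feature
        return out + self.shortcut(x)
```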
Optionally, the motion excitation subunit performing feature extraction based on the first convolution feature to obtain the motion feature includes: performing channel-number compression on the first convolution feature through a third two-dimensional convolution to obtain a channel compression feature; for the channel compression features at adjacent moments, performing feature extraction on the channel compression feature at moment t+1 through a fourth two-dimensional convolution to obtain a fourth convolution feature, and subtracting the channel compression feature at moment t from the fourth convolution feature to obtain a plurality of motion sub-features, wherein t is a positive integer whose value ranges between a first set value and a second set value; splicing the plurality of motion sub-features in the time dimension to obtain a first complete motion sub-feature; setting the motion feature at the last moment to a third set value to obtain the motion sub-feature at the last moment; concatenating the first complete motion sub-feature and the motion sub-feature at the last moment to obtain a second complete motion sub-feature; processing the second complete motion sub-feature through global average pooling to obtain a first pooling feature; adjusting the number of channels of the first pooling feature based on a fifth convolution to obtain an adjusted first pooling feature; performing attention-mechanism feature extraction on the first pooling feature to obtain an enhanced motion sub-feature; and performing a residual connection based on the first convolution feature and the enhanced motion sub-feature to obtain the motion feature.
Specifically, the process by which the motion excitation subunit performs feature extraction on the first convolution feature to obtain the motion feature is as follows. The first convolution feature is compressed to 1/16 of the original number of channels by a third two-dimensional convolution (a two-dimensional convolution with kernel size 1 × 1 and stride 1), so as to reduce the calculation cost and improve the calculation efficiency, thereby obtaining the channel compression feature. The formula is as follows:
X_r = conv_red * X, X_r ∈ R^(N×T×C/r×H×W)    (1)
where X_r is the channel compression feature, X is the first convolution feature, conv_red is the third two-dimensional convolution, * represents the convolution operation, and 1/r is the channel reduction ratio, with r taken as 16.
Specifically, a fourth two-dimensional convolution is applied to the channel compression feature at moment t+1 to obtain the fourth convolution feature, and then the channel compression feature at moment t is subtracted from the fourth convolution feature to obtain the motion sub-feature at moment t. This operation is performed on the channel compression features at each pair of adjacent moments to obtain a plurality of motion sub-features. The first set value is 1, and the second set value is the number of second video frames minus 1. The specific formula is as follows:
M(t) = conv_trans * X_r(t+1) - X_r(t), 1 ≤ t ≤ T-1    (2)
where M(t) ∈ R^(N×C/r×H×W) represents the motion sub-feature at moment t, T represents the number of second video frames (so that T-1 is the second set value), and conv_trans is a two-dimensional convolution with kernel size 3 × 3 and stride 1. Formula (2) is executed on every two adjacent channel compression features to obtain T-1 motion sub-feature representations, and these motion sub-features are spliced in the time dimension to obtain the first complete motion sub-feature. In order to make the time dimension of the first complete motion sub-feature the same as that of the first convolution feature, the motion feature at the last moment is set to the third set value, which is 0, i.e. M(T) = 0, to obtain the motion sub-feature at the last moment; the first complete motion sub-feature and the motion sub-feature at the last moment are then concatenated to construct the final second complete motion sub-feature M, i.e. M = [M(1), M(2), ..., M(T)].
Since the goal of the motion excitation subunit is to excite the motion-sensitive channels, the network is made to pay more attention to the motion information without considering the detailed spatial layout. Therefore, the second complete motion sub-feature can be processed by global average pooling, as follows:
M_s = Pool(M), M_s ∈ R^(N×T×C/r×1×1)    (3)
where M is the second complete motion sub-feature and M_s represents the first pooling feature.
Then, the number of channels of the first pooling feature is adjusted based on a fifth convolution to obtain the adjusted first pooling feature, so that the number of channels of the first pooling feature is restored to the original size, and the motion attention weight is obtained through the activation function Sigmoid(·). The fifth convolution is a two-dimensional convolution with kernel size 1 × 1 and stride 1. The formula is as follows:
A = 2δ(conv_exp * M_s) - 1, A ∈ R^(N×T×C×1×1)    (4)
where conv_exp is the fifth convolution, δ is the activation function, and A is the motion attention weight, i.e. the weight of the motion attention mechanism.
Then, the first convolution feature is multiplied by the motion attention mechanism weight to obtain the enhanced motion sub-feature. Finally, a residual connection is performed based on the first convolution feature and the enhanced motion sub-feature to obtain the motion feature; through the residual connection, the original information is retained while the motion information is enhanced. The formula is as follows:
X_m^o = X + X ⊙ A    (5)
where X_m^o is the output of the motion excitation subunit, i.e. the motion feature, and ⊙ indicates channel-wise multiplication.
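Formulas (1) to (5) could be realized roughly as follows; this is a PyTorch sketch under the shapes stated above, with time folded into the batch dimension and the layer attribute names chosen for illustration (they are not taken from the disclosure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionExcitation(nn.Module):
    """Sketch of the motion excitation subunit, formulas (1)-(5)."""
    def __init__(self, channels, num_segments, r=16):
        super().__init__()
        self.t = num_segments
        self.conv_red = nn.Conv2d(channels, channels // r, kernel_size=1, bias=False)     # eq. (1)
        self.conv_trans = nn.Conv2d(channels // r, channels // r, kernel_size=3,
                                    padding=1, bias=False)                                # eq. (2)
        self.conv_exp = nn.Conv2d(channels // r, channels, kernel_size=1, bias=False)     # eq. (4)

    def forward(self, x):                                    # x: first convolution feature [N*T, C, H, W]
        nt, c, h, w = x.shape
        n, t = nt // self.t, self.t
        xr = self.conv_red(x)                                # eq. (1): compress channels to C/r
        cr = xr.shape[1]
        xr = xr.view(n, t, cr, h, w)
        nxt = self.conv_trans(xr[:, 1:].reshape(-1, cr, h, w)).view(n, t - 1, cr, h, w)
        m = nxt - xr[:, :-1]                                 # eq. (2): conv_trans*X_r(t+1) - X_r(t)
        m = torch.cat([m, torch.zeros_like(xr[:, -1:])], dim=1)   # M(T) = 0, concatenated in time
        ms = F.adaptive_avg_pool2d(m.view(nt, cr, h, w), 1)  # eq. (3): global average pooling
        a = 2 * torch.sigmoid(self.conv_exp(ms)) - 1         # eq. (4): motion attention weight A
        return x + x * a                                     # eq. (5): residual plus channel-wise product
```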
Optionally, the channel excitation subunit performing feature extraction based on the first convolution feature to obtain the channel feature includes: processing the first convolution feature through global average pooling to obtain a second pooling feature; adjusting the number of channels of the second pooling feature through a sixth convolution to obtain an adjusted second pooling feature; interchanging the positions of the channel dimension and the time dimension of the adjusted second pooling feature to obtain a third pooling feature, wherein the second pooling feature comprises a batch size dimension, a time dimension, a channel dimension, a height dimension and a width dimension; processing the third pooling feature through a seventh convolution to obtain a first channel sub-feature; interchanging the positions of the channel dimension and the time dimension of the first channel sub-feature to obtain a second channel sub-feature; performing attention-mechanism feature extraction on the second channel sub-feature to obtain an enhanced channel sub-feature; and performing a residual connection based on the first convolution feature and the enhanced channel sub-feature to obtain the channel feature.
Specifically, the process by which the channel excitation subunit performs feature extraction on the first convolution feature to obtain the channel feature is as follows. First, the first convolution feature is processed by global average pooling to obtain the second pooling feature, with the formula:
F[:, :, :] = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X[:, :, :, i, j]    (6)
where ":" indicates all values of a dimension: the first ":" indicates all values in the batch size dimension, the second ":" indicates all features in the time dimension, and the third ":" indicates all features in the channel dimension; F is the second pooling feature.
Then, the number of channels of the second pooling feature is compressed by a sixth convolution to obtain the adjusted second pooling feature, where the sixth convolution is a two-dimensional convolution with kernel size 1 × 1. The formula is as follows:
F_r = K_1 * F, F_r ∈ R^(N×T×C/r×1×1)    (7)
where K_1 is the sixth convolution, 1/r is the compression ratio, and r is 16.
Then, the positions of the channel dimension and the time dimension of the adjusted second pooling feature are interchanged to obtain the third pooling feature, so as to support temporal reasoning; the shape of the third pooling feature is [N, C/r, T, 1, 1]. The third pooling feature is then processed by a seventh convolution to obtain the first channel sub-feature (formula (8)), where the seventh convolution K_2 is a one-dimensional convolution with kernel size 3. Then, the dimensions of the first channel sub-feature are adjusted to [N, T, C/r, 1, 1], i.e. the channel dimension and the time dimension are interchanged again, to obtain the second channel sub-feature, which is denoted F_temp.
Then, channel activation is performed on the second channel sub-feature through an eighth convolution and the activation function Sigmoid to obtain the channel attention mechanism weight, where the eighth convolution is a two-dimensional convolution with kernel size 1 × 1. The formulas are as follows:
F_o = K_3 * F_temp    (9)
M = δ(F_o)    (10)
where K_3 is the eighth convolution, F_o is the feature obtained by applying the eighth convolution to the second channel sub-feature, δ is the Sigmoid activation function, and M is the channel attention mechanism weight.
Finally, attention-mechanism feature extraction is performed on the second channel sub-feature to obtain the enhanced channel sub-feature, and a residual connection is performed based on the first convolution feature and the enhanced channel sub-feature to obtain the channel feature. The formula is as follows:
X_c^o = X + X ⊙ M    (11)
where X_c^o is the output of the channel excitation subunit, i.e. the channel feature, and ⊙ indicates channel-wise multiplication.
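Formulas (6) to (11) could be sketched as below, again with time folded into the batch dimension (an implementation assumption) and with K_1, K_2 and K_3 kept as attribute names to match the description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelExcitation(nn.Module):
    """Sketch of the channel excitation subunit, formulas (6)-(11)."""
    def __init__(self, channels, num_segments, r=16):
        super().__init__()
        self.t = num_segments
        self.k1 = nn.Conv2d(channels, channels // r, kernel_size=1, bias=False)   # sixth convolution
        self.k2 = nn.Conv1d(channels // r, channels // r, kernel_size=3,
                            padding=1, bias=False)                                # seventh convolution
        self.k3 = nn.Conv2d(channels // r, channels, kernel_size=1, bias=False)   # eighth convolution

    def forward(self, x):                                    # x: first convolution feature [N*T, C, H, W]
        nt, c, h, w = x.shape
        n, t = nt // self.t, self.t
        f = F.adaptive_avg_pool2d(x, 1)                      # eq. (6): second pooling feature
        fr = self.k1(f)                                      # eq. (7): adjusted second pooling feature
        cr = fr.shape[1]
        fr = fr.view(n, t, cr).transpose(1, 2)               # swap channel and time dims -> [N, C/r, T]
        ft = self.k2(fr)                                     # eq. (8): first channel sub-feature
        ft = ft.transpose(1, 2).reshape(nt, cr, 1, 1)        # swap back -> second channel sub-feature F_temp
        m = torch.sigmoid(self.k3(ft))                       # eqs. (9)-(10): channel attention weight M
        return x + x * m                                     # eq. (11): residual plus channel-wise product
```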
Optionally, the channel grouping subunit performing feature extraction based on the feature obtained by adding the motion feature and the channel feature to obtain the long-distance spatio-temporal feature includes: dividing the feature obtained by adding the motion feature and the channel feature along the channel dimension to obtain a set number of long-distance spatio-temporal sub-features; for the second long-distance spatio-temporal sub-feature, processing it sequentially through a channel-level temporal sub-convolution and a spatial sub-convolution to obtain a new second long-distance spatio-temporal sub-feature; for the N-th long-distance spatio-temporal sub-feature, adding the new (N-1)-th long-distance spatio-temporal sub-feature and the N-th long-distance spatio-temporal sub-feature to obtain a residual feature, and processing the residual feature sequentially through the channel-level temporal sub-convolution and the spatial sub-convolution to obtain a new N-th long-distance spatio-temporal sub-feature, wherein N is a positive integer greater than 2; and splicing the first long-distance spatio-temporal sub-feature, the new second long-distance spatio-temporal sub-feature and the new N-th long-distance spatio-temporal sub-features along the channel dimension to obtain the long-distance spatio-temporal feature.
Specifically, the process by which the channel grouping subunit performs feature extraction based on the feature obtained by adding the motion feature and the channel feature to obtain the long-distance spatio-temporal feature is as follows. First, the feature obtained by adding the motion feature and the channel feature is divided along the channel dimension to obtain a set number of long-distance spatio-temporal sub-features, where the set number is 4; the shape of each long-distance spatio-temporal sub-feature is [N, T, C/4, H, W]. For the first long-distance spatio-temporal sub-feature, the formula is as follows:
X_i^o = X_i, i = 1    (12)
where, when i = 1, X_i represents the first long-distance spatio-temporal sub-feature and X_i^o is the new first long-distance spatio-temporal sub-feature. That is, the new first long-distance spatio-temporal sub-feature is identical to the first long-distance spatio-temporal sub-feature, so the receptive field of the new first long-distance spatio-temporal sub-feature is 1 × 1 × 1.
For the second long-distance spatio-temporal sub-feature, it is processed sequentially through the channel-level temporal sub-convolution and the spatial sub-convolution to obtain the new second long-distance spatio-temporal sub-feature, where the channel-level temporal sub-convolution is a one-dimensional convolution of size 3 and the spatial sub-convolution is a two-dimensional convolution of size 3 × 3. The formula is as follows:
X_i^o = conv_spa * (conv_temp * X_i), i = 2    (13)
where conv_temp is the channel-level temporal sub-convolution and conv_spa is the spatial sub-convolution; when i = 2, X_i represents the second long-distance spatio-temporal sub-feature and X_i^o is the new second long-distance spatio-temporal sub-feature.
For the third long-distance spatio-temporal sub-feature, it is added to the new second long-distance spatio-temporal sub-feature to obtain a residual feature, and the residual feature is processed sequentially through the channel-level temporal sub-convolution and the spatial sub-convolution to obtain the new third long-distance spatio-temporal sub-feature.
For the fourth long-distance spatio-temporal sub-feature, it is added to the new third long-distance spatio-temporal sub-feature to obtain a residual feature, and the residual feature is processed sequentially through the channel-level temporal sub-convolution and the spatial sub-convolution to obtain the new fourth long-distance spatio-temporal sub-feature. The formula is as follows:
X_i^o = conv_spa * (conv_temp * (X_i + X_{i-1}^o)), i = 3, 4    (14)
where X_{i-1}^o represents the new previous long-distance spatio-temporal sub-feature; when i = 3, X_i^o represents the new third long-distance spatio-temporal sub-feature, and when i = 4, X_i^o represents the new fourth long-distance spatio-temporal sub-feature.
In this embodiment, for the N-th long-distance spatio-temporal sub-feature, the channel grouping subunit is converted from a parallel structure into a stacked structure by adding the new (N-1)-th long-distance spatio-temporal sub-feature to the N-th long-distance spatio-temporal sub-feature, i.e. by adding the residual connection. Through these residual connections, the receptive field of the new fourth long-distance spatio-temporal sub-feature is expanded threefold; that is, different long-distance spatio-temporal sub-features have different receptive fields.
Finally, the first long-distance spatio-temporal sub-feature, the new second long-distance spatio-temporal sub-feature, and so on up to the new N-th long-distance spatio-temporal sub-feature are spliced in the channel dimension by concatenation to obtain the long-distance spatio-temporal feature. The formula is as follows:
X_o = [X_1^o, X_2^o, X_3^o, X_4^o]    (15)
where X_1^o, X_2^o, X_3^o and X_4^o respectively represent the new first, new second, new third and new fourth long-distance spatio-temporal sub-features, and X_o is the long-distance spatio-temporal feature. The long-distance spatio-temporal feature captures spatio-temporal information at different moments.
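Formulas (12) to (15) could be sketched as below; treating the channel-level temporal sub-convolution as a depthwise one-dimensional convolution, folding time into the batch dimension and fixing the group number at 4 are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class ChannelGrouping(nn.Module):
    """Sketch of the channel grouping subunit, formulas (12)-(15)."""
    def __init__(self, channels, num_segments, groups=4):
        super().__init__()
        self.t, self.groups = num_segments, groups
        g = channels // groups
        # one temporal (1D, size 3, depthwise - assumed) and one spatial (2D, 3x3) conv per processed group
        self.conv_temp = nn.ModuleList(
            [nn.Conv1d(g, g, kernel_size=3, padding=1, groups=g, bias=False) for _ in range(groups - 1)])
        self.conv_spa = nn.ModuleList(
            [nn.Conv2d(g, g, kernel_size=3, padding=1, bias=False) for _ in range(groups - 1)])

    def _temporal(self, conv, x, n, t):                      # channel-level temporal sub-convolution
        ntg, g, h, w = x.shape
        y = x.reshape(n, t, g, h, w).permute(0, 3, 4, 2, 1).reshape(n * h * w, g, t)
        y = conv(y)
        return y.reshape(n, h, w, g, t).permute(0, 4, 3, 1, 2).reshape(n * t, g, h, w)

    def forward(self, x):                                    # x: motion feature + channel feature [N*T, C, H, W]
        n, t = x.shape[0] // self.t, self.t
        xs = torch.chunk(x, self.groups, dim=1)              # divide along the channel dimension
        outs, prev = [xs[0]], None                           # eq. (12): first group is kept unchanged
        for i in range(1, self.groups):
            xi = xs[i] if prev is None else xs[i] + prev     # eqs. (13)-(14): residual from previous new group
            yi = self.conv_spa[i - 1](self._temporal(self.conv_temp[i - 1], xi, n, t))
            outs.append(yi)
            prev = yi
        return torch.cat(outs, dim=1)                        # eq. (15): concatenate along the channel dimension
```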
It should be noted that the second and third bottleneck units in the first two-way excitation channel grouping module, as well as the subsequent second, third and fourth two-way excitation channel grouping modules, perform further feature extraction; the extraction process is consistent with the process of formulas (1) to (15) above, the only difference being that the number of channels and the dimensions of the shape keep changing.
In this embodiment, the target video classification model is trained as follows: the target video classification model is trained on a training set to obtain the trained target video classification model, and the trained target video classification model is tested on a test set. The loss function in this embodiment is the cross-entropy loss function.
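A minimal training and testing sketch under these statements is shown below; only the cross-entropy loss comes from the disclosure, while the optimizer, the data loaders train_loader/test_loader, the model object and the accuracy metric are assumptions made for illustration:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                                        # loss function of this embodiment
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # optimizer choice is an assumption

model.train()
for clips, labels in train_loader:            # clips: [N, T, C, H, W] sampled frames, labels: [N]
    loss = criterion(model(clips), labels)    # class scores after the segmentation consensus layer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()                                  # test the trained target video classification model
correct = total = 0
with torch.no_grad():
    for clips, labels in test_loader:
        preds = model(clips).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"test accuracy: {correct / total:.4f}")
```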
In this embodiment, the key motion information, channel information and long-distance spatio-temporal information in the classification process are enhanced through the two-way excitation channel grouping layer. When the motion excitation subunit is constructed, motion information in the time dimension is enhanced by segmenting the features along the time dimension and taking adjacent-frame differences, a Sigmoid activation function is used to activate the features of local motion between adjacent frames so as to efficiently extract the short-time-sequence motion features between the second video frames, and a residual structure is used to preserve the static scene information of the original frames (the first convolution feature). When the channel excitation subunit is constructed, the time information of the channel features is characterized by a one-dimensional convolution, and a Sigmoid activation function is likewise used to adaptively recalibrate the channel features so as to characterize the time dependency between channels. When the channel grouping subunit is constructed, the long-distance spatio-temporal sub-features and the corresponding local convolutions (the channel-level temporal sub-convolution and the spatial sub-convolution) are divided into groups of subsets, and a multi-level residual structure is added, so that the original multi-level cascade structure is converted into a multi-level parallel structure, which improves the multi-scale representation capability of the convolution kernels and correspondingly enlarges the equivalent receptive field in the time dimension.
In the embodiment, the motion excitation subunit extracts the motion information of short time sequences between adjacent video frames, the channel excitation subunit adaptively adjusts the time dependency relationship between channels, the channel grouping subunit extracts the spatio-temporal information of long time sequences, the three subunits are integrated in the bottleneck unit, and the efficient target video classification model is constructed by stacking the bottleneck units.
According to the technical scheme of the embodiment of the disclosure, a video to be classified is acquired, the content of which comprises behavior actions of at least one target object; a first video frame corresponding to the video to be classified is input into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, which are in cascade connection. Through the two-way excitation channel grouping layer, the embodiment of the disclosure not only utilizes the key motion information between video frames, the time dependency between channels and the long-distance video spatio-temporal information, but also realizes end-to-end efficient video classification with a small number of input frames. By means of the two-way excitation channel grouping layer, the heavy time consumption and storage occupation of optical flow calculation as well as the difficulty of training the branches of a multi-stream network separately are avoided; the amount of computation is greatly reduced, while the inference speed and the classification accuracy are further improved.
Fig. 4 is a schematic structural diagram of a video classification apparatus according to an embodiment of the disclosure, and as shown in fig. 4, the apparatus includes: a video to be classified acquisition module 410 and an action classification result acquisition module 420;
a to-be-classified video obtaining module 410, configured to obtain a to-be-classified video; the content of the video to be classified comprises behavior actions of at least one target object;
an action classification result obtaining module 420, configured to input the first video frame corresponding to the video to be classified into a target video classification model, and obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are in cascade connection.
According to the technical scheme of the embodiment of the disclosure, the video to be classified is acquired through the to-be-classified video obtaining module, the content of the video to be classified comprising behavior actions of at least one target object; the first video frame corresponding to the video to be classified is input into the target video classification model through the action classification result obtaining module to obtain the action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, which are in cascade connection. Through the two-way excitation channel grouping layer, the embodiment of the disclosure not only utilizes the key motion information between video frames, the time dependency between channels and the long-distance video spatio-temporal information, but also realizes end-to-end efficient video classification with a small number of input frames. By means of the two-way excitation channel grouping layer, the heavy time consumption and storage occupation of optical flow calculation as well as the difficulty of training the branches of a multi-stream network separately are avoided; the amount of computation is greatly reduced, while the inference speed and the classification accuracy are further improved.
Optionally, the action classification result obtaining module is specifically configured to: the sparse sampling layer carries out random sampling on the first video frame to obtain a second video frame, and carries out data enhancement processing on the second video frame to obtain an enhanced video frame; the data enhancement comprises random flipping and/or angle cropping operations; wherein the video frame includes time information; the two-way excitation channel grouping layer carries out deep feature extraction based on the enhanced second video frame to obtain deep features; the segmentation consensus layer calculates the average score of each video frame corresponding to the video to be classified in the same category according to the deep features; converting the average score to a probability value based on a set function; and based on the probability values of the video to be classified in all the categories, taking the action category corresponding to the maximum probability value as an action classification result, and outputting the action classification result.
Optionally, the two-way excitation channel grouping layer includes at least four two-way excitation channel grouping modules, and for adjacent two-way excitation channel grouping modules the input of the latter module is the output of the former module. The action classification result obtaining module is further configured to: the two-way excitation channel grouping modules perform deep feature extraction based on the enhanced second video frame to obtain deep sub-features.
Optionally, the two-way excitation channel grouping module includes a plurality of bottleneck units, the bottleneck units are connected in cascade, and for adjacent bottleneck units the input of the latter bottleneck unit is the output of the former bottleneck unit; the bottleneck unit comprises a first two-dimensional convolution subunit, a motion excitation subunit, a channel excitation subunit, a channel grouping subunit and a second two-dimensional convolution subunit; the input of the motion excitation subunit and the input of the channel excitation subunit are both the output of the first two-dimensional convolution subunit, the output of the motion excitation subunit and the output of the channel excitation subunit are added, the added result serves as the input of the channel grouping subunit, and the output of the channel grouping subunit is the input of the second two-dimensional convolution subunit.
Optionally, the action classification result obtaining module is further configured to: if the bottleneck unit to which the first two-dimensional convolution subunit belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on the enhanced second video frame to obtain a first convolution feature; otherwise, the first two-dimensional convolution subunit performs feature extraction based on the output of the previous bottleneck unit to obtain a first convolution feature; the motion excitation subunit performs feature extraction based on the first convolution features to obtain motion features; the channel excitation subunit performs feature extraction based on the first convolution features to obtain channel features; the channel grouping subunit extracts the features based on the features obtained by adding the motion features and the channel features to obtain long-distance space-time features; and the second two-dimensional convolution subunit performs feature extraction based on the long-distance space-time feature to obtain a second convolution feature.
Optionally, the action classification result obtaining module is further configured to: perform channel-number compression on the first convolution feature through a third two-dimensional convolution to obtain a channel compression feature; for the channel compression features at adjacent moments, perform feature extraction on the channel compression feature at moment t+1 through a fourth two-dimensional convolution to obtain a fourth convolution feature, and subtract the channel compression feature at moment t from the fourth convolution feature to obtain a plurality of motion sub-features, wherein t is a positive integer whose value ranges between a first set value and a second set value; splice the plurality of motion sub-features in the time dimension to obtain a first complete motion sub-feature; set the motion feature at the last moment to a third set value to obtain the motion sub-feature at the last moment; concatenate the first complete motion sub-feature and the motion sub-feature at the last moment to obtain a second complete motion sub-feature; process the second complete motion sub-feature through global average pooling to obtain a first pooling feature; adjust the number of channels of the first pooling feature based on a fifth convolution to obtain an adjusted first pooling feature; perform attention-mechanism feature extraction on the first pooling feature to obtain an enhanced motion sub-feature; and perform a residual connection based on the first convolution feature and the enhanced motion sub-feature to obtain the motion feature.
Optionally, the action classification result obtaining module is further configured to: process the first convolution feature through global average pooling to obtain a second pooling feature; adjust the number of channels of the second pooling feature through a sixth convolution to obtain an adjusted second pooling feature; interchange the positions of the channel dimension and the time dimension of the adjusted second pooling feature to obtain a third pooling feature, wherein the second pooling feature comprises a batch size dimension, a time dimension, a channel dimension, a height dimension and a width dimension; process the third pooling feature through a seventh convolution to obtain a first channel sub-feature; interchange the positions of the channel dimension and the time dimension of the first channel sub-feature to obtain a second channel sub-feature; perform attention-mechanism feature extraction on the second channel sub-feature to obtain an enhanced channel sub-feature; and perform a residual connection based on the first convolution feature and the enhanced channel sub-feature to obtain the channel feature.
Optionally, the action classification result obtaining module is further configured to: divide the feature obtained by adding the motion feature and the channel feature in the channel dimension to obtain a set number of long-distance space-time sub-features; for the second long-distance space-time sub-feature, process it sequentially through a channel-level time sequence sub-convolution and a space sub-convolution to obtain a new second long-distance space-time sub-feature; for the Nth long-distance space-time sub-feature, add the new (N-1)th long-distance space-time sub-feature and the Nth long-distance space-time sub-feature to obtain a residual feature; process the residual feature sequentially through the channel-level time sequence sub-convolution and the space sub-convolution to obtain a new Nth long-distance space-time sub-feature, wherein N is a positive integer greater than 2; and splice the first long-distance space-time sub-feature, the new second long-distance space-time sub-feature and the new Nth long-distance space-time sub-feature in the channel dimension to obtain the long-distance space-time feature.
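A sketch of the channel grouping subunit under the assumption of four groups and a Res2Net-like hierarchy is shown below; the group count, the kernel sizes, and the requirement that the channel number be divisible by the group count are assumptions, while the channel-level (depthwise) temporal convolution and the spatial convolution follow the steps described above.

import torch
import torch.nn as nn

class ChannelGrouping(nn.Module):
    def __init__(self, channels, num_segments, groups=4):
        super().__init__()
        self.t, self.g = num_segments, groups
        gc = channels // groups                                         # assumes channels divisible by groups
        self.temporal = nn.ModuleList(
            nn.Conv1d(gc, gc, 3, padding=1, groups=gc, bias=False)      # channel-level time sequence sub-convolution
            for _ in range(groups - 1))
        self.spatial = nn.ModuleList(
            nn.Conv2d(gc, gc, 3, padding=1, bias=False)                 # space sub-convolution
            for _ in range(groups - 1))

    def _temporal_conv(self, conv, x, n, h, w):                         # 1-D convolution along the time axis
        gc = x.size(1)
        y = x.view(n, self.t, gc, h, w).permute(0, 3, 4, 2, 1)          # (N, H, W, gc, T)
        y = conv(y.reshape(-1, gc, self.t))                             # fold space into the batch axis
        return y.view(n, h, w, gc, self.t).permute(0, 4, 3, 1, 2).reshape(-1, gc, h, w)

    def forward(self, x):                                               # x: motion feature + channel feature, (N*T, C, H, W)
        nt, c, h, w = x.shape
        n = nt // self.t
        chunks = list(torch.chunk(x, self.g, dim=1))                    # divide in the channel dimension
        outs, prev = [chunks[0]], None                                  # the first sub-feature is passed through unchanged
        for i in range(1, self.g):
            inp = chunks[i] if prev is None else chunks[i] + prev       # residual feature for groups beyond the second
            prev = self.spatial[i - 1](self._temporal_conv(self.temporal[i - 1], inp, n, h, w))
            outs.append(prev)
        return torch.cat(outs, dim=1)                                   # splice in the channel dimension: long-distance space-time feature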
The video classification apparatus provided by the embodiments of the present disclosure can execute the video classification method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to the executed method.
It should be noted that the units and modules included in the above apparatus are divided merely according to functional logic and are not limited to the above division, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only used to distinguish the units from one another and are not intended to limit the protection scope of the embodiments of the present disclosure.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring now to fig. 5, a schematic diagram of an electronic device 500 (e.g., a terminal device or a server) suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle terminal (e.g., a car navigation terminal), and fixed terminals such as a digital TV and a desktop computer. The electronic device shown in fig. 5 is only an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic device 500 are also stored. The processing means 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The electronic device provided by the embodiment of the present disclosure and the video classification method provided by the above embodiments belong to the same inventive concept; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
The disclosed embodiments provide a computer storage medium having a computer program stored thereon, which when executed by a processor implements the video classification method provided by the above embodiments.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a video to be classified; the content of the video to be classified comprises behavior actions of at least one target object; inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are in cascade connection.
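For orientation only, a hedged end-to-end sketch of the above pipeline follows, assuming PyTorch: sparse sampling of a fixed number of segments, a backbone standing in for the two-way excitation channel grouping layer, averaging as the segmentation consensus, and softmax/argmax as the set function and decision. The helper names sparse_sample and classify, the segment count of 8, and the shapes returned by the backbone and classifier are assumptions, not the disclosed implementation.

import torch
import torch.nn as nn

def sparse_sample(frames: torch.Tensor, num_segments: int = 8) -> torch.Tensor:
    """Randomly pick one frame from each of num_segments equal temporal chunks (sparse sampling layer)."""
    total = frames.size(0)
    bounds = torch.linspace(0, total, num_segments + 1).long()
    idx = torch.stack([torch.randint(int(a), max(int(b), int(a) + 1), (1,))[0]
                       for a, b in zip(bounds[:-1], bounds[1:])])
    return frames[idx]                                  # second video frames, shape (T, C, H, W)

@torch.no_grad()
def classify(frames: torch.Tensor, backbone: nn.Module, classifier: nn.Module, num_segments: int = 8) -> int:
    clip = sparse_sample(frames, num_segments)          # sparse sampling layer (data enhancement omitted here)
    feats = backbone(clip)                              # two-way excitation channel grouping layer, assumed (T, D)
    scores = classifier(feats)                          # per-segment class scores, assumed (T, num_classes)
    consensus = scores.mean(dim=0)                      # segmentation consensus layer: average score per category
    probs = torch.softmax(consensus, dim=0)             # set function converting average scores to probabilities
    return int(probs.argmax())                          # action category with the maximum probability value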
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to technical solutions formed by the particular combination of the above features, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A method of video classification, comprising:
acquiring a video to be classified; the content of the video to be classified comprises behavior actions of at least one target object;
inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are in cascade connection.
2. The method according to claim 1, wherein inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified comprises:
the sparse sampling layer carries out random sampling on the first video frame to obtain a second video frame, and carries out data enhancement processing on the second video frame to obtain an enhanced second video frame; the data enhancement comprises random flipping and/or angle cropping operations; wherein the video frame includes time information;
the two-way excitation channel grouping layer carries out deep feature extraction on the basis of the enhanced second video frame to obtain deep features;
the segmentation consensus layer calculates the average score of each video frame corresponding to the video to be classified in the same category according to the deep features;
converting the average score to a probability value based on a set function;
and based on the probability values of the video to be classified in all the categories, taking the action category corresponding to the maximum probability value as an action classification result, and outputting the action classification result.
3. The method of claim 2, wherein the two-way excitation channel grouping layer comprises at least four two-way excitation channel grouping modules, and an input of a subsequent two-way excitation channel grouping module in adjacent two-way excitation channel grouping modules is an output of a previous two-way excitation channel grouping module; and the two-way excitation channel grouping layer carries out deep feature extraction based on the enhanced second video frame to obtain deep features, and the deep feature extraction comprises the following steps:
and the double-channel excitation channel grouping module is used for carrying out deep layer feature extraction on the basis of the enhanced second video frame to obtain deep layer sub-features.
4. The method of claim 3, wherein the two-way excitation channel grouping module comprises a plurality of bottleneck units, the bottleneck units being connected in cascade, an input of a subsequent bottleneck unit in adjacent bottleneck units being an output of a previous bottleneck unit; the bottleneck unit comprises a first two-dimensional convolution subunit, a motion excitation subunit, a channel excitation subunit, a channel grouping subunit and a second two-dimensional convolution subunit; the input of the motion excitation subunit and the input of the channel excitation subunit are both the output of the first two-dimensional convolution subunit, the output of the motion excitation subunit and the output of the channel excitation subunit are added, the added output is used as the input of the channel grouping subunit, and the output of the channel grouping subunit is the input of the second two-dimensional convolution subunit; and the two-way excitation channel grouping module performs deep feature extraction based on the enhanced second video frame to obtain deep sub-features, which comprises the following steps:
if the bottleneck unit to which the first two-dimensional convolution subunit belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on the enhanced second video frame to obtain a first convolution feature;
otherwise, the first two-dimensional convolution subunit performs feature extraction based on the output of the previous bottleneck unit to obtain a first convolution feature;
the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature;
the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature;
the channel grouping subunit performs feature extraction based on the feature obtained by adding the motion feature and the channel feature to obtain a long-distance space-time feature;
and the second two-dimensional convolution subunit performs feature extraction based on the long-distance space-time feature to obtain a second convolution feature.
5. The method of claim 4, wherein the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature, and the method comprises:
performing channel number compression on the first convolution feature through a third two-dimensional convolution to obtain a channel compression feature;
for the channel compression features at adjacent moments, performing feature extraction on the channel compression feature at the moment t+1 through a fourth two-dimensional convolution to obtain a fourth convolution feature;
subtracting the fourth convolution feature from the channel compression feature at the moment t to obtain a plurality of motion sub-features; wherein t is a positive integer, and the value range of t is between a first set value and a second set value;
splicing the plurality of motion sub-features in a time dimension to obtain a first complete motion sub-feature;
setting the motion feature of the last moment to a third set value to obtain the motion sub-feature of the last moment;
connecting the first complete motion sub-feature and the motion sub-feature of the last moment in series to obtain a second complete motion sub-feature;
processing the second complete motion sub-feature through global average pooling to obtain a first pooled feature;
adjusting the number of channels of the first pooled feature based on a fifth convolution to obtain an adjusted first pooled feature;
performing feature extraction of an attention mechanism on the first pooled feature to obtain an enhanced motion sub-feature;
and performing residual connection based on the first convolution feature and the enhanced motion sub-feature to obtain the motion feature.
6. The method of claim 4, wherein the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature, and the method comprises:
processing the first convolution feature through global average pooling to obtain a second pooled feature;
adjusting the number of channels of the second pooled feature through a sixth convolution to obtain an adjusted second pooled feature;
interchanging the positions of the channel dimension and the time dimension of the adjusted second pooled feature to obtain a third pooled feature; wherein the second pooled feature comprises a batch size dimension, a time dimension, a channel dimension, a height dimension, and a width dimension;
processing the third pooled feature through a seventh convolution to obtain a first channel sub-feature;
interchanging the positions of the channel dimension and the time dimension of the first channel sub-feature to obtain a second channel sub-feature;
performing feature extraction of an attention mechanism on the second channel sub-feature to obtain an enhanced channel sub-feature;
and performing residual connection based on the first convolution feature and the enhanced channel sub-feature to obtain the channel feature.
7. The method according to any one of claims 4-6, wherein the channel grouping subunit performs feature extraction based on the feature obtained by adding the motion feature and the channel feature to obtain a long-distance spatiotemporal feature, and comprises:
dividing the feature obtained by adding the motion feature and the channel feature on a channel dimension to obtain a set number of long-distance space-time sub-features;
for the second long-distance space-time sub-feature, processing is sequentially carried out through channel-level time sequence sub-convolution and space sub-convolution to obtain a new second long-distance space-time sub-feature;
for the Nth long-distance space-time sub-feature, adding the new (N-1)th long-distance space-time sub-feature and the Nth long-distance space-time sub-feature to obtain a residual feature;
processing the residual feature sequentially through the channel-level time sequence sub-convolution and the space sub-convolution to obtain a new Nth long-distance space-time sub-feature; wherein N is a positive integer greater than 2;
and splicing the first long-distance space-time sub-feature, the new second long-distance space-time sub-feature and the new Nth long-distance space-time sub-feature on a channel dimension to obtain the long-distance space-time feature.
8. A video classification apparatus, comprising:
the video to be classified acquisition module is used for acquiring videos to be classified; the content of the video to be classified comprises behavior actions of at least one target object;
the action classification result obtaining module is used for inputting the first video frame corresponding to the video to be classified into a target video classification model and obtaining an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are in cascade connection.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the video classification method of any of claims 1-7.
10. A storage medium containing computer executable instructions for performing the video classification method of any one of claims 1-7 when executed by a computer processor.
CN202211721670.3A 2022-12-30 2022-12-30 Video classification method, device, equipment and storage medium Active CN115861901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211721670.3A CN115861901B (en) 2022-12-30 2022-12-30 Video classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211721670.3A CN115861901B (en) 2022-12-30 2022-12-30 Video classification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115861901A true CN115861901A (en) 2023-03-28
CN115861901B CN115861901B (en) 2023-06-30

Family

ID=85656324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211721670.3A Active CN115861901B (en) 2022-12-30 2022-12-30 Video classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115861901B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259782A (en) * 2020-01-14 2020-06-09 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation
WO2022161302A1 (en) * 2021-01-29 2022-08-04 腾讯科技(深圳)有限公司 Action recognition method and apparatus, device, storage medium, and computer program product
CN113989940A (en) * 2021-11-17 2022-01-28 中国科学技术大学 Method, system, equipment and storage medium for recognizing actions in video data
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN114821438A (en) * 2022-05-10 2022-07-29 西安交通大学 Video human behavior identification method and system based on multipath excitation
CN115116135A (en) * 2022-06-27 2022-09-27 青岛理工大学 Assembly action recognition model based on motion excitation aggregation and time sequence difference model
CN115439791A (en) * 2022-09-26 2022-12-06 天津理工大学 Cross-domain video action recognition method, device, equipment and computer-readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG ET AL: "Action-net: Multipath excitation for action recognition", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, pages 13214 - 13223 *
吉晨钟等: "改进2DCNN时空特征提取的动作识别研究", 《小型微型计算机系统》, pages 1 - 8 *

Also Published As

Publication number Publication date
CN115861901B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
US10983754B2 (en) Accelerated quantized multiply-and-add operations
CN111476309B (en) Image processing method, model training method, device, equipment and readable medium
CN110929780B (en) Video classification model construction method, video classification device, video classification equipment and medium
CN111402130B (en) Data processing method and data processing device
CN109165573B (en) Method and device for extracting video feature vector
CN112364829B (en) Face recognition method, device, equipment and storage medium
CN113222983A (en) Image processing method, image processing device, readable medium and electronic equipment
CN112766284B (en) Image recognition method and device, storage medium and electronic equipment
CN110163052B (en) Video action recognition method and device and machine equipment
US20220101539A1 (en) Sparse optical flow estimation
CN113449070A (en) Multimodal data retrieval method, device, medium and electronic equipment
CN111460876A (en) Method and apparatus for identifying video
CN113177450A (en) Behavior recognition method and device, electronic equipment and storage medium
CN114519667A (en) Image super-resolution reconstruction method and system
CN111046757A (en) Training method and device for face portrait generation model and related equipment
CN112907628A (en) Video target tracking method and device, storage medium and electronic equipment
CN114420135A (en) Attention mechanism-based voiceprint recognition method and device
CN108257081B (en) Method and device for generating pictures
CN115861901B (en) Video classification method, device, equipment and storage medium
CN116992947A (en) Model training method, video query method and device
CN111898658B (en) Image classification method and device and electronic equipment
CN113705386A (en) Video classification method and device, readable medium and electronic equipment
CN114187557A (en) Method, device, readable medium and electronic equipment for determining key frame
CN112418233A (en) Image processing method, image processing device, readable medium and electronic equipment
CN111539524B (en) Lightweight self-attention module and searching method of neural network framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant