CN109697434B - Behavior recognition method and device and storage medium - Google Patents


Info

Publication number
CN109697434B
CN109697434B (application CN201910012006.3A)
Authority
CN
China
Prior art keywords
video
behavior
feature map
nomination
dimensional
Prior art date
Legal status
Active
Application number
CN201910012006.3A
Other languages
Chinese (zh)
Other versions
CN109697434A (en
Inventor
王吉
陈志博
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910012006.3A
Publication of CN109697434A
Application granted
Publication of CN109697434B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method and device and a storage medium. The scheme acquires a video to be detected and adds a plurality of candidate windows to it; generates, based on a feature extraction network, three-dimensional feature maps of the video to be detected, containing the candidate windows, on a plurality of time-domain scales; determines the time-domain scale matched with the video clip in a candidate window, obtains the three-dimensional feature map corresponding to that scale, and extracts from it the local feature map corresponding to the video clip; and then performs behavior recognition on the local feature map with a preset behavior recognition network to determine the behavior category corresponding to the behavior features in the video clip. Because the feature extraction network produces three-dimensional feature maps of the video to be detected on a plurality of time-domain scales, the receptive field of the classifier can adapt to behavior features of different durations, which improves the accuracy of behavior recognition across a variety of time spans.

Description

Behavior recognition method and device and storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a behavior recognition method, a behavior recognition device and a storage medium.
Background
With the increasing demand for computer intelligence and the rapid development of pattern recognition, image processing and artificial intelligence technologies, there is a strong practical need to analyze video content with computer vision techniques, for example to detect human behavior in video. In the prior art, complex and varied feature patterns are mostly learned from training data by means of a hierarchical neural network, so that features of the input video are extracted effectively and specific behaviors are recognized.
In practical applications, most surveillance and network videos are unsegmented long videos. A long video may contain multiple behavior instances, and the duration of each instance may differ. In existing behavior recognition schemes, however, the video generally has to be compressed or expanded into clips with a fixed number of frames, and features on a single time-domain scale are extracted from these clips by a neural network to recognize the behavior. As a result, the receptive field of the network's classifier only matches behaviors of a specific duration, and recognition accuracy is poor for behaviors that last much longer or much shorter.
Disclosure of Invention
The embodiment of the invention provides a behavior recognition method, a behavior recognition device and a storage medium, and aims to improve the accuracy of behavior recognition of various time spans.
An embodiment of the invention provides a behavior recognition method, which comprises the following steps:
acquiring a video to be detected, and adding a plurality of candidate windows for the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected;
generating a three-dimensional feature map of the video to be detected containing the candidate windows on a plurality of time domain scales based on a feature extraction network;
determining a time domain scale matched with the video segments in the candidate window, and acquiring the three-dimensional feature map corresponding to the determined time domain scale;
acquiring a local feature map corresponding to the video clip according to the acquired three-dimensional feature map;
and performing behavior recognition on the video clips in the candidate windows according to the local feature map and a preset behavior recognition network, and determining behavior categories corresponding to behavior features in the video clips.
An embodiment of the present invention further provides a behavior recognition apparatus, including:
the video acquisition unit is used for acquiring a video to be detected;
the video windowing unit is used for adding a plurality of candidate windows to the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected;
the characteristic acquisition unit is used for generating a three-dimensional characteristic diagram of the video to be detected containing the candidate windows on a plurality of time domain scales based on a characteristic extraction network;
the scale matching unit is used for determining a time domain scale matched with the video segments in the candidate window and acquiring the three-dimensional feature map corresponding to the determined time domain scale;
the feature selection unit is used for acquiring a local feature map corresponding to the video clip according to the acquired three-dimensional feature map;
and the behavior identification unit is used for performing behavior identification on the video clips in the candidate windows according to the local feature map and a preset behavior identification network and determining behavior categories corresponding to the behavior features in the video clips.
The embodiment of the present invention further provides a storage medium, where multiple instructions are stored in the storage medium, and the instructions are suitable for being loaded by a processor to perform any of the steps in the behavior recognition method provided in the embodiment of the present invention.
In the embodiment of the invention, a video to be detected is acquired and a plurality of candidate windows are added to it, each candidate window corresponding to one video clip of the video to be detected. Three-dimensional feature maps of the video to be detected, containing the candidate windows, are then generated on a plurality of time-domain scales based on a feature extraction network, where a larger time-domain scale means that each feature in the three-dimensional feature map corresponds to a longer stretch of time. The time-domain scale matched with the video clip in a candidate window is determined, the three-dimensional feature map corresponding to that scale is obtained, and the local feature map corresponding to the video clip is extracted from it: if the video clip in the candidate window is short, the three-dimensional feature map with a smaller time-domain scale is selected for extracting the local feature map; otherwise, the three-dimensional feature map with a larger time-domain scale is selected. After the local feature map of the video clip in each candidate window has been extracted, behavior recognition is performed on the video clips in the candidate windows according to the local feature maps and a preset behavior recognition network, and the behavior categories corresponding to the behavior features in the video clips are determined. With this scheme, behaviors of various durations in the video to be detected can be recognized, even when one video contains several behaviors of different durations: because the feature extraction network produces three-dimensional feature maps of the video on a plurality of time-domain scales, the receptive field of the classifier can adapt to behavior features of different durations, and the accuracy of behavior recognition across a variety of time spans is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention; other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1a is a schematic view of a scene of an information interaction system according to an embodiment of the present invention;
fig. 1b is a first flowchart of a behavior recognition method according to an embodiment of the present invention;
FIG. 1c is a schematic diagram of a feature extraction network according to an embodiment of the present invention;
FIG. 1d is a schematic diagram of convolution kernels of a spatial domain and time domain separated feature extraction network provided by an embodiment of the present invention;
fig. 1e is a schematic diagram of a first network structure of a behavior recognition method according to an embodiment of the present invention;
fig. 1f is a schematic diagram of a second network structure of the behavior recognition method according to the embodiment of the present invention;
FIG. 1g is a schematic diagram of an interpolation operation provided by an embodiment of the present invention;
fig. 1h is a schematic diagram of a third network structure of the behavior recognition method according to the embodiment of the present invention;
fig. 1i is a schematic diagram of a fourth network structure of a behavior recognition method according to an embodiment of the present invention;
FIG. 1j is a schematic diagram of a network training process according to an embodiment of the present invention;
FIG. 1k is a schematic diagram of another training process of the network according to the embodiment of the present invention;
FIG. 2a is a flow chart of a behavior recognition application scenario provided by an embodiment of the present invention;
FIG. 2b is a schematic diagram of a behavior recognition application scenario provided by an embodiment of the present invention;
fig. 3a is a first schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present invention;
fig. 3b is a second schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present invention;
fig. 3c is a third schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present invention;
fig. 3d is a fourth schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a behavior recognition method, a behavior recognition device and a storage medium.
The embodiment of the invention also provides an information interaction system, which comprises any behavior recognition device provided by the embodiment of the invention. The behavior recognition device may be integrated in a network device such as a terminal or a server. In addition, the system may further include other devices, for example a video capture device or a terminal such as a mobile phone, a tablet computer or a personal computer, used to upload the video to be detected to the network device.
Referring to fig. 1a, an embodiment of the present invention provides an information interaction system, which includes a video capture device and a behavior recognition apparatus; the behavior recognition device is connected with the video acquisition equipment through a wireless network or a wired network, receives a video to be detected sent by the video acquisition equipment, acquires the video to be detected, and adds a plurality of candidate windows for the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected; generating a three-dimensional feature map of a video to be detected containing a plurality of candidate windows on a plurality of time domain scales based on a feature extraction network; then, determining a time domain scale matched with the video clip, acquiring a three-dimensional feature map corresponding to the determined time domain scale, and intercepting a local feature map corresponding to the video clip from the determined three-dimensional feature map; and then, performing behavior recognition on the video clips in the candidate window according to the local feature map and a preset behavior recognition network, and determining behavior categories corresponding to the behavior features in the video clips.
Therefore, three-dimensional feature maps on a plurality of time domain scales in a long video can be extracted through the feature extraction network, matched time domain scales and corresponding three-dimensional feature maps are selected for video segments in the candidate window, then local feature maps are extracted from the three-dimensional feature maps, behavior recognition is carried out according to the local feature maps and the behavior recognition network, and behavior categories corresponding to behavior features in the video segments are determined. According to the scheme, the three-dimensional feature maps of the to-be-detected video on a plurality of time domain scales can be obtained from the to-be-detected video through the feature extraction network, the local feature map with the small time domain scale can be obtained for the video clip with the short duration, and the local feature map with the large time domain scale can be selected for the video clip with the long duration, so that the receptive field of the classifier in the behavior recognition network can adapt to behavior features of different time lengths, and the accuracy of behavior recognition of various time spans is improved.
The above example of fig. 1a is only an example of a system architecture for implementing the embodiment of the present invention, and the embodiment of the present invention is not limited to the system architecture shown in fig. 1a, and various embodiments of the present invention are proposed based on the system architecture.
In the present embodiment, description will be made from the viewpoint of a behavior recognizing apparatus, which may be specifically integrated in a terminal device such as a server, a personal computer, or the like.
As shown in fig. 1b, the specific flow of the behavior recognition method may be as follows:
101. the method comprises the steps of obtaining a video to be detected, and adding a plurality of candidate windows for the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected.
The video to be detected consists of a series of consecutive video frame images, and in this embodiment the number of frames contained in a piece of video is used to measure the length of the video to be detected and, below, of a long video or video clip. After the video to be detected is acquired, candidate windows can be added to it in the time dimension with a multi-scale sliding window. A candidate window is a window that covers one or more frames in the time dimension of the video to be detected, and each candidate window corresponds to one video clip of the video to be detected. For example, the candidate window lengths may be 8, 16, 32, 64 and 128 frames, or 7, 10, 15, 18 and 25 frames, where the set of scales can be preset as needed and the unit is the frame. If the length of a candidate window is 8, the window contains a video clip of 8 frames. To improve the accuracy of behavior recognition, two adjacent candidate windows may overlap, and the overlap may be set to 25%-75% of the window length.
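For illustration only (not part of the patent disclosure), the following sketch shows one way such multi-scale, overlapping candidate windows could be generated; the function name, default scales and 50% overlap are assumptions made for the example.

```python
# Illustrative sketch: multi-scale sliding candidate windows over a video of
# `num_frames` frames. Window lengths and the 25%-75% overlap range follow the
# values mentioned above; the 50% default overlap is an assumption.
def generate_candidate_windows(num_frames, window_lengths=(8, 16, 32, 64, 128), overlap=0.5):
    """Return (start_frame, end_frame) pairs, end exclusive, for every scale."""
    windows = []
    for length in window_lengths:
        stride = max(1, int(length * (1.0 - overlap)))  # 50% overlap -> stride = length / 2
        start = 0
        while start + length <= num_frames:
            windows.append((start, start + length))
            start += stride
    return windows

# Example: a 256-frame video yields candidate windows on several temporal scales.
print(len(generate_candidate_windows(256)))
```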
102. And generating a three-dimensional feature map of the video to be detected containing the candidate windows on a plurality of time domain scales based on a feature extraction network.
The time domain scale in the embodiment of the application refers to the number of video frame images in a video to be detected corresponding to one feature in a three-dimensional feature map, and is used for measuring the size of the three-dimensional feature map in a time dimension. The larger the time domain scale is, the smaller the size of the three-dimensional feature map in the time dimension is, and the larger the number of video frame images in the video to be detected corresponding to one feature in the three-dimensional feature map is, or the larger the time domain scale is, the longer the duration of the video to be detected corresponding to one feature in the three-dimensional feature map is.
For example, from an 800 × 800 × 256 video to be detected, three-dimensional feature maps on three time-domain scales can be extracted, with sizes 195 × 195 × 128, 195 × 195 × 64 and 195 × 195 × 32, all expressed as length × width × number of frames. Here 128, 64 and 32 are the sizes of the three feature maps in the time dimension. The 195 × 195 × 128 feature map has the smallest time-domain scale: one feature in it corresponds to 2 frames of the original video to be detected, i.e. its time-domain scale is 2. The 195 × 195 × 32 feature map has the largest time-domain scale: one feature in it corresponds to 8 frames of the original video to be detected, i.e. its time-domain scale is 8.
In the feature extraction stage, the candidate window is not used for segmenting the video to be detected, the input data of the feature extraction network is still a complete long video, and the video to be detected comprising a plurality of candidate windows is input into the pre-trained feature extraction network for feature extraction.
The feature extraction network is a three-dimensional Convolutional Neural Network (3D-CNN). A typical three-dimensional convolutional neural network mainly comprises convolutional layers, pooling layers, fully connected layers and so on. If the resolution of the video to be detected is high but its length is short, pooling layers can be added to compress the video in the spatial dimensions, reducing the network's weight parameters and computational complexity, while no pooling is performed in the time dimension so as to retain as many features in the time dimension as possible. Hyperparameters of the convolutional neural network, such as the convolution kernel size, stride and zero padding, the number of kernels per convolutional layer and the number of convolutional layers, can be set empirically.
To mine features in the time dimension across consecutive video frame images, the convolutional layers use three-dimensional convolution kernels to extract features from the three-dimensional feature map they receive, i.e. the kernel size in the time dimension is greater than or equal to 2. A neuron of a convolutional layer is connected to part of the neurons of the previous layer; convolution is performed layer by layer and a three-dimensional feature map is output. For example, after an 800 × 800 × 256-frame video is convolved with 8 × 8 × 2 convolution kernels at a stride of 2 × 2 × 2 (i.e. a stride of 2 in all three dimensions), a 397 × 397 × 128 three-dimensional feature map is output, that is, 128 consecutive 397 × 397 feature frames. In this embodiment, since the video is convolved three-dimensionally, the feature maps output by the convolutional layers are all three-dimensional feature maps, i.e. stacks of consecutive two-dimensional feature maps.
Since the multiple scales of the three-dimensional feature map in the scheme are represented in the time domain, that is, in the time dimension, the following description mainly describes the situation of the three-dimensional feature map output by the convolution layer in the time dimension, and the situation of the size change of the video frame image after the convolution in the space domain is not described in detail. The operation can be performed by using convolution kernels with a conventional size in the spatial dimension, for example, the size of the convolution kernels in the spatial dimension is set to 3 × 3 or 2 × 2, the step size is 1, and the number of the convolution kernels is empirically set. The size of the convolution kernel in the time dimension can be set according to the size of the local features to be observed, if the size of the data volume after convolution in the time dimension is kept unchanged, the step length of the convolution operation in the time dimension can be set to be 1, and the corresponding zero filling quantity is set according to the size of the convolution kernel in the time dimension. If the size of the output data volume in the time dimension is to be reduced, a step size larger than 1 may be used, or the step size of the convolution kernel in the time dimension is set to 1, and a pooling layer is added after the convolution layer for down-sampling to reduce the size of the output data volume.
For example, referring to fig. 1c, the video to be detected, the first convolutional layer and the second convolutional layer are arranged from top to bottom. In the video to be detected each cell is one video frame image, while in the first and second convolutional layers each cell is a convolution kernel (i.e. a neuron). The input video to be detected is 256 frames long. The convolution kernel of the first convolutional layer has size 2 in the time dimension and stride 2, so in the first convolutional layer one neuron corresponds to 2 frames of the original input video and observes the features between those 2 frames; after the convolution, the output three-dimensional feature map has length 128 in the time dimension. The convolution kernel of the second convolutional layer also has size 2 in the time dimension and stride 2, so in the second convolutional layer one neuron is connected to two neurons of the first convolutional layer (assuming no pooling layer between the two convolutional layers), corresponds to 4 frames of the original input video and observes the features between those 4 frames; after the convolution, the output three-dimensional feature map has length 64 in the time dimension. As the number of convolutional layers increases, the size of the output feature map in the time dimension gradually decreases and the number of frames of the original video (i.e. the video to be detected) connected to one neuron of the convolutional layer gradually increases; in other words, the receptive field of a neuron in the time dimension of the original video gradually grows, and the time-domain scale of the three-dimensional feature map output by the convolutional layer grows accordingly.
Specifically, in some embodiments, the step of generating a three-dimensional feature map of the video to be detected including the candidate windows on a plurality of time-domain scales based on the feature extraction network "includes:
inputting the video to be detected containing the candidate windows into the feature extraction network, and sequentially performing convolution operation on the convolution layers;
and acquiring a three-dimensional characteristic diagram output by a plurality of continuous convolutional layers as the three-dimensional characteristic diagram of the video to be detected on a plurality of time domain scales, wherein the deeper the layer number of the convolutional layers is, the larger the time domain scale is.
For example, the feature extraction network includes 7 convolutional layers, and output data of the last 5 convolutional layers can be extracted as a three-dimensional feature map of the video to be detected on five time domain scales.
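As a minimal sketch (not taken from the patent; the layer count, channel width and kernel sizes are assumptions), a feature extraction network of this kind could be written as follows, returning the output of every convolutional layer as one three-dimensional feature map per time-domain scale.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureExtractor(nn.Module):
    """Sketch of a 3D-CNN whose successive convolutional layers halve the temporal
    length, so deeper layers yield feature maps of larger time-domain scale.
    Layer count, channel width and kernel sizes are illustrative assumptions."""
    def __init__(self, in_channels=3, width=16, num_layers=6):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(num_layers):
            layers.append(nn.Sequential(
                nn.Conv3d(c, width, kernel_size=(2, 3, 3),
                          stride=(2, 1, 1), padding=(0, 1, 1)),
                nn.ReLU(inplace=True)))
            c = width
        self.layers = nn.ModuleList(layers)

    def forward(self, video):                # video: (N, C, T, H, W)
        feature_maps, x = [], video
        for layer in self.layers:
            x = layer(x)
            feature_maps.append(x)           # one map per time-domain scale
        return feature_maps

maps = MultiScaleFeatureExtractor()(torch.randn(1, 3, 256, 64, 64))
print([m.shape[2] for m in maps])            # temporal sizes [128, 64, 32, 16, 8, 4]
```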
Optionally, in some embodiments, the convolutional neural network is a three-dimensional convolutional neural network in which the spatial domain and the time domain are separated. Referring to fig. 1d, a schematic diagram of the convolution kernel structure in the feature extraction network according to an embodiment of the present invention, convolution kernels of size 1 × d × d are used in the spatial domain (assuming the width and height of the video frame image are equal) and kernels of size t × 1 × 1 are used in the time domain; that is, one convolutional layer contains a two-dimensional spatial convolution kernel and a one-dimensional temporal convolution kernel. Note that when a convolutional layer operates on its input, the spatial convolution and the temporal convolution are performed separately, and either one may be applied first. For example, the convolutional layer may apply the two-dimensional spatial kernel and then the one-dimensional temporal kernel to the input three-dimensional feature map. Specifically, if the spatial convolution is performed first, the convolutional layer applies the 1 × d × d spatial kernel to each of the consecutive input video frame images to obtain a sequence of consecutive two-dimensional feature maps; the t × 1 × 1 temporal kernel is then convolved over these consecutive two-dimensional feature maps, i.e. the convolution is performed in the time dimension (the depth direction) over the pixel data at the same position of the consecutive maps.
Because this scheme convolves three-dimensional video data and uses unsegmented long videos, the feature extraction network has to learn a very large, even redundant, number of weight parameters, which easily leads to overfitting. Using a three-dimensional convolutional neural network in which the spatial and temporal domains are separated therefore reduces the parameters and speeds up computation while also reducing overfitting and improving the accuracy of behavior detection.
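For illustration only, a minimal sketch of such a separated spatial/temporal convolution block follows; the channel widths, kernel sizes and the spatial-then-temporal ordering are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class SeparatedConv3d(nn.Module):
    """Sketch of a spatial/temporal separated convolution: a 1 x d x d spatial
    kernel followed by a t x 1 x 1 temporal kernel, as described above."""
    def __init__(self, in_ch, out_ch, d=3, t=2, temporal_stride=2):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, d, d),
                                 stride=(1, 1, 1), padding=(0, d // 2, d // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(t, 1, 1),
                                  stride=(temporal_stride, 1, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                     # x: (N, C, T, H, W)
        return self.relu(self.temporal(self.relu(self.spatial(x))))

x = torch.randn(1, 3, 64, 56, 56)
print(SeparatedConv3d(3, 16)(x).shape)        # torch.Size([1, 16, 32, 56, 56])
```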
In some embodiments, a convolution expansion operation may also be employed to obtain three-dimensional feature maps on multiple time-domain scales. Specifically, the feature extraction network is a three-dimensional convolutional neural network comprising a plurality of expansion convolutional layers; the step of generating a three-dimensional feature map of the video to be detected containing the candidate windows on a plurality of time domain scales based on the feature extraction network comprises the following steps:
inputting the video to be detected containing the candidate windows into the feature extraction network, and sequentially performing convolution operation in the expansion convolution layers according to corresponding expansion coefficients;
and acquiring a three-dimensional characteristic diagram output by a plurality of continuous expansion convolutional layers as the three-dimensional characteristic diagram of the video to be detected on a plurality of time domain scales.
The feature extraction network here is a three-dimensional convolutional neural network containing a plurality of dilated convolutional layers. In a dilated convolutional layer, the dilated convolution is performed according to the dilation coefficient set for that layer; dilated convolution enlarges the receptive field of a neuron without losing information through pooling, so that each convolution output covers a larger range. The dilation coefficients of the dilated convolutional layers can be set according to the required time-domain scales, so that the output three-dimensional feature maps have the required time-domain scales. Note that the dilated convolution here may be applied only in the time dimension, or in both the spatial and the time dimensions.
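As an illustrative sketch (the layer sizes and dilation coefficients are assumptions), temporally dilated convolutions of this kind could be stacked as follows, doubling the dilation per layer so the temporal receptive field grows without any pooling.

```python
import torch
import torch.nn as nn

# Sketch: stacked 3D convolutions dilated only in the time dimension; dilation
# doubles per layer, enlarging the temporal receptive field while the temporal
# length of the output is preserved by padding.
dilated_stack = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=(1, 1, 1), dilation=(1, 1, 1)),
    nn.ReLU(inplace=True),
    nn.Conv3d(16, 16, kernel_size=(3, 3, 3), padding=(2, 1, 1), dilation=(2, 1, 1)),
    nn.ReLU(inplace=True),
    nn.Conv3d(16, 16, kernel_size=(3, 3, 3), padding=(4, 1, 1), dilation=(4, 1, 1)),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 64, 32, 32)
print(dilated_stack(x).shape)   # temporal length kept at 64; temporal receptive field is 15 frames
```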
103. And determining a time domain scale matched with the video segments in the candidate window, and acquiring the three-dimensional feature map corresponding to the determined time domain scale.
After feature extraction is performed on the video to be detected containing a plurality of candidate windows, three-dimensional feature maps of the video on a plurality of time-domain scales are obtained. Intuitively, the video clip within each candidate window may correspond to three-dimensional feature maps of several time-domain scales. A matched time-domain scale and the corresponding three-dimensional feature map are then selected for the video clip in each candidate window according to the window's length. Specifically, the step of "determining a time-domain scale matched with the video segment in the candidate window, and obtaining the three-dimensional feature map corresponding to the determined time-domain scale" includes: determining the number of video frame images contained in the video segment in the candidate window; and, according to that number, determining the matched time-domain scale and obtaining the corresponding three-dimensional feature map.
In this embodiment the candidate windows are added to the video to be detected with a multi-scale sliding window, so the candidate windows have a variety of lengths. The choice of the time-domain scale matched with a candidate window mainly involves the following parameters: the length of the candidate window, the length of the convolution kernels in the time dimension, and the stride of the convolution. For a convolutional layer, the time-domain scale of its neurons can be computed from the kernel length F in the time dimension, the convolution stride S, and the time-domain scale of the neurons in the previous convolutional layer. The time-domain scale closest to the length of the candidate window is selected as the time-domain scale for the video clip in that window, and the three-dimensional feature map on that scale is taken as the three-dimensional feature map for the clip. If the candidate window is so long that it exceeds the number of original frames covered by a feature on the maximum time-domain scale, the maximum time-domain scale is selected as the time-domain scale for the video clip in that window.
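A minimal sketch of this selection rule follows; the available scale values and the fallback behavior for overly long windows are assumptions made for the example.

```python
# Sketch: choose, for a candidate window of a given length (in frames), the closest
# available time-domain scale; windows longer than the largest scale fall back to it.
def match_temporal_scale(window_length, available_scales=(2, 4, 8, 16, 32, 64)):
    if window_length >= max(available_scales):
        return max(available_scales)
    return min(available_scales, key=lambda s: abs(s - window_length))

print(match_temporal_scale(10))    # 8
print(match_temporal_scale(300))   # 64
```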
104. And acquiring a local characteristic diagram corresponding to the video clip according to the acquired three-dimensional characteristic diagram.
After the time-domain scale is determined, the three-dimensional feature map corresponding to that scale is obtained from the multiple three-dimensional feature maps, and the local feature map corresponding to the video clip is then extracted from it. For example, if the video clip in a certain candidate window spans the 80th to the 102nd frame of the video to be detected, then after the corresponding three-dimensional feature map has been determined, the part of that feature map corresponding to frames 80 to 102 is cut out and used as the local feature map of the clip. In this way the local feature map corresponding to the video clip of every candidate window can be extracted.
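For illustration only, a sketch of such a cut-out, under the assumption that the feature map is stored as an (N, C, T, H, W) tensor and that one temporal feature covers `temporal_scale` original frames:

```python
import torch

def crop_local_feature_map(feature_map, start_frame, end_frame, temporal_scale):
    """Sketch: slice, along the time dimension, the part of a three-dimensional
    feature map (N, C, T, H, W) that corresponds to original-video frames
    [start_frame, end_frame), given how many original frames one feature covers."""
    t_start = start_frame // temporal_scale
    t_end = max(t_start + 1, -(-end_frame // temporal_scale))   # ceiling division
    return feature_map[:, :, t_start:t_end]

# Example: frames 80-102 of the video, on a feature map whose time-domain scale is 8.
fmap = torch.randn(1, 16, 32, 28, 28)           # 32 temporal features for a 256-frame video
local = crop_local_feature_map(fmap, 80, 103, 8)
print(local.shape)                               # torch.Size([1, 16, 3, 28, 28])
```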
105. And performing behavior recognition on the video clips in the candidate windows according to the local feature map and a preset behavior recognition network, and determining behavior categories corresponding to behavior features in the video clips.
After the local feature maps of the video clips are obtained, behavior features in the video clips are identified according to the local feature maps and the behavior identification network, and behavior categories corresponding to the behavior features in the video clips are determined. Specifically, the step of performing behavior recognition on the video segment in the candidate window according to the local feature map and a preset behavior recognition network to determine a behavior category corresponding to the behavior feature in the video segment includes:
selecting a video clip containing behavior characteristics from the video clips in the candidate windows as a nomination clip according to the local feature map and a preset time domain nomination network;
and determining the behavior category corresponding to the nomination fragment according to the local feature map of the nomination fragment and the behavior recognition network.
Inputting the obtained local feature map of the video clips in the candidate window into a preset time domain nomination network, performing preliminary behavior detection, namely judging whether the video clip in one candidate window is a behavior clip or a background clip, and screening out the video clip containing specific behavior features from all the video clips to be used as the nomination clip. If a video segment does not contain any behavior feature, the video segment is a background segment.
After the nomination fragments are screened out, the local feature map of the nomination fragments is used as input data, and behavior recognition is carried out on the basis of a behavior recognition network so as to determine behavior categories corresponding to the nomination fragments.
Because the local feature map is obtained through a plurality of convolution operations, the behavior recognition network can be provided with at least one full connection layer without a convolution layer. In practical application, the scheme can be provided with a plurality of full connection layers, wherein the last full connection layer is a classifier and comprises M +1 nodes, M is a positive integer greater than 1, and M behavior classes and a background class are provided in total. And inputting the local feature map corresponding to the nomination fragment into the network, outputting the confidence of each node, wherein the category corresponding to the node with the highest confidence is the behavior category corresponding to the video fragment.
Referring to fig. 1e, the scheme integrates the following three networks to form a complete behavior recognition model: a feature extraction network, a time domain nomination network and a behavior recognition network. The output of the last network is used as the input of the next network, and finally, the actions contained in the video are recognized from a long video segment, and the action categories of the actions are determined.
Referring to fig. 1f, in order to improve the recognition accuracy of the behavior recognition network, the behavior recognition network is further provided with an interpolation (interpolation) layer for adjusting the local feature map of each nomination segment to a preset length in a time dimension. Specifically, the behavior recognition network comprises an interpolation layer and a full connection layer; the step of determining the behavior category corresponding to the nomination fragment according to the local feature map of the nomination fragment and the behavior recognition network comprises the following steps:
inputting the local feature map of the nomination segment into the behavior recognition network, and adjusting the local feature map to a preset length in a time dimension at the interpolation layer;
and inputting the local feature map output by the interpolation layer into the full-connection layer for behavior recognition, and determining the behavior category corresponding to the nomination segment.
The interpolation layer is used for enabling the length of all the nomination fragments in the time dimension to be equal to the preset length, so that input data of the full-connection layer have the same size in the time dimension, fine position information is reserved, and the classification accuracy of the full-connection layer is improved. Specifically, for example, to change the length of a video segment from 8 frames to 12 frames, the frame interpolation may be performed every two frames. Referring to fig. 1g, a frame indicated by a dotted line is an image frame added after interpolation, where the image frame can be calculated according to two adjacent image frames, for example, a linear interpolation algorithm or a bilinear interpolation algorithm can be used to adjust the length of the nominated fragment.
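As an illustration only (the target length and tensor layout are assumptions), the interpolation step could be sketched with a simple temporal resampling so every nomination segment reaches the same preset length before the fully connected layers.

```python
import torch
import torch.nn.functional as F

def interpolate_to_fixed_length(local_feature_map, target_length=12):
    """Sketch of the interpolation layer: linearly resample a local feature map
    (N, C, T, H, W) along the time dimension to a preset length."""
    n, c, t, h, w = local_feature_map.shape
    return F.interpolate(local_feature_map, size=(target_length, h, w),
                         mode="trilinear", align_corners=False)

x = torch.randn(2, 16, 8, 14, 14)
print(interpolate_to_fixed_length(x).shape)   # torch.Size([2, 16, 12, 14, 14])
```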
Referring to fig. 1h, in some embodiments, the time domain nomination network comprises a first fully connected layer and a second fully connected layer; the step of selecting a video clip containing behavior characteristics from the video clips in the candidate windows as a nomination clip according to the local feature map and a preset time domain nomination network comprises the following steps:
detecting whether the video clips in the candidate windows contain behavior characteristics or not according to the local characteristic graph and the first full-connection layer;
taking a video clip containing behavior characteristics as the nomination clip;
after selecting a video segment containing behavior features from the video segments in the candidate windows as a nomination segment, the method further comprises:
and performing boundary regression on the nomination fragments on the second full-connection layer to obtain a first time boundary of the nomination fragments.
The time domain nomination network comprises at least one first full-connection layer and at least one second full-connection layer, wherein the last first full-connection layer is used for detecting behaviors, nomination fragments are selected from all video fragments, the last second full-connection layer is used for boundary regression, and a first time boundary of behavior occurrence in the nomination fragments is preliminarily determined. Inputting the local feature maps of all nomination segments into a time domain nomination network, carrying out behavior detection through a first full connection layer, detecting whether the video segments contain behavior features or not, and taking the video segments containing the behavior features as the nomination segments; and performing boundary regression on the nomination segments at the second full-connection layer, and determining a first time boundary of each nomination segment, namely determining a starting frame and an ending frame of behavior occurrence in the nomination segments. Therefore, the behavior recognition method can not only recognize the types of behaviors contained in a long video, but also determine the time period of the behavior occurrence.
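For illustration only, a minimal sketch of a time-domain nomination network with the two heads described above follows; the pooled feature dimension, hidden width and boundary parameterization are assumptions.

```python
import torch
import torch.nn as nn

class TemporalProposalNetwork(nn.Module):
    """Sketch of the time-domain nomination network: one fully connected head that
    scores behavior vs. background, and a second that regresses the start/end
    (first time boundary). Feature dimensions are assumed for the example."""
    def __init__(self, feature_dim=1024, hidden_dim=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.ReLU(inplace=True))
        self.cls_head = nn.Linear(hidden_dim, 2)    # behavior / background
        self.reg_head = nn.Linear(hidden_dim, 2)    # start offset, end offset

    def forward(self, pooled_local_features):       # (N, feature_dim)
        h = self.shared(pooled_local_features)
        return self.cls_head(h), self.reg_head(h)

net = TemporalProposalNetwork()
scores, boundaries = net(torch.randn(4, 1024))
print(scores.shape, boundaries.shape)               # torch.Size([4, 2]) torch.Size([4, 2])
```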
Referring to fig. 1i, in some embodiments, the fully-connected layers of the behavior recognition network include a third fully-connected layer and a fourth fully-connected layer; the step of inputting the local feature map output by the interpolation layer to the full-connection layer, performing behavior recognition on the nomination segment, and determining the behavior category corresponding to the nomination segment may include:
inputting the local feature map output by the interpolation layer into the third full-connection layer, performing behavior recognition on the nomination segment, and determining a behavior category corresponding to the nomination segment;
after determining the behavior category corresponding to the nomination fragment, the method further comprises the following steps:
and inputting the local feature map of the nomination segment to the fourth full-connection layer, and performing boundary regression on the nomination segment to obtain a second time boundary of the nomination segment.
The fully connected layers of the behavior recognition network may include a third fully connected layer and a fourth fully connected layer: the third fully connected layer determines the behavior category of the nomination segment, and the fourth fully connected layer performs boundary regression on the classified nomination segment once more to obtain a second time boundary, so that the period in which the behavior occurs is determined more precisely on the basis of the first time boundary, i.e. the start and end frames of the behavior are located accurately.
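A minimal sketch of such a behavior recognition network follows (for illustration only; all layer sizes, the fixed interpolation length and the number of classes M = 20 are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BehaviorRecognitionNetwork(nn.Module):
    """Sketch: temporal interpolation to a fixed length, a classification head over
    M behavior classes plus one background class, and a boundary-regression head."""
    def __init__(self, channels=16, spatial=7, fixed_length=12, num_classes=20, hidden=512):
        super().__init__()
        feat_dim = channels * fixed_length * spatial * spatial
        self.fixed_length = fixed_length
        self.fc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True))
        self.cls_head = nn.Linear(hidden, num_classes + 1)   # M behavior classes + background
        self.reg_head = nn.Linear(hidden, 2)                 # refined start/end (second boundary)

    def forward(self, local_feature_map):                    # (N, C, T, H, W), T varies
        n, c, t, h, w = local_feature_map.shape
        x = F.interpolate(local_feature_map, size=(self.fixed_length, h, w),
                          mode="trilinear", align_corners=False)
        x = self.fc(x.flatten(1))
        return self.cls_head(x), self.reg_head(x)

net = BehaviorRecognitionNetwork()
logits, refined = net(torch.randn(3, 16, 9, 7, 7))
print(logits.shape, refined.shape)                           # torch.Size([3, 21]) torch.Size([3, 2])
```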
Optionally, the embodiment of the present application further includes a training process of the network. In this implementation, a complete behavior recognition model is formed by three networks, and therefore, the three networks are trained as a whole. The method further comprises the following steps:
collecting a sample video, and adding a plurality of candidate windows to the sample video, wherein each candidate window corresponds to one video clip of the sample video;
adding a two-class label and a multi-class label to the video clip in each candidate window, wherein the two-class label comprises a behavior label and a background label;
generating a three-dimensional feature map of the sample video containing a plurality of candidate windows on a plurality of time domain scales according to the sample video added with the two-class label and the multi-class label and the feature extraction network after the weight initialization;
determining a time domain scale matched with the video clip, and acquiring a three-dimensional characteristic diagram corresponding to the determined time domain scale;
acquiring a local feature map corresponding to the video clip according to the acquired three-dimensional feature map;
taking a local feature map with a behavior label as a positive sample, taking a local feature map with a background label as a negative sample, and inputting the positive sample and the negative sample into a time domain nomination network for training;
inputting the local feature map with the multi-classification labels into a behavior recognition network for training;
and repeatedly executing the steps to carry out iterative training until the loss functions of the time domain nomination network and the behavior recognition network are smaller than a preset threshold value, and determining the parameters of the characteristic extraction network, the time domain nomination network and the behavior recognition network.
Referring to fig. 1j, a sample video to which a two-class label and a multi-class label are added is taken as a training sample. The two-class labels are used in the training stage of the time domain nomination network, and the multi-class labels are used in the training stage of the behavior recognition network. The structure of the feature extraction network and necessary hyper-parameters are preset, and the network is weight initialized according to a preset algorithm, for example, by using an Xavier initialization method, a gaussian distribution initialization method, and the like.
In this embodiment, the time domain nomination network solves a binary problem, i.e. determines whether a video segment contains a specific behavior feature. The time domain nomination network comprises a convolutional layer and at least one fully-connected layer, wherein the fully-connected layer is provided with two nodes, and each node is fully connected with a neuron of the previous layer. The number of the convolutional layers may be set as needed, and is not particularly limited. The time domain nomination network is obtained by training a local feature map of a sample video with a two-class label. Behavior recognition networks solve a multi-classification problem. The network is obtained by training a convolutional neural network by using a local feature map of a sample video with a multi-class label.
A large number of long videos each containing one or more specific behaviors are collected as sample videos, candidate windows are added to each sample video according to the method of step 101, and a two-class label and a multi-class label are added to the video clip in each candidate window. The two-class label includes a behavior label and a background label and marks whether a behavior occurs in the video clip corresponding to the candidate window: for example, label 1 is a behavior label indicating that the clip in the window contains a behavior feature, and label 0 is a background label indicating that the clip contains no behavior feature and is a background clip. The sample labels used to train the behavior recognition network are multi-class labels, i.e. M+1 classes: a label of 1, 2, 3, …, M means that a specific behavior of the corresponding class occurs in the video clip in the candidate window, and a label of 0 means that the clip is a background clip.
The sample videos with two-class and multi-class labels are input into the weight-initialized feature extraction network, three-dimensional feature maps on multiple time-domain scales are extracted, the time-domain scale corresponding to each candidate window is selected according to the window length, and the local feature map matched with the video clip in the window is then cut out of the complete three-dimensional feature map. After the local feature maps of the video clips in the candidate windows have been obtained, each candidate window, and therefore each local feature map, carries either a behavior label or a background label. The local feature maps with behavior labels are used as positive samples and those with background labels as negative samples, and both are input into the pre-constructed convolutional neural network for training. The behavior recognition network is trained in a similar way to the time-domain nomination network, the difference being that the video clips in the candidate windows use multi-class labels: the local feature maps of behavior clips with multi-class labels are input into the behavior recognition network for training. Iterative training proceeds in this way until the loss functions of the time-domain nomination network and the behavior recognition network are below a preset threshold. Since training a convolutional neural network is a process of continually minimizing the loss function, the training is complete when the loss reaches its target, i.e. falls below the preset threshold, and the weight parameters of the three networks can then be determined.
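As an illustration only (tensor names, shapes and the unweighted sum are assumptions), the per-iteration losses of this joint training could be sketched as follows: binary cross entropy for the nomination network and (M+1)-way cross entropy for the behavior recognition network, summed and backpropagated through all three networks.

```python
import torch
import torch.nn.functional as F

proposal_logits = torch.randn(8, 2)            # behavior / background scores
binary_labels = torch.randint(0, 2, (8,))      # 1 = behavior label, 0 = background label
cls_logits = torch.randn(8, 21)                # assumed M = 20 behavior classes + background
multiclass_labels = torch.randint(0, 21, (8,))

proposal_loss = F.cross_entropy(proposal_logits, binary_labels)
recognition_loss = F.cross_entropy(cls_logits, multiclass_labels)
total_loss = proposal_loss + recognition_loss  # minimized until below the preset threshold
print(float(total_loss))
```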
It is understood that in other embodiments, a complete behavior recognition model may be formed by the feature extraction network and the behavior recognition network, and the two networks may be trained as a whole. For sample videos, only multiple classification labels need to be added. Specifically, the training process includes:
collecting a sample video, and adding a plurality of candidate windows to the sample video, wherein each candidate window corresponds to one video clip of the sample video;
adding a classification label to the video clip in each candidate window;
generating a three-dimensional feature map of the sample video containing the candidate window on a plurality of time domain scales according to the sample video added with the classification label and the feature extraction network after the initialization weight;
determining a time domain scale matched with the video clip, and acquiring a three-dimensional characteristic diagram corresponding to the determined time domain scale;
acquiring a local feature map corresponding to the video clip according to the acquired three-dimensional feature map;
inputting the local feature map with the classification label into a behavior recognition network for training;
and repeatedly executing the steps to carry out iterative training until the loss function of the behavior recognition network is smaller than a preset threshold value, and determining the parameters of the feature extraction network and the behavior recognition network.
In some embodiments, during the training phase of the neural network, a detection plus segmentation multitask training scheme is employed. Referring to fig. 1k, a video segmentation network is added after the time domain nomination network, and is used for segmenting the video segments into video frame images, and adding labels to the segmented video frame images according to the labels corresponding to the video segments. The accurate frame segmentation task is introduced through the segmentation network, so that the model can learn the behavior classification of each frame, the weight parameters of the feature extraction network, the time domain nomination network and the behavior identification network can be optimized, and the accuracy of behavior detection can be obviously improved. And moreover, the segmentation task of each category can be trained through the sigmoid function and the cross entropy loss, and no competitive relationship exists among different categories.
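A minimal sketch of such a per-frame, per-class segmentation loss follows (tensor shapes and names are assumptions): each class is scored with an independent sigmoid and a binary cross-entropy loss, so different classes do not compete with one another.

```python
import torch
import torch.nn.functional as F

num_classes = 20
frame_logits = torch.randn(4, 64, num_classes)   # (segments, frames, classes)
frame_labels = torch.zeros(4, 64, num_classes)   # per-frame multi-label targets
frame_labels[:, 10:30, 3] = 1.0                  # e.g. class 3 occurs in frames 10-29

# Sigmoid + cross entropy per class: no softmax, hence no competition between classes.
segmentation_loss = F.binary_cross_entropy_with_logits(frame_logits, frame_labels)
print(float(segmentation_loss))
```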
As can be seen from the above, in the embodiment of the present invention a video to be detected is acquired and a plurality of candidate windows are added to it, each candidate window corresponding to one video clip of the video to be detected; three-dimensional feature maps of the video to be detected, containing the candidate windows, are generated on a plurality of time-domain scales based on a feature extraction network; the time-domain scale matched with each video clip is determined, the three-dimensional feature map corresponding to that scale is obtained, and the local feature map corresponding to the clip is extracted from it: if the video clip in the candidate window is short, the three-dimensional feature map with a smaller time-domain scale is selected for extracting the local feature map, and otherwise the three-dimensional feature map with a larger time-domain scale is selected. After the local feature map of the video clip in each candidate window has been extracted, behavior recognition is performed according to the local feature maps and a preset behavior recognition network, and the behavior categories corresponding to the behavior features in the video clips are determined. With this scheme, behaviors of various durations in the video to be detected can be recognized, even when one video contains several behaviors of different durations: because the feature extraction network produces three-dimensional feature maps of the video on a plurality of time-domain scales, the receptive field of the classifier can adapt to behavior features of different durations, and the accuracy of behavior recognition across a variety of time spans is improved.
The method according to the preceding embodiment is illustrated in further detail below by way of example.
For example, referring to fig. 2a and 2b, this embodiment is described by taking as an example a case in which the behavior recognition apparatus is integrated in a network device, namely a server.
(I) Training the convolutional neural networks.
This stage mainly comprises the training of the feature extraction network, the time domain nomination network and the behavior recognition network. In this scheme the three networks are trained as a whole; for the specific training method, reference may be made to the above embodiments, which is not repeated here.
(II) Acquiring the video to be detected.
And the server receives the video to be detected sent by the video acquisition equipment.
(III) Adding candidate windows to the video to be detected.
A plurality of candidate windows of different sizes are added to the video to be detected based on multi-scale sliding windows, and the candidate windows may overlap. Referring to fig. 2b, a first candidate window to an nth candidate window are added to the video to be detected, n candidate windows in total.
(IV) Acquiring local feature maps matched with the video clips in the candidate windows.
The acquired video to be detected is input into the pre-trained feature extraction network, which outputs three-dimensional feature maps of the video on a plurality of time domain scales. Referring to fig. 2b, three-dimensional feature maps on i time domain scales, such as a first time domain scale, a second time domain scale, ..., an ith time domain scale, are extracted by the feature extraction network; the specific value of i can be set by the user as needed, for example i = 3-7. Then, the local feature maps corresponding to the video clips in the first candidate window to the nth candidate window are extracted from these three-dimensional feature maps, so that a local feature map matched with the video clip in each candidate window is obtained.
A specific feature extraction network is presented here for illustration. Assume in one embodiment that the feature extraction network includes six convolutional layers, convolutional layers 1 to 6, and that for each of them the size F of the convolution kernel in the time dimension is 2 and the stride S is 2. The time domain scale of the three-dimensional feature map output by a convolutional layer is the number of video frame images of the original video connected to one neuron of that layer. Thus, in convolutional layer 1, one neuron (that is, one convolution kernel) is connected to two frames of the video to be detected, so the time domain scale of the three-dimensional feature map output by this layer is 2. In convolutional layer 2, one neuron is connected to two neurons of convolutional layer 1 and hence to four consecutive frames of the video to be detected, so after the convolution operation the time domain scale of the three-dimensional feature map output by this layer is 4. By analogy, the time domain scale of the three-dimensional feature map output by convolutional layer 3 is 8, that of convolutional layer 4 is 16, that of convolutional layer 5 is 32, and that of convolutional layer 6 is 64. The maximum time domain scale of the three-dimensional feature map that this network can output is therefore 64.
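The layer-by-layer growth of the time domain scale can be reproduced with a short receptive-field computation; the sketch below simply encodes the F = 2, S = 2 configuration assumed above.

```python
# Time domain scale (temporal receptive field) of each convolutional layer for a
# kernel of size F = 2 and stride S = 2 in the time dimension.
def temporal_scales(num_layers, F=2, S=2):
    scales = []
    scale, jump = 1, 1            # receptive field and cumulative stride at the input
    for _ in range(num_layers):
        scale = scale + (F - 1) * jump
        jump = jump * S
        scales.append(scale)
    return scales

print(temporal_scales(6))         # [2, 4, 8, 16, 32, 64]
```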
In addition, for this feature extraction network, the convolutional layer k (and hence the time domain scale L of its output) matched to a candidate window may be calculated according to the following formula:

k = k₀ + log₂(ω / 64)

where k₀ is a reference value set to 6, i.e. the index of the last convolutional layer, 64 is the time domain scale of the three-dimensional feature map output by that layer, and ω is the length of the candidate window. This formula is applicable to convolution operations with a stride of 2, and 64 can be replaced with another value depending on the time domain scale of the feature map output by the last convolutional layer. Assuming that the candidate window length is 32, it can be calculated that k is 5 and the time domain scale L is 32, that is, the three-dimensional feature map output by convolutional layer 5 is selected as the three-dimensional feature map matching the video segment in the candidate window, and the local feature map corresponding to that video segment is extracted from the output of convolutional layer 5. For example, if the length of the video to be detected is 512 frames and one candidate window covers the 33rd to 64th frames (a length of 32 frames), the local feature map corresponding to those 32 video frame images is extracted from the complete three-dimensional feature map.
It will be appreciated that the sliding window sizes in the above example exactly match the available time domain scales. If other sliding window sizes are used, for example window sizes of 10, 20, 30, 40 and so on, the k value calculated by the above formula may be fractional, in which case it is rounded.
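Under the same assumptions (k₀ = 6, a maximum time domain scale of 64), the layer selection with rounding can be sketched as follows; the rounding rule and the clamping range are illustrative choices rather than the embodiment's exact behavior.

```python
import math

def select_layer(window_len, k0=6, max_scale=64, num_layers=6):
    """Pick the convolutional layer whose time domain scale matches a candidate window.

    k0 is the index of the last convolutional layer, max_scale the time domain scale
    of its output; window lengths that are not powers of two give a fractional k,
    which is rounded to the nearest existing layer.
    """
    k = k0 + math.log2(window_len / max_scale)
    return max(1, min(num_layers, int(round(k))))

print(select_layer(32))   # matches the layer whose time domain scale is 32
print(select_layer(20))   # fractional k, rounded to the nearest layer
```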
(V) Screening nomination segments from the video to be detected and determining a first time boundary of behavior occurrence.
Segments that may contain specific behaviors are screened out of the candidate windows by the pre-trained time domain nomination network to serve as nomination segments, and the time period in which a behavior occurs in each segment is preliminarily determined, that is, a first time boundary is determined. For a specific implementation, reference may be made to the description in step 104 of the foregoing behavior recognition method embodiment, which is not repeated here. Referring to fig. 2b, nomination segment 1, nomination segment 2, ..., nomination segment k are screened from the video to be detected, k nomination segments in total.
(VI) Determining the behavior category of the nomination segments and determining a second time boundary of the behavior.
The behavior recognition network comprises a classification sub-network and a regression sub-network. The last fully connected layer in the classification sub-network acts as a classifier: it can recognize M+1 classes including the background segment class, and accurately identifies the class of the behavior in a selected nomination segment; if it detects that a nomination segment does not contain any class of behavior, that segment is judged to be a background segment. After the behavior category corresponding to each nomination segment is determined, boundary regression is performed on the nomination segments of determined behavior categories (excluding background segments) through the regression sub-network to determine a second time boundary of the behavior, and the specific time period of the behavior is then determined according to this second time boundary.
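The two sub-networks described here can be sketched as a pair of heads over the pooled local feature of a nomination segment: an (M+1)-way classifier covering M behavior classes plus the background class, and a regressor producing the second time boundary. The layer sizes and names below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    """Classification sub-network (M behavior classes + 1 background class) and
    regression sub-network producing a refined (start, end) time boundary."""
    def __init__(self, feat_dim=512, num_classes=20):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes + 1)  # M + 1 outputs
        self.regressor = nn.Linear(feat_dim, 2)                 # second time boundary

    def forward(self, x):
        return self.classifier(x), self.regressor(x)

head = RecognitionHead()
feat = torch.randn(4, 512)            # pooled local features of 4 nomination segments
logits, boundaries = head(feat)
pred_class = logits.argmax(dim=1)     # index M (here 20) would mean "background segment"
```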
With this scheme, the server can perform behavior recognition on a long video sent by the video acquisition device, determine the behavior category of one or more specific behaviors occurring in the long video, and determine the occurrence time period of each behavior. The behavior categories and the corresponding occurrence time periods are then sent to the video acquisition device.
In order to implement the above method, an embodiment of the present invention further provides a behavior recognition apparatus, which may be specifically integrated in a terminal device, such as a server or a personal computer.
For example, as shown in fig. 3a, the behavior recognition apparatus may include a video acquisition unit 301, a video windowing unit 302, a feature acquisition unit 303, a scale matching unit 304, a feature selection unit 305, and a behavior recognition unit 306, as follows:
(i) a video acquisition unit 301;
a video acquiring unit 301, configured to acquire a video to be detected. The video to be detected can be collected by video collecting equipment in real time or can be uploaded by a user through a terminal.
(ii) a video windowing unit 302;
a video windowing unit 302, configured to add a plurality of candidate windows to the video to be detected, where each candidate window corresponds to a video segment of the video to be detected.
The video to be detected is composed of a series of consecutive video frame images, and in this embodiment the number of video frames contained in a piece of video is used to measure the length of a long video or video segment. After the video obtaining unit 301 obtains the video to be detected, the video windowing unit 302 adds candidate windows to the video to be detected in the time dimension by using multi-scale sliding windows, for example windows of the following sizes: 8, 16, 32, 64, 128. In order to improve the accuracy of behavior recognition, two adjacent candidate windows may overlap, and the overlap may be set to 25%-75%.
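The multi-scale sliding-window generation can be sketched as below; the window sizes are those listed above, while the fixed 50% overlap and the stride policy are illustrative assumptions within the 25%-75% range.

```python
def candidate_windows(num_frames, sizes=(8, 16, 32, 64, 128), overlap=0.5):
    """Generate (start, end) frame indices of multi-scale sliding candidate windows;
    adjacent windows of the same size overlap by the given ratio."""
    windows = []
    for size in sizes:
        stride = max(1, int(size * (1 - overlap)))
        start = 0
        while start + size <= num_frames:
            windows.append((start, start + size))
            start += stride
    return windows

wins = candidate_windows(512)
print(len(wins), wins[:3])   # e.g. the first size-8 windows: (0, 8), (4, 12), (8, 16)
```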
(iii) a feature acquisition unit 303;
the feature obtaining unit 303 is configured to generate a three-dimensional feature map of the video to be detected, which includes a plurality of candidate windows, on a plurality of time domain scales based on the feature extraction network.
The feature extraction network is a three-dimensional convolutional neural network; in this embodiment it comprises a plurality of convolutional layers and no fully connected layer. The feature acquisition unit 303 performs feature extraction on the three-dimensional feature map input to the current layer using a three-dimensional convolution kernel, that is, a kernel whose size in the time dimension is greater than or equal to 2. One neuron of a convolutional layer is connected to only part of the neurons of the previous layer, and the convolution operation is carried out layer by layer to output the three-dimensional feature maps. For a specific implementation, reference may be made to the description in step 102 of the foregoing behavior recognition method embodiment, which is not repeated here.
Referring to fig. 3b, in some embodiments, the feature extraction network is a three-dimensional convolutional neural network including a plurality of convolutional layers, and the feature obtaining unit 303 may include a convolution operator unit 3031 and a feature obtaining unit 3032, where:
a convolution operation subunit 3031, configured to input the video to be detected including the multiple candidate windows into the feature extraction network, and perform convolution operation on the multiple convolution layers in sequence;
the feature obtaining subunit 3032 is configured to obtain the three-dimensional feature maps output by the last several consecutive convolutional layers as the three-dimensional feature maps of the video to be detected on a plurality of time domain scales, where the deeper the convolutional layer, the larger its time domain scale.
In some embodiments, the convolutional neural network is a three-dimensional convolutional neural network in which the spatial domain and the time domain are separated. Fig. 1d is a schematic diagram of a convolution kernel structure in a feature extraction network according to an embodiment of the present invention. Convolution kernels of size 1 × d are used in the spatial domain and 1 × t in the time domain. That is, one convolutional layer includes a two-dimensional spatial convolution kernel and a one-dimensional time domain convolution kernel. It should be noted that when a convolutional layer performs the convolution operation on its input data, the spatial convolution and the time domain convolution are performed separately, and either the time domain convolution or the spatial convolution may be performed first. The convolution operation subunit is further configured to, when performing the convolution operation in a convolutional layer of the feature extraction network, apply the two-dimensional spatial convolution kernel and the one-dimensional time domain convolution kernel to the input three-dimensional feature map in sequence. Specifically, if the spatial convolution is performed first, the convolutional layer applies the 1 × d two-dimensional spatial convolution kernel to each of the input consecutive video frame images to obtain a plurality of consecutive two-dimensional feature maps; the 1 × t one-dimensional time domain convolution kernel is then applied to these consecutive two-dimensional feature maps, performing the convolution in the time dimension, that is, in the depth direction, over the pixel data at the same position of the consecutive two-dimensional maps.
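One such spatio-temporally separated convolutional layer might be sketched as follows; the channel counts, the kernel sizes d and t, the temporal stride and the order of the two convolutions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SeparatedConv3d(nn.Module):
    """Factorizes a 3D convolution into a 2D spatial convolution (1 x d x d)
    followed by a 1D time domain convolution (t x 1 x 1)."""
    def __init__(self, in_ch, out_ch, d=3, t=3):
        super().__init__()
        # Conv3d kernel_size order is (time, height, width)
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(t, 1, 1),
                                  stride=(2, 1, 1), padding=(t // 2, 0, 0))

    def forward(self, x):                    # x: (batch, channels, frames, height, width)
        return self.temporal(self.spatial(x))

layer = SeparatedConv3d(3, 64)
clip = torch.randn(1, 3, 16, 112, 112)       # a 16-frame RGB clip
print(layer(clip).shape)                     # temporal stride 2 halves the frame dimension
```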
In this scheme the convolution is performed on three-dimensional video data, and an unsegmented long video is used, so the feature extraction network has a very large number of weight parameters to learn, possibly with redundancy, which easily causes overfitting. Adopting a three-dimensional convolutional neural network with separated spatial and time domains therefore reduces the number of parameters and increases the computation speed while also reducing the degree of overfitting, improving the detection effect compared with ordinary three-dimensional convolution.
In some embodiments, a convolution expansion operation may also be employed to obtain three-dimensional feature maps on multiple time-domain scales. Specifically, the feature extraction network is a three-dimensional convolutional neural network comprising a plurality of expansion convolutional layers; the feature acquisition unit 303 may be further configured to:
inputting the video to be detected containing a plurality of candidate windows into the feature extraction network, and sequentially performing convolution operation in the plurality of expansion convolution layers according to corresponding expansion coefficients;
and acquiring a three-dimensional characteristic diagram output by a plurality of continuous expansion convolutional layers as the three-dimensional characteristic diagram of the video to be detected on a plurality of time domain scales.
The feature extraction network here is a three-dimensional convolutional neural network comprising a plurality of expansion convolutional layers. In an expansion convolutional layer, the expansion convolution operation is carried out according to the expansion coefficient set for that layer; expansion convolution enlarges the receptive field of a neuron without losing information through pooling, so that each convolution output contains information from a larger range. The expansion coefficient of each expansion convolutional layer can therefore be set according to the required time domain scale, so that the output three-dimensional feature map has the required time domain scale.
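A sketch of such an expansion (dilated) convolution in the time dimension is given below; the channel count, kernel size and dilation coefficient are illustrative assumptions.

```python
import torch
import torch.nn as nn

# With kernel size 3 and dilation coefficient 2 in the time dimension, one output
# position covers a span of 5 frames without any pooling, enlarging the temporal
# receptive field while keeping the frame resolution.
dilated = nn.Conv3d(64, 64, kernel_size=(3, 1, 1),
                    dilation=(2, 1, 1), padding=(2, 0, 0))

feat = torch.randn(1, 64, 32, 28, 28)        # (batch, channels, frames, height, width)
print(dilated(feat).shape)                   # the frame dimension stays 32
```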
(iv) a scale matching unit 304;
a scale matching unit 304, configured to determine a time-domain scale matched with the video segment in the candidate window, and obtain the three-dimensional feature map corresponding to the determined time-domain scale.
After the three-dimensional feature maps of the long video on multiple time domain scales are acquired, the video segment in each candidate window may correspond to three-dimensional feature maps on several time domain scales. The scale matching unit 304 therefore selects, according to the length of each candidate window, the three-dimensional feature map of the matched time domain scale for the video segment in that window.
Referring to fig. 3c, in some embodiments, the feature extraction network is a three-dimensional convolutional neural network including a plurality of convolutional layers, and the scale matching unit 304 may include a number determination subunit 3041 and a scale determination subunit 3042, where:
a number determining subunit 3041, configured to determine the number of video frame images included in the video segment in the candidate window;
a scale determining subunit 3042, configured to determine, according to the number, a time domain scale matched with the video segment, and obtain a three-dimensional feature map corresponding to the determined time domain scale.
(v) a feature selection unit 305;
a feature selection unit 305, configured to obtain a local feature map corresponding to the video segment according to the obtained three-dimensional feature map.
After the time-domain scale and the corresponding three-dimensional feature map are determined, the feature selection unit 305 cuts out the local feature map corresponding to the video segment from the determined three-dimensional feature map.
For a specific implementation, reference may be made to the descriptions in step 103 and step 104 of the foregoing embodiments of the behavior recognition method, which are not described herein again.
(vi) a behavior recognition unit 306;
the behavior recognition unit 306 is configured to perform behavior recognition on the video segment in the candidate window according to the local feature map and a preset behavior recognition network, and determine a behavior category corresponding to the behavior feature in the video segment.
After the local feature maps of the video clips are obtained, behavior features in the video clips are identified according to the local feature maps and the behavior identification network, and behavior categories corresponding to the behavior features in the video clips are determined. Referring to fig. 3d, in some embodiments, the behavior identification unit 306 includes a fragment filtering subunit 3061 and a behavior identification subunit 3062, wherein:
a segment screening subunit 3061, configured to select, according to the local feature map and a preset time domain nomination network, a video segment containing behavior features from the video segments in the multiple candidate windows as a nomination segment;
and the behavior recognition subunit 3062 is configured to determine a behavior category corresponding to the nomination fragment according to the local feature map of the nomination fragment and the behavior recognition network.
The obtained local feature map of the video segment in each candidate window is input by the segment screening subunit 3061 into the preset time domain nomination network for preliminary behavior detection, that is, to judge whether the video segment of a candidate window is a behavior segment or a background segment, and the video segments containing specific behavior features are screened out of all the video segments as nomination segments. If a video segment does not contain any behavior feature, it is a background segment.
After the segment screening subunit 3061 screens out the nomination segments containing behavior features, the behavior recognition subunit 3062 takes the local feature map of each nomination segment as input data and performs behavior recognition based on the behavior recognition network to determine the behavior category corresponding to the nomination segment. Because the local feature map has already been obtained through a plurality of convolution operations, the behavior recognition network may contain no convolutional layer and at least one fully connected layer. In practice, several fully connected layers may be provided; the last fully connected layer is a classifier with M+1 nodes, where M is a positive integer greater than 1, corresponding to M behavior classes and one background class. The local feature map corresponding to a nomination segment is input into this network, the confidence of each node is output, and the category corresponding to the node with the highest confidence is the behavior category of the video segment.
As can be seen from the above, in the embodiment of the present invention, the video acquisition unit 301 acquires a video to be detected, and the video windowing unit 302 adds a plurality of candidate windows to it, each candidate window corresponding to one video clip of the video to be detected. The feature obtaining unit 303 generates three-dimensional feature maps of the video to be detected containing the plurality of candidate windows on a plurality of time domain scales based on the feature extraction network. Then the scale matching unit 304 determines the time domain scale matched with each video clip and obtains the three-dimensional feature map corresponding to the determined time domain scale, and the feature selection unit 305 obtains the local feature map corresponding to the video clip from that three-dimensional feature map: if the video clip in a candidate window is shorter, a three-dimensional feature map with a smaller time domain scale can be selected for extracting the local feature map; otherwise, a three-dimensional feature map with a larger time domain scale is selected. After the local feature map of the video clip in each candidate window has been extracted, the behavior recognition unit 306 performs behavior recognition according to the local feature maps and the preset behavior recognition network and determines the behavior categories corresponding to the behavior features in the video clips. With this scheme, behaviors of various durations in the video to be detected can be recognized. Even if one video contains several behaviors of different durations, the feature extraction network can obtain three-dimensional feature maps of the video on a plurality of time domain scales, so that the receptive field of the classifier adapts to behavior features of different time lengths, improving the accuracy of behavior recognition across various time spans.
In some embodiments, the time domain nomination network comprises a first fully connected layer and a second fully connected layer; the fragment filter subunit 3061 may also be used for:
detecting whether the video clip contains behavior characteristics or not according to the local characteristic graph and the first full-connection layer;
taking a video clip containing behavior characteristics as the nomination clip;
the behavior identification device further comprises a first regression unit, wherein the first regression unit is used for performing boundary regression on the nomination segment on the second full-connection layer to obtain a first time boundary of the nomination segment.
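A sketch of this nomination structure, a first fully connected layer for the behavior/background decision and a second fully connected layer for the first time boundary, could look as follows; the feature dimension, threshold and output conventions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalNominationNet(nn.Module):
    """First FC layer: behavior vs. background score; second FC layer: coarse
    (start, end) regression giving the first time boundary of kept segments."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc_cls = nn.Linear(feat_dim, 2)   # behavior / background
        self.fc_reg = nn.Linear(feat_dim, 2)   # first time boundary (start, end)

    def forward(self, x):
        scores = self.fc_cls(x).softmax(dim=1)
        keep = scores[:, 0] > 0.5              # index 0 is taken to mean "contains behavior"
        return keep, self.fc_reg(x)

net = TemporalNominationNet()
feats = torch.randn(8, 512)                    # pooled local features of 8 candidate windows
is_nomination, first_boundary = net(feats)
```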
In some embodiments, the behavior recognition network includes an interpolation layer and a fully connected layer; the behavior recognition subunit 3062 may also be used to:
inputting the local feature map of the nomination segment into the behavior recognition network, and adjusting the local feature map to a preset length in a time dimension at the interpolation layer;
and inputting the local feature map output by the interpolation layer into the full-connection layer, performing behavior recognition on the nomination segment, and determining a behavior category corresponding to the nomination segment.
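The interpolation layer's adjustment of the local feature map to a preset temporal length can be sketched with a standard resampling call; the preset length of 16 and the feature shape are assumptions.

```python
import torch
import torch.nn.functional as F

# Local feature maps of nomination segments have varying temporal lengths; each is
# resampled to a preset length so the fully connected layers receive a fixed-size input.
local_feat = torch.randn(1, 256, 9, 7, 7)      # (batch, channels, time, height, width)
fixed = F.interpolate(local_feat, size=(16, 7, 7), mode='trilinear', align_corners=False)
print(fixed.shape)                              # torch.Size([1, 256, 16, 7, 7])
```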
In some embodiments, the fully connected layers of the behavior recognition network include a third fully connected layer and a fourth fully connected layer. The behavior recognition subunit 3062 may also be configured to: input the local feature map output by the interpolation layer into the third fully connected layer, perform behavior recognition on the nomination segment, and determine the behavior category corresponding to the nomination segment. The behavior recognition device further comprises a second regression unit, configured to input the local feature map of the nomination segment into the fourth fully connected layer and perform boundary regression on the nomination segment to obtain a second time boundary of the nomination segment.
In some embodiments, the behavior recognition device may further include a network training unit, which may be configured to: collect a sample video and add a plurality of candidate windows to the sample video, each candidate window corresponding to one video clip of the sample video; add a two-class label and a multi-class label to the video clip in each candidate window, the two-class labels comprising a behavior label and a background label; generate three-dimensional feature maps of the sample video containing the plurality of candidate windows on a plurality of time domain scales according to the labeled sample video and the weight-initialized feature extraction network; determine the time domain scale matched with each video clip and acquire the three-dimensional feature map corresponding to the determined time domain scale; acquire the local feature map corresponding to the video clip according to the acquired three-dimensional feature map; take local feature maps with a behavior label as positive samples and local feature maps with a background label as negative samples, and input the positive and negative samples into the time domain nomination network for training; input the local feature maps with multi-class labels into the behavior recognition network for training; and repeatedly execute the above steps for iterative training until the loss functions of the time domain nomination network and the behavior recognition network are smaller than a preset threshold, thereby determining the parameters of the feature extraction network, the time domain nomination network and the behavior recognition network. For a specific implementation, reference may be made to the description in the foregoing embodiment of the behavior recognition method, which is not repeated here.
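A self-contained toy version of this iterative joint training is sketched below with stand-in modules; all shapes, class counts, optimizer settings and the stopping threshold are illustrative assumptions and not the parameters of this embodiment.

```python
import torch
import torch.nn as nn

# Stand-ins for the feature extraction, time domain nomination and behavior
# recognition networks; real layers would follow the sketches given earlier.
backbone = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(2, 3, 3), stride=(2, 1, 1), padding=(0, 1, 1)),
    nn.ReLU(), nn.AdaptiveAvgPool3d(1), nn.Flatten())
nomination_head = nn.Linear(16, 2)           # two-class: behavior vs. background
recognition_head = nn.Linear(16, 21)         # multi-class: M = 20 behaviors + background

params = (list(backbone.parameters()) + list(nomination_head.parameters())
          + list(recognition_head.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)

clips = torch.randn(8, 3, 32, 112, 112)      # 8 windowed sample clips
binary_labels = torch.randint(0, 2, (8,))    # two-class labels
class_labels = torch.randint(0, 21, (8,))    # multi-class labels

for step in range(100):                      # iterate until the losses are small enough
    optimizer.zero_grad()
    feats = backbone(clips)
    loss = (nn.functional.cross_entropy(nomination_head(feats), binary_labels)
            + nn.functional.cross_entropy(recognition_head(feats), class_labels))
    loss.backward()
    optimizer.step()
    if loss.item() < 0.1:                    # preset threshold
        break
```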
An embodiment of the present invention further provides a server, as shown in fig. 4, which shows a schematic structural diagram of the server according to the embodiment of the present invention, specifically:
the server may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the server architecture shown in FIG. 4 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
Wherein:
the processor 401 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the server. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the server, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The server further includes a power supply 403 for supplying power to each component, and preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The server may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the server may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 401 in the server loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring a video to be detected, and adding a plurality of candidate windows for the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected;
generating a three-dimensional feature map of the video to be detected containing the candidate windows on a plurality of time domain scales based on a feature extraction network;
determining a time domain scale matched with the video segments in the candidate window, and acquiring the three-dimensional feature map corresponding to the determined time domain scale;
acquiring a local feature map corresponding to the video clip according to the acquired three-dimensional feature map;
and performing behavior recognition on the video clips in the candidate windows according to the local feature map and a preset behavior recognition network, and determining behavior categories corresponding to behavior features in the video clips.
In some embodiments, the feature extraction network is a three-dimensional convolutional neural network including a plurality of convolutional layers, and the processor 401 runs an application program stored in the memory 402, and may further implement the following functions:
inputting the video to be detected containing the candidate windows into the feature extraction network, and sequentially performing convolution operation on the convolution layers;
and acquiring a three-dimensional characteristic diagram output by a plurality of continuous convolutional layers as the three-dimensional characteristic diagram of the video to be detected on a plurality of time domain scales, wherein the deeper the layer number of the convolutional layers is, the larger the time domain scale is.
In some embodiments, the convolution layers in the feature extraction network comprise a two-dimensional spatial convolution kernel and a one-dimensional time-domain convolution kernel; the processor 401 runs the application program stored in the memory 402 and may also implement the following functions:
and when convolution operation is carried out on the convolution layer of the feature extraction network, carrying out convolution operation on the input three-dimensional feature map by sequentially using the two-dimensional space convolution kernel and the one-dimensional time domain convolution kernel.
In some embodiments, the feature extraction network is a three-dimensional convolutional neural network comprising a plurality of dilation convolutional layers; the processor 401 runs the application program stored in the memory 402 and may also implement the following functions:
inputting the video to be detected containing the candidate windows into the feature extraction network, and sequentially performing convolution operation in the expansion convolution layers according to corresponding expansion coefficients;
and acquiring a three-dimensional characteristic diagram output by a plurality of continuous expansion convolutional layers as the three-dimensional characteristic diagram of the video to be detected on a plurality of time domain scales.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
determining the number of video frame images contained in the video segment in the candidate window;
and according to the number, determining a time domain scale matched with the video segments, and acquiring a three-dimensional feature map corresponding to the determined time domain scale.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
selecting a video clip containing behavior characteristics from the video clips in the candidate windows as a nomination clip according to the local feature map and a preset time domain nomination network;
and determining the behavior category corresponding to the nomination fragment according to the local feature map of the nomination fragment and the behavior recognition network.
In some embodiments, the time domain nomination network comprises a first fully connected layer and a second fully connected layer; the processor 401 runs the application program stored in the memory 402 and may also implement the following functions:
detecting whether the video clips in the candidate windows contain behavior characteristics or not according to the local characteristic graph and the first full-connection layer;
taking a video clip containing behavior characteristics as the nomination clip;
and after selecting a video clip containing behavior characteristics from the video clips in the candidate windows as a nomination clip, performing boundary regression on the nomination clip in the second full-connected layer to obtain a first time boundary of the nomination clip.
In some embodiments, the behavior recognition network includes an interpolation layer and a fully connected layer; the processor 401 runs the application program stored in the memory 402 and may also implement the following functions:
inputting the local feature map of the nomination segment into the behavior recognition network, and adjusting the local feature map to a preset length in a time dimension at the interpolation layer;
and inputting the local feature map output by the interpolation layer into the full-connection layer, performing behavior recognition on the nomination segment, and determining a behavior category corresponding to the nomination segment.
In some embodiments, the fully-connected layers of the behavior recognition network include a third fully-connected layer and a fourth fully-connected layer; the processor 401 runs the application program stored in the memory 402 and may also implement the following functions:
inputting the local feature map output by the interpolation layer into the third full-connection layer, performing behavior recognition on the nomination segment, and determining a behavior category corresponding to the nomination segment;
and after determining the behavior category corresponding to the nomination segment, inputting the local feature map of the nomination segment to the fourth full-link layer, and performing boundary regression on the nomination segment based on the first time boundary to obtain a second time boundary of the nomination segment.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present invention provides a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the behavior recognition methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring a video to be detected, and adding a plurality of candidate windows for the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected;
generating a three-dimensional feature map of the video to be detected containing the candidate windows on a plurality of time domain scales based on a feature extraction network;
determining a time domain scale matched with the video segments in the candidate window, and acquiring the three-dimensional feature map corresponding to the determined time domain scale;
acquiring a local feature map corresponding to the video clip according to the acquired three-dimensional feature map;
and performing behavior recognition on the video clips in the candidate windows according to the local feature map and a preset behavior recognition network, and determining behavior categories corresponding to behavior features in the video clips.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps of any behavior recognition method provided by the embodiments of the present invention, the beneficial effects achievable by any such behavior recognition method can be achieved, as detailed in the foregoing embodiments and not repeated here.

The behavior recognition method, apparatus and storage medium provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method and its core ideas. Meanwhile, for those skilled in the art, there may be variations in the specific implementations and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (14)

1. A method of behavior recognition, comprising:
acquiring a video to be detected, and adding a plurality of candidate windows for the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected, and two adjacent candidate windows in the candidate windows are overlapped;
generating a three-dimensional feature map of the video to be detected containing the candidate windows on a plurality of time domain scales based on a feature extraction network, wherein the three-dimensional feature map is formed by overlapping continuous multi-frame two-dimensional feature maps;
determining a time domain scale matched with the video segments in the candidate window, and acquiring the three-dimensional feature map corresponding to the determined time domain scale;
acquiring a local feature map corresponding to the video clip according to the acquired three-dimensional feature map;
and performing behavior recognition on the video clips in the candidate windows according to the local feature map and a preset behavior recognition network, and determining behavior categories corresponding to behavior features in the video clips.
2. The behavior recognition method according to claim 1, wherein the feature extraction network is a three-dimensional convolutional neural network including a plurality of convolutional layers, and based on the feature extraction network, a three-dimensional feature map of the video to be detected including the plurality of candidate windows on a plurality of time-domain scales is generated, including:
inputting the video to be detected containing the candidate windows into the feature extraction network, and sequentially performing convolution operation on the convolution layers;
and acquiring a three-dimensional characteristic diagram output by a plurality of continuous convolutional layers as the three-dimensional characteristic diagram of the video to be detected on a plurality of time domain scales, wherein the deeper the layer number of the convolutional layers is, the larger the time domain scale is.
3. The behavior recognition method according to claim 2, wherein the convolution layers in the feature extraction network include a two-dimensional spatial convolution kernel and a one-dimensional time-domain convolution kernel; the method further comprises the following steps:
and when convolution operation is carried out on the convolution layer of the feature extraction network, carrying out convolution operation on the input three-dimensional feature map by sequentially using the two-dimensional space convolution kernel and the one-dimensional time domain convolution kernel.
4. The behavior recognition method according to claim 1, wherein the feature extraction network is a three-dimensional convolutional neural network including a plurality of swelling convolutional layers; generating a three-dimensional feature map of the video to be detected containing the candidate windows on a plurality of time domain scales based on a feature extraction network, wherein the three-dimensional feature map comprises:
inputting the video to be detected containing the candidate windows into the feature extraction network, and sequentially performing convolution operation in the expansion convolution layers according to corresponding expansion coefficients;
and acquiring a three-dimensional characteristic diagram output by a plurality of continuous expansion convolutional layers as the three-dimensional characteristic diagram of the video to be detected on a plurality of time domain scales.
5. The behavior recognition method according to claim 1, wherein determining a temporal scale that matches a video segment within the candidate window and obtaining the three-dimensional feature map corresponding to the determined temporal scale comprises:
determining the number of video frame images contained in the video segment in the candidate window;
and according to the number, determining a time domain scale matched with the video segments, and acquiring a three-dimensional feature map corresponding to the determined time domain scale.
6. The behavior recognition method according to any one of claims 1 to 5, wherein performing behavior recognition on the video segment in the candidate window according to the local feature map and a preset behavior recognition network, and determining a behavior category corresponding to a behavior feature in the video segment includes:
selecting a video clip containing behavior characteristics from the video clips in the candidate windows as a nomination clip according to the local feature map and a preset time domain nomination network;
and determining the behavior category corresponding to the nomination fragment according to the local feature map of the nomination fragment and the behavior recognition network.
7. The behavior recognition method of claim 6, wherein the time-domain nomination network comprises a first fully-connected layer and a second fully-connected layer; selecting a video clip containing behavior characteristics from the video clips in the candidate windows according to the local feature map and a preset time domain nomination network, wherein the video clip is used as a nomination clip and comprises the following steps:
detecting whether the video clips in the candidate windows contain behavior characteristics or not according to the local characteristic graph and the first full-connection layer;
taking a video clip containing behavior characteristics as the nomination clip;
after selecting a video segment containing behavior features from the video segments in the candidate windows as a nomination segment, the method further comprises:
and performing boundary regression on the nomination fragments on the second full-connection layer to obtain a first time boundary of the nomination fragments.
8. The behavior recognition method according to claim 7, wherein the behavior recognition network includes an interpolation layer and a full connection layer; determining the behavior category corresponding to the nomination fragment according to the local feature map of the nomination fragment and the behavior recognition network, wherein the determining comprises the following steps:
inputting the local feature map of the nomination segment into the behavior recognition network, and adjusting the local feature map to a preset length in a time dimension at the interpolation layer;
and inputting the local feature map output by the interpolation layer into the full-connection layer, performing behavior recognition on the nomination segment, and determining a behavior category corresponding to the nomination segment.
9. The behavior recognition method of claim 8, wherein the fully-connected layers of the behavior recognition network comprise a third fully-connected layer and a fourth fully-connected layer; inputting the local feature map output by the interpolation layer into the full-connection layer, performing behavior recognition on the nomination segment, and determining a behavior category corresponding to the nomination segment, wherein the method comprises the following steps:
inputting the local feature map output by the interpolation layer into the third full-connection layer, performing behavior recognition on the nomination segment, and determining a behavior category corresponding to the nomination segment;
after determining the behavior category corresponding to the nomination fragment, the method further comprises the following steps:
and inputting the local feature map of the nomination segment to the fourth full-connection layer, and performing boundary regression on the nomination segment to obtain a second time boundary of the nomination segment.
10. A behavior recognition apparatus, comprising:
the video acquisition unit is used for acquiring a video to be detected;
the video windowing unit is used for adding a plurality of candidate windows to the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected, and two adjacent candidate windows in the candidate windows are overlapped;
the characteristic acquisition unit is used for generating a three-dimensional characteristic diagram of the video to be detected containing the candidate windows on a plurality of time domain scales based on a characteristic extraction network, and the three-dimensional characteristic diagram is formed by overlapping continuous multi-frame two-dimensional characteristic diagrams;
the scale matching unit is used for determining a time domain scale matched with the video segments in the candidate window and acquiring the three-dimensional feature map corresponding to the determined time domain scale;
the feature selection unit is used for acquiring a local feature map corresponding to the video clip according to the acquired three-dimensional feature map;
and the behavior identification unit is used for performing behavior identification on the video clips in the candidate windows according to the local feature map and a preset behavior identification network and determining behavior categories corresponding to the behavior features in the video clips.
11. The behavior recognition device according to claim 10, wherein the feature extraction network is a three-dimensional convolutional neural network including a plurality of convolutional layers, the feature acquisition unit includes:
a convolution operation subunit, configured to input the video to be detected including the multiple candidate windows into the feature extraction network, and sequentially perform convolution operation on the multiple convolution layers;
and the characteristic obtaining subunit is used for obtaining a three-dimensional characteristic diagram output by the last continuous plurality of convolutional layers as the three-dimensional characteristic diagram of the video to be detected on a plurality of time domain scales, wherein the deeper the layer number of the convolutional layers is, the larger the time domain scale is.
12. The behavior recognition apparatus according to claim 10, wherein the scale matching unit includes:
a number determining subunit, configured to determine the number of video frame images included in the video segment in the candidate window;
and the scale determining subunit is used for determining the time domain scale matched with the video segments according to the number and acquiring the three-dimensional feature map corresponding to the determined time domain scale.
13. The behavior recognition apparatus according to any one of claims 10 to 12, characterized in that the behavior recognition unit includes:
a segment screening subunit, configured to select, according to the local feature map and a preset time domain nomination network, a video segment containing behavior features from the video segments in the multiple candidate windows, as a nomination segment;
and the behavior identification subunit is used for determining the behavior category corresponding to the nomination fragment according to the local feature map of the nomination fragment and the behavior identification network.
14. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the behavior recognition method according to any one of claims 1 to 9.
CN201910012006.3A 2019-01-07 2019-01-07 Behavior recognition method and device and storage medium Active CN109697434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910012006.3A CN109697434B (en) 2019-01-07 2019-01-07 Behavior recognition method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910012006.3A CN109697434B (en) 2019-01-07 2019-01-07 Behavior recognition method and device and storage medium

Publications (2)

Publication Number Publication Date
CN109697434A CN109697434A (en) 2019-04-30
CN109697434B true CN109697434B (en) 2021-01-08

Family

ID=66233158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910012006.3A Active CN109697434B (en) 2019-01-07 2019-01-07 Behavior recognition method and device and storage medium

Country Status (1)

Country Link
CN (1) CN109697434B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210750A (en) * 2019-05-29 2019-09-06 北京天正聚合科技有限公司 A kind of method, apparatus, electronic equipment and storage medium identifying Shopping Guide's business
CN110222780B (en) * 2019-06-12 2021-06-11 北京百度网讯科技有限公司 Object detection method, device, equipment and storage medium
CN110276332B (en) * 2019-06-28 2021-12-24 北京奇艺世纪科技有限公司 Video feature processing method and device
CN110349373B (en) * 2019-07-15 2021-04-09 滁州学院 Behavior recognition method and device based on binary sensor and storage medium
CN110502995B (en) * 2019-07-19 2023-03-14 南昌大学 Driver yawning detection method based on fine facial action recognition
CN110602526B (en) * 2019-09-11 2021-09-21 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN110598669A (en) * 2019-09-20 2019-12-20 郑州大学 Method and system for detecting crowd density in complex scene
CN110796069B (en) * 2019-10-28 2021-02-05 广州云从博衍智能科技有限公司 Behavior detection method, system, equipment and machine readable medium
CN111353428B (en) * 2020-02-28 2022-05-24 北京市商汤科技开发有限公司 Action information identification method and device, electronic equipment and storage medium
CN111368143A (en) * 2020-03-13 2020-07-03 北京奇艺世纪科技有限公司 Video similarity retrieval method and device, electronic equipment and storage medium
CN113453040B (en) * 2020-03-26 2023-03-10 华为技术有限公司 Short video generation method and device, related equipment and medium
CN111178344B (en) * 2020-04-15 2020-07-17 中国人民解放军国防科技大学 Multi-scale time sequence behavior identification method
CN111881818B (en) * 2020-07-27 2022-07-22 复旦大学 Medical action fine-grained recognition device and computer-readable storage medium
CN113015022A (en) * 2021-02-05 2021-06-22 深圳市优必选科技股份有限公司 Behavior recognition method and device, terminal equipment and computer readable storage medium
CN113177442B (en) * 2021-04-12 2024-01-30 广东省科学院智能制造研究所 Human behavior detection method and device based on edge calculation
CN113033500B (en) * 2021-05-06 2021-12-03 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN114663980B (en) * 2022-04-01 2023-04-18 北京百度网讯科技有限公司 Behavior recognition method, and deep learning model training method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160310A (en) * 2015-08-25 2015-12-16 西安电子科技大学 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN106897714A (en) * 2017-03-23 2017-06-27 北京大学深圳研究生院 A kind of video actions detection method based on convolutional neural networks
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Feature Pyramid Networks for Object Detection;Tsung-Yi Lin等;《https://arxiv.org/abs/1612.03144》;20170419;第1-10页 *
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks;Zhaofan Qiu等;《https://arxiv.org/abs/1711.10305》;20171128;第1-9页 *
Rethinking the Faster R-CNN Architecture for Temporal Action Localization;Yu-Wei Chao等;《https://arxiv.org/abs/1804.07667》;20180420;第1-13页 *

Also Published As

Publication number Publication date
CN109697434A (en) 2019-04-30

Similar Documents

Publication Publication Date Title
CN109697434B (en) Behavior recognition method and device and storage medium
CN112506342B (en) Man-machine interaction method and system based on dynamic gesture recognition
CN108805083B (en) Single-stage video behavior detection method
US9111375B2 (en) Evaluation of three-dimensional scenes using two-dimensional representations
CN109993102B (en) Similar face retrieval method, device and storage medium
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN111161311A (en) Visual multi-target tracking method and device based on deep learning
CN110276406B (en) Expression classification method, apparatus, computer device and storage medium
CN110929622A (en) Video classification method, model training method, device, equipment and storage medium
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN111079658B (en) Multi-target continuous behavior analysis method, system and device based on video
CN107169463A (en) Method for detecting human face, device, computer equipment and storage medium
CN109063626B (en) Dynamic face recognition method and device
CN109934216B (en) Image processing method, device and computer readable storage medium
CN112613349B (en) Time sequence action detection method and device based on deep hybrid convolutional neural network
CN110807362A (en) Image detection method and device and computer readable storage medium
CN113297956B (en) Gesture recognition method and system based on vision
CN112861917A (en) Weak supervision target detection method based on image attribute learning
Xiao et al. Self-explanatory deep salient object detection
CN117154256A (en) Electrochemical repair method for lithium battery
KR102230559B1 (en) Method and Apparatus for Creating Labeling Model with Data Programming
CN108596068B (en) Method and device for recognizing actions
CN111582382A (en) State recognition method and device and electronic equipment
EP3588441A1 (en) Imagification of multivariate data sequences
CN109165587A (en) intelligent image information extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant