CN111488805A - Video behavior identification method based on saliency feature extraction - Google Patents
- Publication number
- CN111488805A (application CN202010210957.4A)
- Authority
- CN
- China
- Prior art keywords
- feature
- image
- edge
- video
- salient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a video behavior identification method based on salient feature extraction, which comprises the steps of: S1, acquiring a video to be identified and converting it into images; S2, extracting the salient features of the images and removing the background information; S3, performing behavior recognition on the images; S4, outputting the recognized abnormal behavior. The method combines the salient object detection method with the behavior recognition method for the first time: on one hand, it extracts the region of interest in the video, retaining the main features while masking the background features; on the other hand, it reduces the amount of computation. Both aspects benefit the detection and recognition of abnormal behaviors.
Description
Technical Field
The invention relates to the technical field of intelligent video monitoring, in particular to a video behavior identification method based on saliency feature extraction.
Background
With the development of the economy and the improvement of legal systems, people pay more attention to preventing criminal behavior that threatens the safety of lives and property. Video monitoring systems have begun to be applied in people's lives, for example for theft prevention in daily life and terrorism prevention in crowded places.
Current behavior recognition methods mainly fall into two classes: methods based on traditional feature extraction and methods based on deep learning. Traditional methods classify behaviors by extracting features such as the Histogram of Optical Flow (HOF), Histogram of Oriented Gradients (HOG) and Motion Boundary Histogram (MBH) from the video. However, their recognition capability is easily affected by illumination intensity and background information; the extracted features have certain limitations and do not provide good recognition capability.
With the development of the times, researchers proposed deep learning, and since tasks in research fields such as vision and hearing can be completed effectively with it, the market has begun to apply deep-learning-based methods in real life; video surveillance is one such application. On the basis of deep learning theory, researchers build a network model and train it on a labeled video data set to obtain a model with recognition capability. Such a model has good generalization and can classify untrained video data. In current methods, however, the video is input into the neural network directly: the video data is not processed, or only exposure and distortion processing is performed, and the feature information carried by the video is not highlighted, which is unfavorable for the detection of abnormal behavior. In a real, complex background this may make abnormal behavior unrecognizable. Moreover, training existing deep-learning network models requires large data sets and high-performance servers, which greatly limits practical video identification work.
In summary, there is a need in the industry to develop a method or system that can highlight the characteristic information carried by the video, reduce the amount of computation, and facilitate the detection and identification of abnormal behavior.
Disclosure of Invention
Aiming at the defect in the prior art that the feature information carried by the video is not highlighted, the invention designs a video behavior identification method based on salient feature extraction.
The specific scheme of the application is as follows:
a video behavior identification method based on salient feature extraction comprises the following steps:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
Preferably, step S2 includes:
s21, inputting the image into a Res2Net backbone network; as the image passes through each layer of the Res2Net backbone, the image features are divided into saliency features and edge features, and the backbone finally outputs saliency feature S0 and edge feature E0;
s22, alternately training the saliency feature S0 and the edge feature E0 in the CRU unit as supervisory signals to generate the saliency feature S and the edge feature E;
s23, in the asymmetric crossing module, a loss is computed between the salient feature S and the salient label Label-S and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the salient feature S, and a loss is computed between it and the edge of the salient label.
Preferably, step 21 includes: the shallow image features first pass through a 1 × 1 convolution kernel, then through a set of convolution kernels that divide the feature map with n channels into 4 groups; the output features of the previous group are sent, together with another set of input feature maps, to the next group of 3 × 3 convolution kernels, and this process is repeated twice until all input feature maps are processed.
Preferably, the CRU units are 4 CRU structural units stacked end to end, and the formula of the superposition operation of the CRU units in step S22 is defined as:
where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the corresponding saliency feature, and ⊙ denotes the dot (element-wise) product.
Preferably, step S23 is implemented as the following formula:
G_x(x, y) = f(x, y) * g_x(x, y) (5)
G_y(x, y) = f(x, y) * g_y(x, y) (6)
G = F(G_x(x, y), G_y(x, y)) (7)
the function F in formula (7) takes into account both the vertical-direction and horizontal-direction edge features, fusing the two according to formula (8) or (9):
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2) (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y) (9)
after the binary image passes through formula (7), an edge feature map is obtained. The loss function uses a binary cross-entropy loss:
Loss = −(1/N) Σ_i [ p(x_i) log q(x_i) + (1 − p(x_i)) log(1 − q(x_i)) ] (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
Preferably, the step S3 includes performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure, whose network framework comprises 8 convolutional layers composed of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully-connected layers; ReLU, Batch Normalization and Dropout techniques can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations to make the model converge; the two fully-connected layers provide 4096 outputs, and classification is finally realized by the softmax classification layer.
Compared with the prior art, the invention has the following beneficial effects:
the method combines the salient object detection method and the behavior recognition method for the first time, on one hand, the interesting region in the video is extracted, the main characteristics are reserved, the background characteristics are shielded, on the other hand, the calculation amount is reduced, and the method can be beneficial to the detection and recognition of abnormal behaviors.
Drawings
FIG. 1 is a schematic flow chart diagram of a video behavior recognition method based on salient feature extraction according to an embodiment;
FIG. 2 is a network architecture diagram of significance detection, according to an embodiment.
Fig. 3(a) is a unit structure diagram of a Res2Net backbone network according to an embodiment.
Fig. 3(b) is a unit structure diagram of a CRU unit according to an embodiment.
FIG. 4 is a diagram of a 3D convolution of an embodiment.
FIG. 5 is an exploded view of the R (2+1) D algorithm according to an embodiment.
Fig. 6 is a network framework diagram of a C3D network architecture of an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The Visual Attention Mechanism (VA) refers to the way a human, when facing a scene, automatically processes regions of interest and selectively ignores regions of no interest; the regions of interest are called salient regions, and extracting the salient region of a specific target is called salient object detection. In the face of complex background information in a video or image, extracting the salient features is necessary: the main features can be retained and the background features masked. In abnormal behavior detection, an abnormal part of an image or video can first be detected with a saliency detection method, and the abnormal behavior then identified. The video behavior identification method based on saliency feature extraction in this scheme is built on a network framework composed of a Res2Net backbone network, CRU units and an asymmetric cross module, and specifically comprises the following steps:
referring to fig. 1, a video behavior recognition method based on salient feature extraction includes:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
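To make step S1 concrete, the following is a minimal sketch of how a video might be turned into a fixed-length image sequence (for example the 16-frame clips typically consumed by C3D-style networks). The helper and its name are hypothetical illustrations, not part of the patent, which only states that the video is converted into images:

```python
def sample_frame_indices(n_frames_total, n_frames_clip=16):
    """Uniformly spread `n_frames_clip` frame indices over a video of
    `n_frames_total` frames (hypothetical helper for step S1)."""
    if n_frames_total <= 0:
        raise ValueError("video has no frames")
    step = n_frames_total / n_frames_clip
    # clamp so short videos simply repeat frames instead of overrunning
    return [min(int(i * step), n_frames_total - 1) for i in range(n_frames_clip)]
```

The selected indices would then be decoded into images and passed to the saliency-extraction stage of step S2.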
In the present embodiment, referring to fig. 2, step S2 includes:
s21, inputting the image into the Res2Net backbone network; as the image passes through each layer of the backbone, the image features are divided into saliency features and edge features, and the backbone finally outputs saliency feature S0 and edge feature E0. The Res2Net backbone comprises 4 layers: layer1, layer2, layer3 and layer4. From a practical point of view, different objects may appear in the picture at different sizes; for example, a table or a computer appears at a different size depending on its position. Moreover, the object to be detected may carry more information than the region it occupies. Introducing the Res2Net backbone gives each feature layer multiple receptive fields at different scales, simulating the human brain's perception of salient objects at multiple scales and orientations in real life, and avoiding the failure of the algorithm to detect multiple abnormal behaviors present in a video image.
Further, as shown in FIG. 3(a), step 21 includes the shallow image feature first passing through a convolution kernel of 1 × 1, then passing through a set of convolution kernels that divide the feature map with n channels into 4 groups, the output features of the previous group being sent to the next group of convolution kernels of 3 × 3 along with another set of input feature maps, this process being repeated twice until all input feature maps have been processed, finally, the feature maps from the 4 groups are concatenated, and the information is fused together through a convolution kernel of 1 × 1 to obtain the multi-scale feature.
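The grouped multi-scale processing described above can be sketched as a PyTorch module. This is an illustrative reconstruction of a Res2Net-style unit under the 4-group assumption, not the patent's exact network; the channel widths and the reducing/fusing 1 × 1 kernels are assumptions:

```python
import torch
import torch.nn as nn

class Res2Block(nn.Module):
    """Illustrative Res2Net-style unit (a sketch, not the patent's network):
    a 1x1 reduction, a hierarchy of 3x3 convs over 4 channel groups, and a
    1x1 fusion that merges the multi-scale features."""

    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.width = channels // groups
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)
        # one 3x3 kernel set per group except the first, which passes through
        self.convs = nn.ModuleList(
            nn.Conv2d(self.width, self.width, kernel_size=3, padding=1)
            for _ in range(groups - 1))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        xs = torch.split(x, self.width, dim=1)
        out, y = [xs[0]], None
        for i, conv in enumerate(self.convs):
            # the previous group's output joins the next group's input maps
            y = conv(xs[i + 1] if y is None else xs[i + 1] + y)
            out.append(y)
        return self.fuse(torch.cat(out, dim=1))
```

Because each later group sees the output of the previous one, the effective receptive field grows across groups, which is the multi-scale property the text attributes to Res2Net.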
S22, alternately training the saliency feature S0 and the edge feature E0 in the CRU unit as supervisory signals to generate the saliency feature S and the edge feature E. Considering the logical relationship between salient object detection and edge detection, a CRU structure that fuses saliency features and edge features may be employed; in this scheme the CRU unit is combined with the Res2Net network to create more discriminative features. When an input image undergoes multi-stage feature extraction by a CNN (here meaning a general convolutional neural network, including the CRU structure), the deeper the network, the more the dispersion of image features is suppressed. Since low-level features contain much background and disperse attention over spatial detail, while high-level features concentrate on salient target regions, 4 CRU structural units (CRU1, CRU2, CRU3 and CRU4) are stacked end to end, as shown in fig. 3(b), and the formula of the superposition operation of the CRU units in step S22 is defined as:
where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the corresponding saliency feature, and ⊙ denotes the dot (element-wise) product.
S23, in the asymmetric crossing module, a loss is computed between the salient feature S and the salient label Label-S and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the salient feature S, and a loss is computed between it and the edge of the salient label.
For the edge extraction of the binary salient image, a traditional edge detection operator such as Sobel or LoG is adopted; the operator consists of two convolution kernels, g_x(x, y) and g_y(x, y), which are convolved with the original image f(x, y).
The operators can be divided into templates in the vertical and horizontal directions: the former, G_x(x, y), detects horizontal edges in the image, while the latter, G_y(x, y), detects vertical edges. In practical application, each pixel in the image is convolved with these two kernels, and step S23 is specifically implemented as the following formulas:
G_x(x, y) = f(x, y) * g_x(x, y) (5)
G_y(x, y) = f(x, y) * g_y(x, y) (6)
G = F(G_x(x, y), G_y(x, y)) (7)
the function F in formula (7) takes into account both the vertical-direction and horizontal-direction edge features, fusing the two according to formula (8) or (9):
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2) (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y) (9)
after the binary image passes through formula (7), an edge feature map is obtained. The loss function uses a binary cross-entropy loss:
Loss = −(1/N) Σ_i [ p(x_i) log q(x_i) + (1 − p(x_i)) log(1 − q(x_i)) ] (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
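Formulas (5)-(7) and the binary cross-entropy loss can be illustrated with a small NumPy sketch. The Sobel kernels and the absolute-value fusion below are illustrative choices (the raw sum of formula (9) lets opposite-sign gradients cancel, so the sketch takes absolute values); none of the names are taken from the patent:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 'valid' 2-D sliding-window correlation (NumPy only),
    standing in for the convolution of formulas (5) and (6)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernels standing in for g_x and g_y
g_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
g_y = g_x.T

def edge_map(img, fuse="sum"):
    """Edge feature map G = F(G_x, G_y), formulas (7)-(9)."""
    Gx = conv2d_valid(img, g_x)
    Gy = conv2d_valid(img, g_y)
    if fuse == "sum":                   # additive fusion, cf. formula (9)
        return np.abs(Gx) + np.abs(Gy)
    return np.sqrt(Gx ** 2 + Gy ** 2)   # gradient magnitude, cf. formula (8)

def bce_loss(p, q, eps=1e-7):
    """Binary cross-entropy between label map p and prediction q."""
    q = np.clip(q, eps, 1.0 - eps)
    return float(-np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q)))
```

A flat binary image yields an all-zero edge map, while a step image yields non-zero responses at the step, matching the intent of the edge supervision described above.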
In the training process of the saliency feature extraction, the experimental platform can adopt the Ubuntu 14.04.3 LTS operating system with a 1080Ti graphics card, configured with Python 3.6 and PyTorch (0.4.0). The SGD algorithm is adopted as the optimization function, with 30 iterations and a learning rate of 0.002; a learning-rate schedule is applied after 20 iterations, multiplying the learning rate by 0.1 to smooth the convergence of the model. The batch size is set to 8, and the DUTS-TR data set is used as the training set. The test video is the output of real-time video monitoring equipment; after saliency feature extraction is performed on the monitoring video, the background information is removed and only the behavior information of the salient region is retained.
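The step-decay schedule described here (and again for the recognition network below, where the base learning rate is 0.003) amounts to multiplying the learning rate by 0.1 after every 20 iterations. A minimal sketch, with the function name assumed:

```python
def step_decay_lr(iteration, base_lr, step=20, gamma=0.1):
    """Step-decay schedule from the text: the learning rate is multiplied
    by `gamma` (0.1) after every `step` (20) iterations."""
    return base_lr * gamma ** (iteration // step)
```

For example, with base_lr = 0.002 the rate stays at 0.002 through iteration 19 and drops to 0.0002 at iteration 20, which is the behavior the text describes.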
In the embodiment, as shown in fig. 6, step S3 includes performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure, whose network framework comprises 8 convolutional layers composed of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully-connected layers; ReLU, Batch Normalization and Dropout techniques can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations to make the model converge; the two fully-connected layers provide 4096 outputs, and classification is finally realized by the softmax classification layer.
At present, many human behavior databases are available; video data sets such as Kinetics-400, Kinetics-600 and Sports-1M have large data volumes. For abnormal behavior, the ViF (Violent Flows) video database of crowd violence may be employed. These data sets can be used as training sets for the network parameters, so that the behavior recognition network can accurately recognize behaviors; the videos in the training set need to be sufficient in number and contain various behavior information.
With the development of deep learning, researchers have proposed a number of deep-learning-based behavior recognition algorithms for extracting spatio-temporal features from video: those based on a two-stream network structure, and those based on a C3D network structure.
In behavior recognition based on the two-stream network structure, the two-stream network, a milestone of deep learning for video human-behavior analysis, adopts 5 convolutional layers and two fully-connected layers, extending static-image recognition to video data sets such as UCF101 and HMDB. In this network, the spatial stream carries behavior information with the RGB image as input, and the temporal stream carries timing information with the optical-flow image as input. With the proposal of the VGG network structure, the VGG16 network was taken as the feature extraction network, and the fusion of the two-stream network in time and space was considered. However, with the two-stream structure the optical-flow images must be generated first, which consumes much time and harms real-time performance, so a behavior recognition algorithm based on the C3D network structure can be adopted instead.
The 3D convolution is formed by extending the 2D convolution along the time dimension, and its frame diagram is shown in fig. 4: the convolution kernel is correspondingly expanded to 3 dimensions, the frame sequence passes through the kernel in order, three consecutive frames of images pass through a convolution kernel of depth 3, and the resulting feature values are finally mapped onto the feature map. After the feature map is connected with the frame sequence, the motion features of the person in the video can be obtained, as in formulas (11) and (12); the core 3D-convolution formula can be written as
v_ij^(x,y,z) = b_ij + Σ_m Σ_{r=0}^{R-1} Σ_{p=0}^{P-1} Σ_{q=0}^{Q-1} w_ijm^(r,p,q) v_(i-1)m^((x+r),(y+p),(z+q)) (11)
where v_ij^(x,y,z) represents the feature value at pixel point (x, y, z) of the j-th feature map in the i-th layer after 3D convolution, b_ij is the bias, m indexes the feature maps of the previous layer, R, P and Q correspond respectively to the depth (time), length and width of the 3D convolution kernel, and w is the weight of the convolution kernel connected to the feature map.
The introduction of 3D convolution makes it possible to use the temporal relation between single-frame images; that is, temporal features can be extracted while spatial features are extracted, so there is no need to carry the time stream with optical flow. However, it brings problems of computation cost and model storage. A network structure based on 3D convolution must therefore be designed so that it effectively extracts spatio-temporal features while keeping the computation cost and model storage as low as possible. The R(2+1)D algorithm replaces a 3 × 3 × 3 convolution kernel with a 1 × 3 × 3 spatial convolution kernel and a 3 × 1 × 1 temporal convolution kernel; the decomposition is shown schematically in fig. 5. As can be understood from the way a convolution kernel computes, the 1 × 3 × 3 kernel operates on a two-dimensional image at a single time, while the 3 × 1 × 1 kernel operates only in the time dimension.
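The R(2+1)D factorisation described above can be sketched in PyTorch as follows. This is an illustrative module under the stated kernel shapes; the class name and the intermediate channel width `c_mid` are assumptions rather than details from the patent:

```python
import torch
import torch.nn as nn

class R2Plus1dConv(nn.Module):
    """Factorised 3-D convolution: a 1x3x3 spatial conv followed by a
    3x1x1 temporal conv, replacing a full 3x3x3 kernel (sketch)."""

    def __init__(self, c_in, c_out, c_mid=None):
        super().__init__()
        c_mid = c_mid or c_out
        self.spatial = nn.Conv3d(c_in, c_mid, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))   # acts within each frame
        self.temporal = nn.Conv3d(c_mid, c_out, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))  # acts along time only
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.spatial(x)))
```

The nonlinearity between the two convolutions is one of the stated advantages of the decomposition: it doubles the number of nonlinear layers for roughly the same parameter budget as a single 3 × 3 × 3 kernel.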
In summary, the real scene is captured by the camera and transmitted to the monitoring system; the salient features of the video are then extracted, the background information in the video is removed, behaviors are identified by the behavior recognition network, and finally the abnormal behaviors are output.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (6)
1. A video behavior identification method based on salient feature extraction is characterized by comprising the following steps:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
2. The video behavior recognition method based on salient feature extraction according to claim 1, wherein the step S2 comprises:
s21, inputting the image into a Res2Net backbone network; as the image passes through each layer of the Res2Net backbone, the image features are divided into saliency features and edge features, and the backbone finally outputs saliency feature S0 and edge feature E0;
s22, alternately training the saliency feature S0 and the edge feature E0 in the CRU unit as supervisory signals to generate the saliency feature S and the edge feature E;
s23, in the asymmetric crossing module, a loss is computed between the salient feature S and the salient label Label-S and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the salient feature S, and a loss is computed between it and the edge of the salient label.
3. The video behavior recognition method based on salient feature extraction according to claim 2, wherein the step 21 comprises:
the shallow image features first pass through a convolution kernel of 1 × 1, then pass through a set of convolution kernels that divide the feature map with n channels into 4 groups, the output features of the previous group are sent to the next group of convolution kernels of 3 × 3 along with another group of input feature maps, this process is repeated twice until all input feature maps are processed, finally, the feature maps from the 4 groups are concatenated, and the information is fused together through the convolution kernel of 1 × 1, resulting in a multi-scale feature.
4. The video behavior recognition method based on salient feature extraction according to claim 2, wherein the CRU units are 4 CRU structural units stacked end to end, and the formula of the superposition operation of the CRU units in step S22 is defined as:
where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the corresponding saliency feature, and ⊙ denotes the dot (element-wise) product.
5. The video behavior recognition method based on salient feature extraction according to claim 2, wherein the step S23 is implemented as the following formula:
G_x(x, y) = f(x, y) * g_x(x, y) (5)
G_y(x, y) = f(x, y) * g_y(x, y) (6)
G = F(G_x(x, y), G_y(x, y)) (7)
the function F in formula (7) takes into account both the vertical-direction and horizontal-direction edge features, fusing the two according to formula (8) or (9):
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2) (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y) (9)
after the binary image passes through formula (7), an edge feature map is obtained; the loss function uses a binary cross-entropy loss:
L=-Σi[p(xi)log q(xi)+(1-p(xi))log(1-q(xi))]
wherein p(xi) is the true value and q(xi) is the estimated value.
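A short NumPy sketch of formulas (5)-(9) and the binary cross-entropy loss follows. The Sobel kernels used for gx and gy are an assumption for illustration, since the claim does not specify the kernels:

```python
import numpy as np

def conv2d(img, k):
    """'Same' 2D convolution with zero padding (the * in formulas (5)-(6))."""
    h, w = img.shape
    p = np.pad(img, 1)
    kf = k[::-1, ::-1]  # flip the kernel: true convolution, not correlation
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i+3, j:j+3] * kf)
    return out

# Assumed Sobel kernels for the vertical and horizontal derivatives
gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
gy = gx.T

f = np.zeros((8, 8)); f[:, 4:] = 1.0      # binary image with a vertical edge
Gx, Gy = conv2d(f, gx), conv2d(f, gy)     # formulas (5) and (6)

G_mag = np.hypot(Gx, Gy)                  # formula (8): gradient magnitude
G_sum = Gx + Gy                           # formula (9): additive fusion

def bce(p, q, eps=1e-7):
    """Binary cross-entropy between label map p and prediction q."""
    q = np.clip(q, eps, 1 - eps)
    return -np.mean(p * np.log(q) + (1 - p) * np.log(1 - q))

print(bce(f, f))  # near-zero loss when the prediction matches the label
```

On this test image only Gx responds (the edge is vertical), while Gy stays zero away from the borders; either fusion variant then recovers the edge map.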
6. The video behavior recognition method based on salient feature extraction according to claim 1, wherein the step S3 comprises performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure, the network framework of which comprises 8 convolutional layers consisting of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully-connected layers; ReLU, BatchNorm and Dropout techniques can be added to optimize the network structure;
the model is trained with stochastic gradient descent at an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; each of the two fully-connected layers provides 4096 outputs, and classification is finally performed by the softmax classification layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010210957.4A CN111488805B (en) | 2020-03-24 | 2020-03-24 | Video behavior recognition method based on salient feature extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111488805A true CN111488805A (en) | 2020-08-04 |
CN111488805B CN111488805B (en) | 2023-04-25 |
Family
ID=71794420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010210957.4A Active CN111488805B (en) | 2020-03-24 | 2020-03-24 | Video behavior recognition method based on salient feature extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111488805B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
CN110852295A (en) * | 2019-10-15 | 2020-02-28 | 深圳龙岗智能视听研究院 | Video behavior identification method based on multitask supervised learning |
Non-Patent Citations (1)
Title |
---|
王晓芳;齐春;: "一种运用显著性检测的行为识别方法" * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931793A (en) * | 2020-08-17 | 2020-11-13 | 湖南城市学院 | Saliency target extraction method and system |
CN111931793B (en) * | 2020-08-17 | 2024-04-12 | 湖南城市学院 | Method and system for extracting saliency target |
CN113343760A (en) * | 2021-04-29 | 2021-09-03 | 暖屋信息科技(苏州)有限公司 | Human behavior recognition method based on multi-scale characteristic neural network |
CN113205051A (en) * | 2021-05-10 | 2021-08-03 | 中国科学院空天信息创新研究院 | Oil storage tank extraction method based on high spatial resolution remote sensing image |
CN113379643A (en) * | 2021-06-29 | 2021-09-10 | 西安理工大学 | Image denoising method based on NSST domain and Res2Net network |
CN113379643B (en) * | 2021-06-29 | 2024-05-28 | 西安理工大学 | Image denoising method based on NSST domain and Res2Net network |
CN113537375A (en) * | 2021-07-26 | 2021-10-22 | 深圳大学 | Diabetic retinopathy grading method based on multi-scale cascade |
Also Published As
Publication number | Publication date |
---|---|
CN111488805B (en) | 2023-04-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111488805B (en) | Video behavior recognition method based on salient feature extraction | |
CN113158723B (en) | End-to-end video motion detection positioning system | |
CN111444881A (en) | Fake face video detection method and device | |
US20200012923A1 (en) | Computer device for training a deep neural network | |
CN108805002B (en) | Monitoring video abnormal event detection method based on deep learning and dynamic clustering | |
CN110929622A (en) | Video classification method, model training method, device, equipment and storage medium | |
CN107590432A (en) | A kind of gesture identification method based on circulating three-dimensional convolutional neural networks | |
CN112507990A (en) | Video time-space feature learning and extracting method, device, equipment and storage medium | |
Gunawan et al. | Sign language recognition using modified convolutional neural network model | |
CN106682628B (en) | Face attribute classification method based on multilayer depth feature information | |
Chenarlogh et al. | A multi-view human action recognition system in limited data case using multi-stream CNN | |
KR102309111B1 (en) | Ststem and method for detecting abnomalous behavior based deep learning | |
CN110532959B (en) | Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network | |
CN110633624A (en) | Machine vision human body abnormal behavior identification method based on multi-feature fusion | |
CN111160356A (en) | Image segmentation and classification method and device | |
CN112183240A (en) | Double-current convolution behavior identification method based on 3D time stream and parallel space stream | |
WO2022183805A1 (en) | Video classification method, apparatus, and device | |
CN113936175A (en) | Method and system for identifying events in video | |
Sabater et al. | Event Transformer+. A multi-purpose solution for efficient event data processing | |
CN113255464A (en) | Airplane action recognition method and system | |
Anees et al. | Deep learning framework for density estimation of crowd videos | |
Abdullah et al. | Context aware crowd tracking and anomaly detection via deep learning and social force model | |
CN114120076B (en) | Cross-view video gait recognition method based on gait motion estimation | |
WO2023164370A1 (en) | Method and system for crowd counting | |
Ragesh et al. | Fast R-CNN based Masked Face Recognition for Access Control System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||