CN111488805B - Video behavior recognition method based on salient feature extraction - Google Patents
- Publication number
- CN111488805B (application CN202010210957.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- edge
- feature
- features
- salient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06F18/213 — Pattern recognition; feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06N3/045 — Neural networks; combinations of networks
- Y02T10/40 — Climate change mitigation technologies related to transportation; engine management systems
Abstract
The invention relates to a video behavior recognition method based on salient feature extraction, which comprises the following steps: S1, obtaining the video to be recognized and converting it into images; S2, extracting the salient features of the images and removing the background information in the images; S3, performing behavior recognition on the images; S4, outputting the recognized abnormal behavior. The method combines, for the first time, a salient object detection method with a behavior recognition method: on the one hand, the region of interest in the video is extracted, the main features are retained and the background features are masked; on the other hand, the amount of computation is reduced.
Description
Technical Field
The invention relates to the technical field of intelligent video surveillance, and in particular to a video behavior recognition method based on salient feature extraction.
Background
With the development of the economy and the improvement of legal systems, people pay more attention to preventing criminal behavior that threatens life and property. Video surveillance systems are now being applied in people's daily lives, for example for theft protection and for preventing terrorist incidents in crowded places.
At present, behavior recognition methods fall into two main categories: methods based on traditional feature extraction and methods based on deep learning. Traditional methods classify behaviors by extracting hand-crafted features from the video, such as Histograms of Optical Flow (HOF), Histograms of Oriented Gradients (HOG) and Motion Boundary Histograms (MBH). However, such features are easily affected by illumination intensity and background information; they have inherent limitations and do not provide good recognition capability.
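As a concrete illustration of the traditional route, the sketch below computes a toy HOG-style descriptor: a histogram of gradient orientations, weighted by gradient magnitude, over a grayscale patch. It is a minimal stand-in for the HOG/HOF/MBH descriptors cited above, not the exact features from the literature (no cell/block structure; 8 unsigned orientation bins are assumed here).

```python
import numpy as np

def gradient_histogram(patch, n_bins=8):
    """Toy HOG-style descriptor: histogram of gradient orientations,
    weighted by gradient magnitude, over a single grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))      # derivatives along rows, cols
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi), as in classic HOG.
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bin_idx.ravel(), magnitude.ravel())
    # L2-normalise so the descriptor is less sensitive to intensity scaling.
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

patch = np.tile(np.arange(8.0), (8, 1))   # horizontal intensity ramp
h = gradient_histogram(patch)
```

On a patch whose intensity rises only along x, all gradient energy falls into the first orientation bin; the descriptor sees only local gradients, which is exactly why such features are sensitive to illumination and background.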
With the advance of deep learning, which has proved effective in research fields such as vision and hearing, deep-learning-based methods have been applied in real life, video surveillance being one of them. Researchers build a network model on deep-learning theory and train it on labeled video data to obtain a model with recognition capability; such a model generalizes well and can classify untrained video data. Current practice, however, feeds the video directly into the neural network without processing the video data, or only applies exposure and distortion transformations; it does not highlight the feature information carried by the video, which hinders the detection of abnormal behaviors. In a truly complex background this can make abnormal behavior unrecognizable. Moreover, training deep network models requires large datasets and high-performance servers, which greatly limits practical video recognition work.
In summary, there is an urgent need in the industry to develop a method or system that can highlight feature information carried by video, reduce the amount of computation, and facilitate detection and identification of abnormal behaviors.
Disclosure of Invention
Aiming at the defect in the prior art that the feature information carried by the video is not highlighted, the invention provides a video behavior recognition method based on salient feature extraction.
The specific scheme of the application is as follows:
a video behavior recognition method based on salient feature extraction comprises the following steps:
S1, acquiring a video to be identified, and converting the video to be identified into an image;
S2, extracting salient features of the image and removing background information in the image;
S3, performing behavior recognition on the image;
S4, outputting the identified abnormal behavior.
Preferably, step S2 includes:
S21, inputting the image into a Res2Net backbone network; at each layer of the Res2Net backbone network the image features are divided into salient features and edge features, and the backbone network finally outputs the salient features S0 and the edge features E0;
S22, alternately training the salient features S0 and the edge features E0 in a CRU unit as supervision signals for each other, generating the salient features S and the edge features E;
S23, in the asymmetric cross module, computing a loss between the salient features S and the saliency label Label-S and a loss between the edge features E and the edge label Label-E; meanwhile, extracting an edge feature from the salient features S and computing a loss between it and the edge of the saliency label.
Preferably, step S21 comprises: the shallow image features first pass through a 1×1 convolution kernel; the feature map with n channels is then divided into 4 groups, each with its own group of 3×3 convolution kernels. The output features of one group are sent, together with the input feature maps of the next group, to that group's 3×3 convolution kernels, and this process is repeated twice until all input feature maps are processed. Finally, the feature maps of the 4 groups are concatenated and their information is fused by a 1×1 convolution kernel to obtain the multi-scale features.
Preferably, the CRU unit consists of 4 CRU structural units stacked end to end. In the stacking operation of step S22, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU structural units, S_i^n denotes the saliency feature generated after the features of the i-th layer pass through n CRU structural units, and ⊙ denotes the dot (element-wise) product.
Preferably, the implementation of step S23 is as follows:
G_x(x, y) = f(x, y) * g_x(x, y)   (5)
G_y(x, y) = f(x, y) * g_y(x, y)   (6)
G = F(G_x(x, y), G_y(x, y))   (7)
the function F in formula (7) takes both the vertical-direction edge features and the horizontal-direction edge features into account, and fuses the features of the horizontal and vertical directions according to formula (8) or (9);
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y)   (9)
after the binary image passes through formula (7), an edge feature map can be obtained; the loss function uses a binary cross entropy loss:
L = -Σ_i [p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i))]   (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
Preferably, step S3 includes: performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure; the network frame of the C3D network structure consists of 8 convolution layers built from R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; over 100 iterations in total, the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; the two fully connected layers have 4096 outputs, and classification is finally achieved by the softmax classification layer.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the salient target detection method and the behavior recognition method are combined for the first time, on one hand, the interested region in the video is extracted, the main characteristics are reserved, the background characteristics are shielded, and on the other hand, the operation amount is reduced.
Drawings
FIG. 1 is a schematic flow chart of a video behavior recognition method based on salient feature extraction according to an embodiment;
FIG. 2 is a network structure diagram of saliency detection according to an embodiment;
FIG. 3(a) is a unit structure diagram of the Res2Net backbone network according to an embodiment, and FIG. 3(b) is a unit structure diagram of the CRU unit according to an embodiment;
FIG. 4 is a 3D convolution diagram according to an embodiment;
FIG. 5 is a decomposition diagram of the R(2+1)D algorithm according to an embodiment;
FIG. 6 is a network frame diagram of the C3D network structure according to an embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Visual saliency (visual attention mechanism, VA) refers to the way humans, when facing a scene, automatically process regions of interest while selectively ignoring uninteresting regions; these regions of interest are called salient regions. Extracting the salient region of a specific target is called salient object detection. In the face of the complex background information in a video or image, extracting salient features is necessary: the main features can be retained and the background features masked. In abnormal behavior detection, the abnormal portion of an image or video can first be located by saliency detection, and the abnormal behavior then identified. The video behavior recognition method based on salient feature extraction in this scheme is built on a network framework consisting of a Res2Net backbone network, CRU units and an asymmetric cross module, and specifically comprises the following steps:
referring to fig. 1, a video behavior recognition method based on salient feature extraction includes:
S1, acquiring a video to be identified, and converting the video to be identified into an image;
S2, extracting salient features of the image and removing background information in the image;
S3, performing behavior recognition on the image;
S4, outputting the identified abnormal behavior.
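The four steps above can be sketched as one pipeline. Everything below is a hypothetical skeleton: the real S2 is the Res2Net/CRU saliency network and the real S3 is the C3D recognition network described later, replaced here by trivial stubs so only the control flow of S1-S4 is shown.

```python
# Hypothetical skeleton of steps S1-S4; every helper is a stand-in stub,
# not the networks the patent actually specifies.

def video_to_frames(video):
    """S1: treat the video as a sequence of frames (stub)."""
    return list(video)

def extract_salient_region(frame):
    """S2: keep only pixels flagged salient, discarding the background.
    Here a 'frame' is a dict of pixel -> (value, is_salient)."""
    return {p: v for p, (v, salient) in frame.items() if salient}

def recognize_behavior(salient_frames):
    """S3: classify the behavior from the salient regions (stub rule)."""
    activity = sum(len(f) for f in salient_frames)
    return "abnormal" if activity > 3 else "normal"

def recognize_video(video):
    frames = video_to_frames(video)                        # S1
    salient = [extract_salient_region(f) for f in frames]  # S2
    label = recognize_behavior(salient)                    # S3
    return label if label == "abnormal" else None          # S4: output abnormal only

video = [
    {(0, 0): (255, True), (0, 1): (12, False), (1, 0): (200, True)},
    {(0, 0): (250, True), (0, 1): (10, False), (1, 0): (190, True)},
]
result = recognize_video(video)
```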
In this embodiment, referring to fig. 2, step S2 includes:
S21, inputting the image into a Res2Net backbone network; at each layer of the Res2Net backbone network the image features are divided into salient features and edge features, and the backbone network finally outputs the salient features S0 and the edge features E0. The Res2Net backbone comprises 4 layers: layer1, layer2, layer3 and layer4. From a practical point of view, objects can appear in a picture at different sizes; for example, dining tables at different positions and a computer differ in size. Moreover, the object to be detected may carry more information than the area it occupies. Introducing the Res2Net backbone network gives each feature layer several receptive fields of different scales, imitating the brain's perception of salient targets of different scales and orientations in real life, and thereby avoids the problem that the algorithm fails to detect several abnormal behaviors in the video images.
Still further, as shown in part (a) of fig. 3, step S21 comprises: the shallow image features first pass through a 1×1 convolution kernel; the feature map with n channels is then divided into 4 groups, each with its own group of 3×3 convolution kernels. The output features of one group are sent, together with the input feature maps of the next group, to that group's 3×3 convolution kernels, and this process is repeated twice until all input feature maps are processed. Finally, the feature maps of the 4 groups are concatenated and their information is fused by a 1×1 convolution kernel to obtain the multi-scale features.
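A minimal sketch of the split-and-fuse idea just described, under simplifying assumptions: the learned 3×3 convolutions are replaced by a fixed 3×3 box filter, the leading and trailing 1×1 convolutions are omitted, and a single image of shape (channels, height, width) is processed.

```python
import numpy as np

def conv3x3(x):
    """Stand-in for a learned 3x3 convolution: a 3x3 box filter with
    zero padding, applied per channel (shape-preserving)."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def res2net_split_block(x):
    """Res2Net-style multi-scale block (sketch): split the channels into
    4 groups; each group after the first also sees the previous group's
    output, so later groups gain progressively larger receptive fields."""
    x1, x2, x3, x4 = np.split(x, 4, axis=0)
    y1 = x1                   # group 1: identity
    y2 = conv3x3(x2)          # group 2: one 3x3
    y3 = conv3x3(x3 + y2)     # group 3: effective 5x5 receptive field
    y4 = conv3x3(x4 + y3)     # group 4: effective 7x7 receptive field
    return np.concatenate([y1, y2, y3, y4], axis=0)  # fuse (1x1 conv omitted)

x = np.random.rand(8, 6, 6)   # 8 channels -> 4 groups of 2
y = res2net_split_block(x)
```

The shape is preserved while the four groups carry features at four different scales, which is what gives each layer several receptive fields at once.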
S22, the salient features S0 and the edge features E0 are alternately trained in a CRU unit as supervision signals for each other, generating the salient features S and the edge features E. Considering the logical relationship between salient object detection and edge detection, a CRU structure that fuses saliency features and edge features can be adopted; in this scheme the CRU units are combined with the Res2Net network to produce more discriminative features. As the input image undergoes multi-stage feature extraction in the CNN (here meaning both a common convolutional neural network and the CRU structure), the deeper the network, the more the dispersion of the image features is suppressed: the low-level features contain many background spatial details and more dispersed attention, while the high-level features focus on the salient target region. Therefore 4 CRU structural units, CRU1, CRU2, CRU3 and CRU4, are stacked end to end. As shown in part (b) of fig. 3, in the stacking operation of step S22, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU structural units, S_i^n denotes the saliency feature generated after the features of the i-th layer pass through n CRU structural units, and ⊙ denotes the dot (element-wise) product.
S23, in the asymmetric cross module, a loss is computed between the saliency feature S and the saliency label Label-S, and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the saliency feature S and a loss is computed between it and the edge of the saliency label. Although within the CRU unit the saliency feature S already incorporates the edge feature E to compensate for the edge information lost by the salient object, the effect of this combination on S is limited, and the contribution of E to S cannot be assessed directly. Seen from the output layer of the network, salient object detection evaluates only the saliency feature S, with the edge feature E serving merely as an auxiliary signal; this is unfavorable to the extraction of the edge features of S. Therefore, in addition to the CRU structure, this scheme also extracts the edge information of the salient target during training and computes a loss against the edge label information, thereby realizing a double cross fusion of the salient object detection network and the edge detection network.
For the edge extraction of binary saliency images, traditional edge detection operators such as Sobel and LoG are adopted; they consist of two convolution kernels, g_x(x, y) and g_y(x, y), which are convolved with the original image f(x, y).
The operators can be divided into a vertical-direction template and a horizontal-direction template: the former, G_x(x, y), detects the edges in the horizontal direction of the image, and the latter, G_y(x, y), detects the edges in the vertical direction. In practical application, every pixel point in the image is convolved with these two convolution kernels; the specific implementation of step S23 is as follows:
G_x(x, y) = f(x, y) * g_x(x, y)   (5)
G_y(x, y) = f(x, y) * g_y(x, y)   (6)
G = F(G_x(x, y), G_y(x, y))   (7)
the function F in formula (7) takes both the vertical-direction edge features and the horizontal-direction edge features into account, and fuses the features of the horizontal and vertical directions according to formula (8) or (9);
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y)   (9)
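Formulas (5)-(7), with the additive fusion of formula (9), can be sketched as follows. The convolution is written in the cross-correlation form used by most image libraries (kernels applied without flipping), with zero padding so the output keeps the input size; the test image with a vertical step edge is an assumed toy input.

```python
import numpy as np

def convolve2d(f, g):
    """'Same'-size 2-D filtering with zero padding (formulas (5)-(6)),
    in cross-correlation form as used by most image libraries."""
    kh, kw = g.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(f, ((ph, ph), (pw, pw)))
    h, w = f.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * g)
    return out

g_x = np.array([[-1, 0, 1],
                [-2, 0, 2],
                [-1, 0, 1]], dtype=float)   # Sobel template, derivative along x
g_y = g_x.T                                 # Sobel template, derivative along y

f = np.zeros((6, 6))
f[:, 3:] = 1.0                              # binary image with a step edge

G_x = convolve2d(f, g_x)
G_y = convolve2d(f, g_y)
G = G_x + G_y                               # fusion rule of formula (9)
```

Only G_x responds at the step (the image is constant along y away from the borders), so the fused map G localises the edge exactly where the step sits.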
after the binary image passes through formula (7), an edge feature map can be obtained; the loss function uses a binary cross entropy loss:
L = -Σ_i [p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i))]   (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
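A plain-Python sketch of the binary cross-entropy loss between the true values p(x_i) and the estimates q(x_i), averaged over pixels; the clamping constant eps is an implementation detail assumed here to avoid log(0).

```python
import math

def binary_cross_entropy(p, q, eps=1e-12):
    """Binary cross-entropy between true values p(x_i) and estimates
    q(x_i): -(1/N) * sum(p*log(q) + (1-p)*log(1-q))."""
    total = 0.0
    for pi, qi in zip(p, q):
        qi = min(max(qi, eps), 1 - eps)   # clamp to avoid log(0)
        total += pi * math.log(qi) + (1 - pi) * math.log(1 - qi)
    return -total / len(p)

# A confident, correct prediction gives a small loss; a confident,
# wrong prediction gives a large one.
good = binary_cross_entropy([1.0, 0.0], [0.9, 0.1])
bad = binary_cross_entropy([1.0, 0.0], [0.1, 0.9])
```

The asymmetry between the two cases (about 0.105 versus 2.303) is what drives the predicted saliency and edge maps toward their labels during training.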
In the training process of the salient feature extraction, the experimental platform can use the Ubuntu 14.04.3 LTS operating system with a 1080Ti graphics card, configured with Python 3.6 and PyTorch 0.4.0. The SGD algorithm is adopted as the optimizer, with 30 iterations and a learning rate of 0.002; a learning-rate schedule multiplies the learning rate by 0.1 after 20 iterations to help the optimized model converge, and the batch size is set to 8. The DUTS-TR dataset is used as the training set. The test video is the real output of a video monitoring device. After the salient features of the monitoring video are extracted, the background information is removed and only the behavior information of the salient region is retained.
In this embodiment, as shown in fig. 6, step S3 includes: performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure; the network frame of the C3D network structure consists of 8 convolution layers built from R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; over 100 iterations in total, the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; the two fully connected layers have 4096 outputs, and classification is finally achieved by the softmax classification layer.
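The step schedule described above (initial learning rate 0.003, multiplied by 0.1 every 20 of the 100 iterations) can be written out directly; the function name and its closed form are ours, the constants are from the text.

```python
def learning_rate(iteration, base_lr=0.003, drop_every=20, factor=0.1):
    """Step schedule: start at base_lr and multiply by `factor`
    every `drop_every` iterations."""
    return base_lr * factor ** (iteration // drop_every)

# The full 100-iteration schedule used for the recognition network.
schedule = [learning_rate(i) for i in range(100)]
```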
At present there are many human behavior databases; large-scale video datasets include Kinetics-400, Kinetics-600 and Sports-1M. For the abnormal behavior dataset, the VIF (Violent Flows) video database of crowd violence can be employed. These datasets can be used as training sets for the network parameters so that the behavior recognition network can recognize behaviors accurately; the videos in the training set need to be numerous enough to contain various kinds of behavior information.
With the development of deep learning, researchers have proposed numerous deep-learning-based behavior recognition algorithms for extracting spatiotemporal features from video, mainly of two kinds: those based on a two-stream (dual-flow) network structure and those based on a C3D network structure.
In behavior recognition based on a two-stream network structure, the two-stream network, a milestone of deep learning for analyzing the behavior of people in video, adopts 5 convolution layers and 2 fully connected layers and extends static-image recognition to video data such as UCF101 and HMDB. In this network, the spatial stream takes the RGB image as input, while the temporal stream takes the optical flow image, which carries the timing information, as input. With the proposal of the VGG network structure, the VGG16 network was taken as the feature extraction network, and the fusion of the two streams in time and space was considered. However, with a two-stream structure the optical flow images must be generated first, which consumes a relatively long time and gives relatively poor real-time performance; a behavior recognition algorithm based on the C3D network structure can therefore be adopted instead.
The 3D convolution is formed by extending the 2D convolution with a time dimension, with a frame diagram as shown in fig. 4: the convolution kernel is correspondingly expanded to 3 dimensions, the frame sequence passes through the convolution kernel in order, three continuous images pass through a convolution kernel of depth 3, and the feature values are finally mapped onto the feature maps. After the feature maps are connected to the frame sequence, the motion features of the person in the video can be obtained, as in formulas (11) and (12):
where v_ij^{xyz} denotes the feature value at pixel point (x, y, z) of the j-th feature map of the i-th layer after the 3D convolution; b_ij is the bias; m indexes the feature maps of the previous layer; R, P and Q correspond to the depth (time), length and width of the 3D convolution kernel; and w is the convolution-kernel weight of the feature map connection.
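A single-channel sketch of the 3D convolution just defined (one input feature map, no bias or activation, 'valid' borders); the temporal-difference kernel below is an assumed toy example showing how a depth-3 kernel reacts to motion across three consecutive frames.

```python
import numpy as np

def conv3d_valid(frames, kernel):
    """Single-channel 3D convolution sketch: slide an R x P x Q kernel
    over a frame stack along time, height and width ('valid' borders)."""
    R, P, Q = kernel.shape
    T, H, W = frames.shape
    out = np.zeros((T - R + 1, H - P + 1, W - Q + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(frames[t:t + R, y:y + P, x:x + Q] * kernel)
    return out

kernel = np.zeros((3, 3, 3))
kernel[0, 1, 1], kernel[2, 1, 1] = -1.0, 1.0   # temporal-difference detector

static = np.ones((5, 4, 4))                    # 5 identical frames: no motion
motion = conv3d_valid(static, kernel)

moving = np.stack([np.full((4, 4), float(t)) for t in range(5)])  # brightening
motion2 = conv3d_valid(moving, kernel)
```

A static scene yields a zero response while a changing one does not, which is exactly why the 3D kernel can extract temporal features without a separate optical-flow stream.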
Because of the introduction of the 3D convolution, the timing relation between single-frame images can be exploited: temporal features can be extracted at the same time as spatial features, so there is no need to use optical flow to carry the time stream. This, however, brings problems of computational cost and model storage. Therefore, when designing a network structure based on 3D convolution, the network must both extract spatiotemporal features effectively and keep the computational cost as low and the model storage as small as possible. The R(2+1)D algorithm replaces the 3×3×3 convolution kernel with a 1×3×3 spatial convolution kernel and a 3×1×1 temporal convolution kernel; its decomposition is shown schematically in fig. 5. As can be understood from the way the convolution is computed, the 1×3×3 convolution kernel operates on the two-dimensional image at a single time step, while the 3×1×1 convolution kernel operates only in the time dimension.
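The parameter bookkeeping behind this decomposition can be checked directly. With an intermediate channel width M chosen as suggested in the R(2+1)D literature (an assumption here, as the text does not fix it), the factorised 1×3×3 plus 3×1×1 pair matches the parameter count of a full 3×3×3 kernel while allowing an extra nonlinearity between the two steps.

```python
import math

def params_3d(c_in, c_out, t=3, d=3):
    """Parameters of a full t x d x d 3D convolution."""
    return c_in * c_out * t * d * d

def params_2plus1d(c_in, c_out, m, t=3, d=3):
    """Parameters of the (2+1)D factorisation: a 1 x d x d spatial
    convolution into m channels followed by a t x 1 x 1 temporal one."""
    return c_in * m * d * d + m * c_out * t

def matched_m(c_in, c_out, t=3, d=3):
    """Intermediate width that keeps the parameter count comparable
    (the choice suggested in the R(2+1)D literature)."""
    return math.floor(t * d * d * c_in * c_out / (d * d * c_in + t * c_out))

c_in = c_out = 64
m = matched_m(c_in, c_out)
full = params_3d(c_in, c_out)
factored = params_2plus1d(c_in, c_out, m)
```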
In conclusion, the real scene is captured by the camera and transmitted to the monitoring system; after transmission, the salient features of the video are extracted and the background information in the video is removed, the behaviors are recognized by the behavior recognition network, and finally the abnormal behaviors are output.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
Claims (4)
1. The video behavior recognition method based on the salient feature extraction is characterized by comprising the following steps of:
S1, acquiring a video to be identified, and converting the video to be identified into an image;
S2, extracting salient features of the image and removing background information in the image;
the step S2 includes:
S21, inputting the image into a Res2Net backbone network; at each layer of the Res2Net backbone network the image features are divided into salient features and edge features, and the backbone network finally outputs the salient features S0 and the edge features E0;
S22, alternately training the salient features S0 and the edge features E0 in a CRU unit as supervision signals for each other, generating the salient features S and the edge features E;
S23, in the asymmetric cross module, computing a loss between the salient features S and the saliency label Label-S and a loss between the edge features E and the edge label Label-E; meanwhile, extracting an edge feature from the salient features S and computing a loss between it and the edge of the saliency label;
the CRU unit consists of 4 CRU structural units stacked end to end; in the stacking operation of step S22, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU structural units, S_i^n denotes the saliency feature generated after the features of the i-th layer pass through n CRU structural units, and ⊙ denotes the dot (element-wise) product;
S3, performing behavior recognition on the image;
S4, outputting the identified abnormal behavior.
2. The video behavior recognition method based on salient feature extraction of claim 1, wherein step S21 comprises:
the shallow image features firstly pass through a convolution kernel of 1 multiplied by 1, then pass through a convolution kernel group which divides a feature map with n channels into 4 groups, the output features of the former group and the input feature map of the other group are transmitted to the convolution kernel of the next group of 3 multiplied by 3, and the process is repeated twice until all the input feature maps are processed; finally, the feature maps from the 4 groups are connected, and the information is fused together through a convolution kernel of 1×1 to obtain the multi-scale feature.
3. The video behavior recognition method based on salient feature extraction of claim 1, wherein the implementation of step S23 is as follows:
G_x(x,y) = f(x,y) * g_x(x,y) (5)
G_y(x,y) = f(x,y) * g_y(x,y) (6)
G = F(G_x(x,y), G_y(x,y)) (7)
the function F in formula (7) takes into account both the vertical-direction and horizontal-direction edge features, and fuses the features of the two directions according to formula (8) or formula (9):
F(G_x(x,y), G_y(x,y)) = sqrt(G_x(x,y)^2 + G_y(x,y)^2) (8)
F(G_x(x,y), G_y(x,y)) = G_x(x,y) + G_y(x,y) (9)
an edge feature map is obtained after the binary image passes through formula (7); the loss function uses binary cross entropy:
Loss = -Σ_i [ p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i)) ]
wherein p(x_i) is the true value and q(x_i) is the estimated value.
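A small sketch of the edge extraction and loss of step S23. The claim does not name the gradient operator, so the standard Sobel kernels are assumed for g_x and g_y; the fusion follows the additive form of formula (9), with absolute values taken here (an assumption) so gradients of either sign register:

```python
import numpy as np

# Sobel kernels for the horizontal and vertical gradients (assumption).
GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
GY = GX.T

def conv2d(f, g):
    # 'Valid' 2-D cross-correlation implementing f(x,y) * g(x,y) in (5)-(6).
    h, w = f.shape[0] - 2, f.shape[1] - 2
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(f[i:i+3, j:j+3] * g)
    return out

def edge_map(f):
    gx = conv2d(f, GX)              # formula (5)
    gy = conv2d(f, GY)              # formula (6)
    return np.abs(gx) + np.abs(gy)  # formula (7), fused per formula (9)

def bce_loss(p, q, eps=1e-7):
    # Binary cross entropy between true map p and estimated map q.
    q = np.clip(q, eps, 1 - eps)
    return -np.mean(p * np.log(q) + (1 - p) * np.log(1 - q))
```

On a binary image with a single vertical step edge, the horizontal Sobel response is constant along the edge and the vertical response vanishes, as expected.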
4. The video behavior recognition method based on salient feature extraction of claim 1, wherein step S3 comprises: performing behavior recognition on the image by adopting a behavior recognition algorithm based on a C3D network structure; the network framework of the C3D network structure is composed of 8 convolution layers built from R(2+1)D convolution modules, 5 maximum pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout techniques are added to optimize the network structure;
the model is trained with stochastic gradient descent at an initial learning rate of 0.003; training runs for 100 iterations in total, and the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; the two fully connected layers each have 4096 outputs, and classification is finally achieved by a softmax classification layer.
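The stated schedule (initial rate 0.003, multiplied by 0.1 every 20 of the 100 iterations) is an ordinary step decay and can be written as a small function; the name `learning_rate` is illustrative:

```python
def learning_rate(iteration, base_lr=0.003, step=20, gamma=0.1):
    # Step-decay schedule described in the claim: the base rate is
    # scaled by gamma once per completed block of `step` iterations.
    return base_lr * (gamma ** (iteration // step))
```

Over the 100 iterations this yields rates 0.003, 3e-4, 3e-5, 3e-6 and 3e-7 for the five 20-iteration blocks.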
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010210957.4A CN111488805B (en) | 2020-03-24 | 2020-03-24 | Video behavior recognition method based on salient feature extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111488805A CN111488805A (en) | 2020-08-04 |
CN111488805B true CN111488805B (en) | 2023-04-25 |
Family
ID=71794420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010210957.4A Active CN111488805B (en) | 2020-03-24 | 2020-03-24 | Video behavior recognition method based on salient feature extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111488805B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931793B (en) * | 2020-08-17 | 2024-04-12 | 湖南城市学院 | Method and system for extracting saliency target |
CN113343760A (en) * | 2021-04-29 | 2021-09-03 | 暖屋信息科技(苏州)有限公司 | Human behavior recognition method based on multi-scale characteristic neural network |
CN113205051B (en) * | 2021-05-10 | 2022-01-25 | 中国科学院空天信息创新研究院 | Oil storage tank extraction method based on high spatial resolution remote sensing image |
CN113379643B (en) * | 2021-06-29 | 2024-05-28 | 西安理工大学 | Image denoising method based on NSST domain and Res2Net network |
CN113537375B (en) * | 2021-07-26 | 2022-04-05 | 深圳大学 | Diabetic retinopathy grading method based on multi-scale cascade |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
CN110852295A (en) * | 2019-10-15 | 2020-02-28 | 深圳龙岗智能视听研究院 | Video behavior identification method based on multitask supervised learning |
Non-Patent Citations (1)
Title |
---|
Wang Xiaofang; Qi Chun. A behavior recognition method using saliency detection. Journal of Xi'an Jiaotong University, 2017, (02), pp. 29-34. *
Also Published As
Publication number | Publication date |
---|---|
CN111488805A (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111488805B (en) | Video behavior recognition method based on salient feature extraction | |
CN113936339B (en) | Fighting identification method and device based on double-channel cross attention mechanism | |
CN111126379B (en) | Target detection method and device | |
WO2022134655A1 (en) | End-to-end video action detection and positioning system | |
US8494259B2 (en) | Biologically-inspired metadata extraction (BIME) of visual data using a multi-level universal scene descriptor (USD) | |
CN111291809B (en) | Processing device, method and storage medium | |
CN110929622A (en) | Video classification method, model training method, device, equipment and storage medium | |
CN107977661B (en) | Region-of-interest detection method based on FCN and low-rank sparse decomposition | |
CN104504395A (en) | Method and system for achieving classification of pedestrians and vehicles based on neural network | |
CN110232361B (en) | Human behavior intention identification method and system based on three-dimensional residual dense network | |
CN110222718B (en) | Image processing method and device | |
CN110852222A (en) | Campus corridor scene intelligent monitoring method based on target detection | |
KR102309111B1 (en) | Ststem and method for detecting abnomalous behavior based deep learning | |
CN110532959B (en) | Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network | |
CN113920581A (en) | Method for recognizing motion in video by using space-time convolution attention network | |
WO2023159898A1 (en) | Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium | |
Li et al. | Transmission line detection in aerial images: An instance segmentation approach based on multitask neural networks | |
CN114580541A (en) | Fire disaster video smoke identification method based on time-space domain double channels | |
Hu et al. | Parallel spatial-temporal convolutional neural networks for anomaly detection and location in crowded scenes | |
Vu et al. | A multi-task convolutional neural network with spatial transform for parking space detection | |
CN111104924B (en) | Processing algorithm for identifying low-resolution commodity image | |
CN112396036A (en) | Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction | |
Le Louedec et al. | Segmentation and detection from organised 3D point clouds: A case study in broccoli head detection | |
CN108764287B (en) | Target detection method and system based on deep learning and packet convolution | |
CN113822134A (en) | Instance tracking method, device, equipment and storage medium based on video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||