CN110287826B - Video target detection method based on attention mechanism - Google Patents
Video target detection method based on attention mechanism
- Publication number
- CN110287826B (application CN201910499786.9A; also published as CN110287826A)
- Authority
- CN
- China
- Prior art keywords
- feature
- detected
- frame
- candidate
- fused
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a video target detection method based on an attention mechanism, in the field of computer vision. The method comprises the following steps: step S1, extracting a candidate feature map of the current frame; step S2, setting a fusion window over the past time period, computing the Laplacian variance of each frame in the window, normalizing the variances to obtain per-frame weights, forming a temporal feature as the weighted sum of the candidate feature maps of all frames in the window, and concatenating the candidate feature of the current frame with the temporal feature to obtain the feature map to be detected; step S3, extracting feature maps at additional scales from the feature map to be detected using convolutional layers; step S4, predicting object class and position on the feature maps of different scales using convolutional layers. By assigning different weights to past frame features of different quality, the proposed feature fusion makes fuller use of temporal information and improves the performance of the detection model.
Description
Technical Field
The invention relates to computer vision, deep learning and video target detection technology.
Background
Image object detection based on deep learning has made great progress in the last five years, with networks such as the R-CNN series, SSD, and the YOLO series. However, in fields such as video surveillance and vehicle-assisted driving, video-based object detection is in wider demand. Because of motion blur, occlusion, and large variations in shape and illumination within video, applying an image object detection technique alone to video frames does not yield good results. Adjacent frames in a video are continuous in time and similar in space, and object positions are correlated across frames; exploiting this temporal information is therefore the key to improving video object detection performance.
Current video object detection frameworks fall into three main categories. The first treats video frames as independent images and applies an image object detection algorithm; it ignores temporal information and detects each frame separately, so its results are not ideal. The second combines object detection with object tracking, post-processing the detection results in order to track objects; tracking accuracy then depends on detection quality, and errors propagate easily. The third detects only a few key frames and generates the features of the remaining frames from optical flow information and the key-frame features; this method uses temporal information, but optical flow is very expensive to compute, making fast detection difficult.
Disclosure of Invention
The invention aims to provide a fast and accurate video object detection method that fully fuses temporal features.
In order to solve the technical problem, the invention provides a video target detection method based on an attention mechanism, which comprises the following steps:
step S1, inputting the video frame image of the current time point into a MobileNet network to extract a candidate feature map;
step S2, setting a temporal feature fusion window over the past time period adjacent to the current time point; computing the Laplacian variance of each video frame image to be fused in the window; normalizing these variances to obtain the fusion weight of each frame to be fused; forming the temporal feature required by the current frame as the weighted sum of the candidate feature maps of all frames to be fused according to these weights; and concatenating the candidate feature of the current frame with the temporal feature along the channel dimension to obtain the feature map to be detected, fused with temporal information;
step S3, extracting feature maps to be detected at additional scales from the feature map to be detected, using a convolutional feature extraction layer and a max pooling layer;
and step S4, on the feature maps to be detected at different scales, predicting the object class and bounding box coordinates of the current frame using convolutional layers.
Further, in step S1, to detect the video frame at the current time point t, the video frame image $I_t$ (of height $H_I$ and width $W_I$, where $\mathbb{R}$ denotes the real numbers) is first input into a MobileNet network for feature extraction, yielding a candidate feature map $F_t \in \mathbb{R}^{C_1 \times H_1 \times W_1}$, where $C_1$, $H_1$, and $W_1$ are the number of feature channels, the height, and the width of the candidate feature map.
Further, in step S2, a feature fusion window of width $w = s$ is set over the past time period preceding the current time point t. The video frame images to be fused in the feature fusion window are $\{I_{t-i}\}_{i\in[1,s]}$, and the corresponding candidate feature maps are $\{F_{t-i}\}_{i\in[1,s]}$. Each video frame image $I_{t-i}$ to be fused is converted into a grayscale map $G_{t-i}$, and the Laplacian variance of the image is computed on the grayscale map; the Laplacian of the grayscale map $G$ at coordinate $(x,y)$ is $\nabla^2 G(x,y) = \frac{\partial^2 G(x,y)}{\partial x^2} + \frac{\partial^2 G(x,y)}{\partial y^2}$. By computing the second derivative of every pixel in each direction, the Laplacian captures regions of the image where pixel values change rapidly and can be used to detect corners; the Laplacian variance of an image reflects how pixel values vary over the whole image, so a large Laplacian variance indicates a sharp image and a small one a blurred image.
First, the Laplacian mean of each grayscale map $G_{t-i}$ is computed, $\mu_{t-i} = \frac{1}{H_I W_I}\sum_{x,y}\nabla^2 G_{t-i}(x,y)$, where $H_I$ and $W_I$ are the height and width of the grayscale map; the Laplacian variance is then $\sigma^2_{t-i} = \frac{1}{H_I W_I}\sum_{x,y}\left(\nabla^2 G_{t-i}(x,y) - \mu_{t-i}\right)^2$.
If a video frame is sharp, its candidate features help detect objects, whereas frames blurred by moving objects have candidate features that hinder detection. Video frames of different sharpness should therefore be assigned different fusion weights, so that the detection model attends more to sharp features than to blurred ones. The fusion weight of every frame to be fused is first computed by normalizing the Laplacian variances over the window: $\alpha_{t-i} = \sigma^2_{t-i}\big/\sum_{j=1}^{s}\sigma^2_{t-j}$.
The candidate features of the frames in the fusion window are fused by weighted summation to obtain the temporal feature of the current time point, $\sum_{i=1}^{s}\alpha_{t-i}F_{t-i}$, and this temporal feature is concatenated with the candidate feature of the current frame along the channel dimension, completing the fusion of temporal information and yielding the first feature map to be detected (with $2C_1$ channels).
Further, in step S3, after obtaining at the current time point the feature map to be detected in which the temporal feature has been fused, feature maps to be detected at additional scales are produced: a 3×3 convolutional layer and a 2×2 pooling layer perform further feature extraction on the feature map to be detected while reducing its size. Large feature maps are rich in local information and are suited to predicting small objects, while small feature maps carry stronger global semantic information and are suited to detecting large objects. After e-1 rounds of feature extraction, e feature maps to be detected are finally obtained.
Further, in step S4, anchor boxes with prior positions are set on the multi-scale feature maps to be detected obtained by the additional feature extraction, and two 3×3 convolutional layers predict, along the channel dimension of each feature map to be detected, the offsets of the target bounding boxes relative to the anchor boxes and the target classes, respectively. Let the number of classes be d (including background); for each feature map to be detected $F_i \in \mathbb{R}^{C_{F_i}\times H_{F_i}\times W_{F_i}}$ with $n_i$ anchor boxes per pixel position, the 3×3 convolutional class prediction layer yields a classification prediction of shape $(n_i \cdot d)\times H_{F_i}\times W_{F_i}$, and the 3×3 convolutional bounding-box prediction layer yields a bounding-box prediction of shape $(n_i \cdot 4)\times H_{F_i}\times W_{F_i}$.
Drawings
FIG. 1 is a schematic of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views, and merely illustrate the basic structure of the present invention in a schematic manner, and thus show only the constitution related to the present invention.
Example 1
As shown in FIG. 1, the present example provides a video object detection method based on an attention mechanism, comprising the following steps:
Step S1, inputting the video frame image of the current time point into a MobileNet network to extract a candidate feature map;
Step S2, setting a temporal feature fusion window over the past time period adjacent to the current time point; computing the Laplacian variance of each video frame image to be fused in the window; normalizing these variances to obtain the fusion weight of each frame to be fused; forming the temporal feature required by the current frame as the weighted sum of the candidate feature maps of all frames to be fused according to these weights; and concatenating the candidate feature of the current frame with the temporal feature along the channel dimension to obtain the feature map to be detected, fused with temporal information;
Step S3, extracting feature maps to be detected at additional scales from the feature map to be detected, using a convolutional feature extraction layer and a max pooling layer;
Step S4, on the feature maps to be detected at different scales, predicting the object class and bounding box coordinates of the current frame using convolutional layers.
In step S1, to detect the video frame at the current time point t, the video frame image $I_t$ (of height $H_I$ and width $W_I$) is first input into the MobileNet network for feature extraction, yielding a candidate feature map $F_t \in \mathbb{R}^{C_1 \times H_1 \times W_1}$, where $C_1$, $H_1$, and $W_1$ are the number of channels, the height, and the width of the candidate feature map.
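The following is a minimal sketch, under stated assumptions, of how such a candidate feature map could be extracted. The patent only names "Mobilenet"; the choice of torchvision's MobileNetV2 feature extractor and the helper name extract_candidate_feature are illustrative assumptions, not the patented implementation.

```python
# Sketch of step S1: extracting the candidate feature map F_t of the current frame.
# Assumption: torchvision's MobileNetV2 backbone stands in for the "Mobilenet network".
import torch
import torchvision

backbone = torchvision.models.mobilenet_v2().features  # output channels C1 = 1280

def extract_candidate_feature(frame_rgb):
    """frame_rgb: H x W x 3 uint8 numpy video frame -> candidate feature map F_t of shape (C1, H1, W1)."""
    x = torch.from_numpy(frame_rgb).float().permute(2, 0, 1).unsqueeze(0) / 255.0  # (1, 3, H_I, W_I)
    with torch.no_grad():
        f_t = backbone(x)          # (1, C1, H1, W1)
    return f_t.squeeze(0)          # (C1, H1, W1)
```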
In step S2, a feature fusion window of width w is set over the past time period of the current time point t. Let the length of the past time period be q; the window width is set as follows: if q is greater than s, the window width is set to s, and if q is less than s (so that not enough past features are available), the window width is set to q.
The video frame images to be fused in the feature fusion window are $\{I_{t-i}\}_{i\in[1,s]}$, and the corresponding candidate feature maps are $\{F_{t-i}\}_{i\in[1,s]}$. Each video frame image $I_{t-i}$ to be fused is converted into a grayscale map $G_{t-i}$, and the Laplacian variance of the image is computed on the grayscale map. The Laplacian of the grayscale map $G$ at coordinate $(x,y)$ is $\nabla^2 G(x,y) = \frac{\partial^2 G(x,y)}{\partial x^2} + \frac{\partial^2 G(x,y)}{\partial y^2}$, where $G(x,y)$ denotes the pixel value of the grayscale map $G$ at coordinate $(x,y)$. By computing the second derivative of every pixel in each direction, the Laplacian captures regions of the image where pixel values change rapidly and can be used to detect corners; the Laplacian variance of an image reflects how pixel values vary over the whole image, so a large Laplacian variance indicates a sharp image and a small one a blurred image.
First, the Laplacian mean of each grayscale map $G_{t-i}$ is computed, $\mu_{t-i} = \frac{1}{H_I W_I}\sum_{x,y}\nabla^2 G_{t-i}(x,y)$, where $H_I$ and $W_I$ are the height and width of the grayscale map; the Laplacian variance is then $\sigma^2_{t-i} = \frac{1}{H_I W_I}\sum_{x,y}\left(\nabla^2 G_{t-i}(x,y) - \mu_{t-i}\right)^2$.
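A short sketch of this sharpness measure is given below, assuming OpenCV's discrete Laplacian as the concrete operator (the patent does not fix the kernel):

```python
# Sketch of the per-frame sharpness measure of step S2: grayscale conversion followed by
# the variance of the Laplacian response.
import cv2
import numpy as np

def laplacian_variance(frame_bgr: np.ndarray) -> float:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)          # grayscale map G_{t-i}
    lap = cv2.Laplacian(gray.astype(np.float64), cv2.CV_64F)    # second-derivative response
    mu = lap.mean()                                              # Laplacian mean mu_{t-i}
    return float(((lap - mu) ** 2).mean())                       # Laplacian variance sigma^2_{t-i}
```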
If a video frame is sharp, its candidate features help detect objects, whereas frames blurred by moving objects have candidate features that hinder detection. Video frames of different sharpness should therefore be assigned different fusion weights, with sharper frames receiving larger weights, so that the detection model attends more to sharp features than to blurred ones. The fusion weight of every frame to be fused is first computed by normalizing the Laplacian variances over the window: $\alpha_{t-i} = \sigma^2_{t-i}\big/\sum_{j=1}^{s}\sigma^2_{t-j}$.
The candidate features of the frames in the fusion window are fused by weighted summation to obtain the temporal feature of the current time point, $\sum_{i=1}^{s}\alpha_{t-i}F_{t-i}$, and this temporal feature is concatenated with the candidate feature of the current frame along the channel dimension, completing the fusion of temporal information and yielding the first feature map to be detected (with $2C_1$ channels).
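The fusion step, including the window-width rule above, could be sketched as follows. The plain sum-normalization of the variances is an assumption consistent with "normalizing the Laplacian variance" (a softmax would be an alternative reading), and the function name fuse_with_current is illustrative.

```python
# Sketch of the attention-style temporal fusion of step S2.
import torch

def fuse_with_current(f_t, past_feats, past_variances, s=4):
    """
    f_t:            (C1, H1, W1) candidate feature of the current frame
    past_feats:     list of (C1, H1, W1) candidate features F_{t-i}, most recent first
    past_variances: list of Laplacian variances sigma^2_{t-i} of the same frames
    s:              maximum fusion-window width
    """
    w = min(s, len(past_feats))                                # window-width rule of the embodiment
    var = torch.tensor(past_variances[:w], dtype=torch.float32)
    alpha = var / var.sum()                                    # normalized fusion weights alpha_{t-i}
    stacked = torch.stack(past_feats[:w])                      # (w, C1, H1, W1)
    temporal = (alpha.view(w, 1, 1, 1) * stacked).sum(dim=0)   # weighted-sum temporal feature
    return torch.cat([f_t, temporal], dim=0)                   # (2*C1, H1, W1) feature map to be detected
```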
In step S3, after the feature map to be detected fused with the temporal feature has been obtained at the current time point, feature maps to be detected at further scales are produced: the convolutional layer and the pooling layer perform further feature extraction on the feature map to be detected while reducing its size. Large feature maps are rich in local information and are suited to predicting small objects, while small feature maps carry stronger global semantic information and are suited to detecting large objects. After e-1 rounds of feature extraction, e feature maps to be detected are finally obtained.
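A minimal sketch of this multi-scale extraction is given below; the number of scales e and the choice of keeping the channel count constant across rounds are illustrative assumptions, since the patent specifies only the 3×3 convolution and 2×2 pooling per round.

```python
# Sketch of step S3: e-1 rounds of 3x3 convolution + 2x2 max pooling produce e detection maps.
import torch.nn as nn

class ExtraScaleExtractor(nn.Module):
    def __init__(self, in_channels: int, e: int = 4):
        super().__init__()
        self.stages = nn.ModuleList()
        for _ in range(e - 1):                            # e-1 extraction rounds
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2),              # halves the spatial size
            ))

    def forward(self, x):
        # x: (N, 2*C1, H1, W1) fused feature map to be detected
        maps = [x]                                        # the fused map is the first detection map
        for stage in self.stages:
            x = stage(x)
            maps.append(x)
        return maps                                       # e feature maps to be detected
```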
In step S4, anchor boxes with prior positions are set on the multi-scale feature maps to be detected obtained by the additional feature extraction, and two convolutional layers predict, along the channel dimension of each feature map to be detected, the offsets of the target bounding boxes relative to the anchor boxes and the target classes, respectively. Let the number of classes be d (including background); for each feature map to be detected $F_i \in \mathbb{R}^{C_{F_i}\times H_{F_i}\times W_{F_i}}$, where $C_{F_i}$, $H_{F_i}$, and $W_{F_i}$ are the number of channels, height, and width of that feature map and $n_i$ is the number of anchor boxes per pixel position, the convolutional class prediction layer yields a classification prediction of shape $(n_i \cdot d)\times H_{F_i}\times W_{F_i}$ and the convolutional bounding-box prediction layer yields a bounding-box prediction of shape $(n_i \cdot 4)\times H_{F_i}\times W_{F_i}$.
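The prediction heads could be sketched as below; the per-scale channel counts and anchor numbers passed to build_heads are illustrative assumptions, and the function names are hypothetical.

```python
# Sketch of the step S4 heads: per scale, one 3x3 convolution predicts n_i*d class scores
# per position and another predicts n_i*4 bounding-box offsets per position.
import torch.nn as nn

def build_heads(channels_per_scale, anchors_per_scale, num_classes_d):
    cls_heads, box_heads = nn.ModuleList(), nn.ModuleList()
    for c_fi, n_i in zip(channels_per_scale, anchors_per_scale):
        cls_heads.append(nn.Conv2d(c_fi, n_i * num_classes_d, kernel_size=3, padding=1))
        box_heads.append(nn.Conv2d(c_fi, n_i * 4, kernel_size=3, padding=1))
    return cls_heads, box_heads

def predict(feature_maps, cls_heads, box_heads):
    # feature_maps: list of (N, C_Fi, H_Fi, W_Fi) detection maps from step S3
    cls_preds = [head(f) for head, f in zip(cls_heads, feature_maps)]   # (N, n_i*d, H_Fi, W_Fi)
    box_preds = [head(f) for head, f in zip(box_heads, feature_maps)]   # (N, n_i*4, H_Fi, W_Fi)
    return cls_preds, box_preds
```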
Claims (4)
1. A video target detection method based on an attention mechanism is characterized by comprising the following steps:
step S1, inputting the video frame image of the current time point into a MobileNet network to extract a candidate feature map;
step S2, setting a temporal feature fusion window over the past time period adjacent to the current time point; computing the Laplacian variance of each video frame image to be fused in the window; normalizing these variances to obtain the fusion weight of each frame to be fused; forming the temporal feature required by the current frame as the weighted sum of the candidate feature maps of all frames to be fused according to these weights; and concatenating the candidate feature of the current frame with the temporal feature along the channel dimension to obtain the feature map to be detected, fused with temporal information;
step S3, extracting feature maps to be detected at additional scales from the feature map to be detected, using a convolutional feature extraction layer and a max pooling layer;
and step S4, on the feature maps to be detected at different scales, predicting the object class and bounding box coordinates of the current frame using convolutional layers.
2. The attention mechanism-based video object detection method of claim 1,
in step S1, to detect the video frame at the current time point t, the video frame image $I_t$ of the current time point is first input into a MobileNet network for feature extraction to obtain a candidate feature map $F_t$, where $H_I$ and $W_I$ are the height and width of the video frame, $\mathbb{R}$ denotes the real numbers, and the extracted candidate feature map satisfies $F_t \in \mathbb{R}^{C_1 \times H_1 \times W_1}$, with $C_1$, $H_1$, and $W_1$ being the number of feature channels, the height, and the width of the candidate feature map.
3. The attention mechanism-based video object detection method of claim 2,
in step S2, a feature fusion window of width $w = s$ is set over the past time period of the current time point t; the video frame images to be fused in the feature fusion window are $\{I_{t-i}\}_{i\in[1,s]}$, and the corresponding candidate feature maps are $\{F_{t-i}\}_{i\in[1,s]}$; each video frame image $I_{t-i}$ to be fused is converted into a grayscale map $G_{t-i}$;
the Laplacian variance $\sigma^2_{t-i}$ of each grayscale map $G_{t-i}$ is computed, and the fusion weights $\alpha_{t-i}$ of all video frames to be fused are obtained by normalizing the Laplacian variances; the candidate features of the frames in the fusion window are fused by weighted summation to obtain the temporal feature of the current time point, and the temporal feature is concatenated with the candidate feature of the current frame along the channel dimension, completing the fusion of temporal information and yielding the first feature map to be detected.
4. The attention mechanism-based video object detection method of claim 3,
in step S3, after the feature map to be detected fused with the temporal feature has been obtained at the current time point, further feature extraction is performed on it with a 3×3 convolutional layer and a 2×2 pooling layer while its size is reduced; after e-1 rounds of feature extraction, e feature maps to be detected are finally obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910499786.9A CN110287826B (en) | 2019-06-11 | 2019-06-11 | Video target detection method based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910499786.9A CN110287826B (en) | 2019-06-11 | 2019-06-11 | Video target detection method based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110287826A CN110287826A (en) | 2019-09-27 |
CN110287826B true CN110287826B (en) | 2021-09-17 |
Family
ID=68003699
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910499786.9A Active CN110287826B (en) | 2019-06-11 | 2019-06-11 | Video target detection method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287826B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674886B (en) * | 2019-10-08 | 2022-11-25 | 中兴飞流信息科技有限公司 | Video target detection method fusing multi-level features |
CN110751646A (en) * | 2019-10-28 | 2020-02-04 | 支付宝(杭州)信息技术有限公司 | Method and device for identifying damage by using multiple image frames in vehicle video |
CN111310609B (en) * | 2020-01-22 | 2023-04-07 | 西安电子科技大学 | Video target detection method based on time sequence information and local feature similarity |
CN114450720A (en) * | 2020-08-18 | 2022-05-06 | 深圳市大疆创新科技有限公司 | Target detection method and device and vehicle-mounted radar |
CN112016472B (en) * | 2020-08-31 | 2023-08-22 | 山东大学 | Driver attention area prediction method and system based on target dynamic information |
CN112434607B (en) * | 2020-11-24 | 2023-05-26 | 北京奇艺世纪科技有限公司 | Feature processing method, device, electronic equipment and computer readable storage medium |
CN112686913B (en) * | 2021-01-11 | 2022-06-10 | 天津大学 | Object boundary detection and object segmentation model based on boundary attention consistency |
CN112561001A (en) * | 2021-02-22 | 2021-03-26 | 南京智莲森信息技术有限公司 | Video target detection method based on space-time feature deformable convolution fusion |
CN113688801B (en) * | 2021-10-22 | 2022-02-15 | 南京智谱科技有限公司 | Chemical gas leakage detection method and system based on spectrum video |
CN114594770B (en) * | 2022-03-04 | 2024-04-26 | 深圳市千乘机器人有限公司 | Inspection method for inspection robot without stopping |
CN115131710B (en) * | 2022-07-05 | 2024-09-03 | 福州大学 | Real-time action detection method based on multiscale feature fusion attention |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102393958A (en) * | 2011-07-16 | 2012-03-28 | 西安电子科技大学 | Multi-focus image fusion method based on compressive sensing |
CN105913404A (en) * | 2016-07-01 | 2016-08-31 | 湖南源信光电科技有限公司 | Low-illumination imaging method based on frame accumulation |
CN107481238A (en) * | 2017-09-20 | 2017-12-15 | 众安信息技术服务有限公司 | Image quality measure method and device |
CN108921803A (en) * | 2018-06-29 | 2018-11-30 | 华中科技大学 | A kind of defogging method based on millimeter wave and visual image fusion |
CN109104568A (en) * | 2018-07-24 | 2018-12-28 | 苏州佳世达光电有限公司 | The intelligent cleaning driving method and drive system of monitoring camera |
CN109684912A (en) * | 2018-11-09 | 2019-04-26 | 中国科学院计算技术研究所 | A kind of video presentation method and system based on information loss function |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103152513B (en) * | 2011-12-06 | 2016-05-25 | 瑞昱半导体股份有限公司 | Image processing method and relevant image processing apparatus |
CN103702032B (en) * | 2013-12-31 | 2017-04-12 | 华为技术有限公司 | Image processing method, device and terminal equipment |
US10395118B2 (en) * | 2015-10-29 | 2019-08-27 | Baidu Usa Llc | Systems and methods for video paragraph captioning using hierarchical recurrent neural networks |
US10169656B2 (en) * | 2016-08-29 | 2019-01-01 | Nec Corporation | Video system using dual stage attention based recurrent neural network for future event prediction |
CN109829398B (en) * | 2019-01-16 | 2020-03-31 | 北京航空航天大学 | Target detection method in video based on three-dimensional convolution network |
Non-Patent Citations (2)
Title |
---|
Infrared dim target detection based on visual attention;Xin Wang;《Infrared Physics & Technology》;20121130;513-521 * |
Image sharpness evaluation algorithm based on lifting wavelet transform; Wang Xin; Wanfang Data Knowledge Service Platform; 20100322; 52-57 *
Also Published As
Publication number | Publication date |
---|---|
CN110287826A (en) | 2019-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110287826B (en) | Video target detection method based on attention mechanism | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN108830285B (en) | Target detection method for reinforcement learning based on fast-RCNN | |
Zhou et al. | Efficient road detection and tracking for unmanned aerial vehicle | |
US9042648B2 (en) | Salient object segmentation | |
CN110738673A (en) | Visual SLAM method based on example segmentation | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN113591968A (en) | Infrared weak and small target detection method based on asymmetric attention feature fusion | |
CN108564598B (en) | Improved online Boosting target tracking method | |
CN111723693A (en) | Crowd counting method based on small sample learning | |
CN110942471A (en) | Long-term target tracking method based on space-time constraint | |
CN112232371A (en) | American license plate recognition method based on YOLOv3 and text recognition | |
CN112883850A (en) | Multi-view aerospace remote sensing image matching method based on convolutional neural network | |
Lu et al. | Superthermal: Matching thermal as visible through thermal feature exploration | |
CN114299383A (en) | Remote sensing image target detection method based on integration of density map and attention mechanism | |
CN117949942B (en) | Target tracking method and system based on fusion of radar data and video data | |
CN115147418B (en) | Compression training method and device for defect detection model | |
CN111723660A (en) | Detection method for long ground target detection network | |
CN111833353B (en) | Hyperspectral target detection method based on image segmentation | |
CN109377511A (en) | Motion target tracking method based on sample combination and depth detection network | |
CN113496480A (en) | Method for detecting weld image defects | |
CN111414938B (en) | Target detection method for bubbles in plate heat exchanger | |
CN116342894A (en) | GIS infrared feature recognition system and method based on improved YOLOv5 | |
CN116645592A (en) | Crack detection method based on image processing and storage medium | |
CN116758340A (en) | Small target detection method based on super-resolution feature pyramid and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||