CN110287826B - Video target detection method based on attention mechanism - Google Patents
Video target detection method based on attention mechanism
- Publication number
- CN110287826B (application CN201910499786.9A; also published as CN110287826A)
- Authority
- CN
- China
- Prior art keywords
- feature
- detected
- frame
- candidate
- fused
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a video target detection method based on an attention mechanism, in the field of computer vision. The method comprises the following steps: step S1, extracting a candidate feature map of the current frame; step S2, setting a fusion window over the past time period, computing the Laplacian variance of each frame in the window, normalizing the variances to obtain per-frame weights, forming a temporal feature as the weighted sum of the candidate feature maps of all frames in the window, and concatenating the candidate feature of the current frame with the temporal feature to obtain the feature map to be detected; step S3, extracting feature maps at additional scales from the feature map to be detected using convolutional layers; step S4, predicting object class and position on the feature maps of different scales using convolutional layers. By assigning different weights to past frame features of different quality, the proposed feature fusion makes fuller use of temporal information and improves the performance of the detection model.
Description
Technical Field
The invention relates to computer vision, deep learning and video target detection technology.
Background
Image object detection based on deep learning has made great progress in the last five years, with networks such as the R-CNN series, SSD, and the YOLO series. However, in fields such as video surveillance and vehicle-assisted driving, video-based object detection is in wider demand. Because of motion blur, occlusion, and large variations in shape and illumination within video, applying an image object detection technique alone to video frames does not yield good results. Adjacent frames in a video are continuous in time and similar in space, and object positions are correlated across frames; exploiting this temporal information is therefore the key to improving video object detection performance.
Current video object detection frameworks fall into three main categories. The first treats video frames as independent images and applies an image object detection algorithm; it ignores temporal information and detects each frame separately, so its results are not ideal. The second combines object detection with object tracking, post-processing the detection results in order to track objects; tracking accuracy then depends on detection quality, and errors propagate easily. The third detects only a few key frames and generates the features of the remaining frames from optical flow information and the key-frame features; this method uses temporal information, but optical flow is very expensive to compute, making fast detection difficult.
Disclosure of Invention
The invention aims to provide a fast and accurate video object detection method that fully fuses temporal features.
In order to solve the technical problem, the invention provides a video target detection method based on an attention mechanism, which comprises the following steps:
step S1, inputting the video frame image of the current time point into a MobileNet network to extract a candidate feature map;
step S2, setting a temporal feature fusion window over the past time period adjacent to the current time point; computing the Laplacian variance of each video frame image to be fused in the window; normalizing these variances to obtain the fusion weight of each frame to be fused; forming the temporal feature required by the current frame as the weighted sum of the candidate feature maps of all frames to be fused according to these weights; and concatenating the candidate feature of the current frame with the temporal feature along the channel dimension to obtain the feature map to be detected, fused with temporal information;
step S3, extracting feature maps to be detected at additional scales from the feature map to be detected, using a convolutional feature extraction layer and a max pooling layer;
and step S4, on the feature maps to be detected at different scales, predicting the object class and bounding box coordinates of the current frame using convolutional layers.
Further, in step S1, to detect the video frame at the current time point t, the video frame image $I_t$ (of height $H_I$ and width $W_I$, where $\mathbb{R}$ denotes the real numbers) is first input into a MobileNet network for feature extraction, yielding a candidate feature map $F_t \in \mathbb{R}^{C_1 \times H_1 \times W_1}$, where $C_1$, $H_1$, and $W_1$ are the number of feature channels, the height, and the width of the candidate feature map.
Further, in step S2, a feature fusion window of width $w = s$ is set over the past time period preceding the current time point t. The video frame images to be fused in the feature fusion window are $\{I_{t-i}\}_{i\in[1,s]}$, and the corresponding candidate feature maps are $\{F_{t-i}\}_{i\in[1,s]}$. Each video frame image $I_{t-i}$ to be fused is converted into a grayscale map $G_{t-i}$, and the Laplacian variance of the image is computed on the grayscale map; the Laplacian of the grayscale map $G$ at coordinate $(x,y)$ is $\nabla^2 G(x,y) = \frac{\partial^2 G(x,y)}{\partial x^2} + \frac{\partial^2 G(x,y)}{\partial y^2}$. By computing the second derivative of every pixel in each direction, the Laplacian captures regions of the image where pixel values change rapidly and can be used to detect corners; the Laplacian variance of an image reflects how pixel values vary over the whole image, so a large Laplacian variance indicates a sharp image and a small one a blurred image.
First, the Laplacian mean of each grayscale map $G_{t-i}$ is computed, $\mu_{t-i} = \frac{1}{H_I W_I}\sum_{x,y}\nabla^2 G_{t-i}(x,y)$, where $H_I$ and $W_I$ are the height and width of the grayscale map; the Laplacian variance is then $\sigma^2_{t-i} = \frac{1}{H_I W_I}\sum_{x,y}\left(\nabla^2 G_{t-i}(x,y) - \mu_{t-i}\right)^2$.
If a video frame is sharp, its candidate features help detect objects, whereas frames blurred by moving objects have candidate features that hinder detection. Video frames of different sharpness should therefore be assigned different fusion weights, so that the detection model attends more to sharp features than to blurred ones. The fusion weight of every frame to be fused is first computed by normalizing the Laplacian variances over the window: $\alpha_{t-i} = \sigma^2_{t-i}\big/\sum_{j=1}^{s}\sigma^2_{t-j}$.
The candidate features of the frames in the fusion window are fused by weighted summation to obtain the temporal feature of the current time point, $\sum_{i=1}^{s}\alpha_{t-i}F_{t-i}$, and this temporal feature is concatenated with the candidate feature of the current frame along the channel dimension, completing the fusion of temporal information and yielding the first feature map to be detected (with $2C_1$ channels).
Further, in step S3, after obtaining at the current time point the feature map to be detected in which the temporal feature has been fused, feature maps to be detected at additional scales are produced: a 3×3 convolutional layer and a 2×2 pooling layer perform further feature extraction on the feature map to be detected while reducing its size. Large feature maps are rich in local information and are suited to predicting small objects, while small feature maps carry stronger global semantic information and are suited to detecting large objects. After e-1 rounds of feature extraction, e feature maps to be detected are finally obtained.
Further, in step S4, anchor boxes with prior positions are set on the multi-scale feature maps to be detected obtained by the additional feature extraction, and two 3×3 convolutional layers predict, along the channel dimension of each feature map to be detected, the offsets of the target bounding boxes relative to the anchor boxes and the target classes, respectively. Let the number of classes be d (including background); for each feature map to be detected $F_i \in \mathbb{R}^{C_{F_i}\times H_{F_i}\times W_{F_i}}$ with $n_i$ anchor boxes per pixel position, the 3×3 convolutional class prediction layer yields a classification prediction of shape $(n_i \cdot d)\times H_{F_i}\times W_{F_i}$, and the 3×3 convolutional bounding-box prediction layer yields a bounding-box prediction of shape $(n_i \cdot 4)\times H_{F_i}\times W_{F_i}$.
Drawings
FIG. 1 is a schematic of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views, and merely illustrate the basic structure of the present invention in a schematic manner, and thus show only the constitution related to the present invention.
Example 1
As shown in FIG. 1, the present example provides a video object detection method based on an attention mechanism, comprising the following steps:
Step S1, inputting the video frame image of the current time point into a MobileNet network to extract a candidate feature map;
Step S2, setting a temporal feature fusion window over the past time period adjacent to the current time point; computing the Laplacian variance of each video frame image to be fused in the window; normalizing these variances to obtain the fusion weight of each frame to be fused; forming the temporal feature required by the current frame as the weighted sum of the candidate feature maps of all frames to be fused according to these weights; and concatenating the candidate feature of the current frame with the temporal feature along the channel dimension to obtain the feature map to be detected, fused with temporal information;
Step S3, extracting feature maps to be detected at additional scales from the feature map to be detected, using a convolutional feature extraction layer and a max pooling layer;
Step S4, on the feature maps to be detected at different scales, predicting the object class and bounding box coordinates of the current frame using convolutional layers.
In step S1, to detect the video frame at the current time point t, the video frame image $I_t$ (of height $H_I$ and width $W_I$) is first input into the MobileNet network for feature extraction, yielding a candidate feature map $F_t \in \mathbb{R}^{C_1 \times H_1 \times W_1}$, where $C_1$, $H_1$, and $W_1$ are the number of channels, the height, and the width of the candidate feature map.
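The following is a minimal sketch, under stated assumptions, of how such a candidate feature map could be extracted. The patent only names "Mobilenet"; the choice of torchvision's MobileNetV2 feature extractor and the helper name extract_candidate_feature are illustrative assumptions, not the patented implementation.

```python
# Sketch of step S1: extracting the candidate feature map F_t of the current frame.
# Assumption: torchvision's MobileNetV2 backbone stands in for the "Mobilenet network".
import torch
import torchvision

backbone = torchvision.models.mobilenet_v2().features  # output channels C1 = 1280

def extract_candidate_feature(frame_rgb):
    """frame_rgb: H x W x 3 uint8 numpy video frame -> candidate feature map F_t of shape (C1, H1, W1)."""
    x = torch.from_numpy(frame_rgb).float().permute(2, 0, 1).unsqueeze(0) / 255.0  # (1, 3, H_I, W_I)
    with torch.no_grad():
        f_t = backbone(x)          # (1, C1, H1, W1)
    return f_t.squeeze(0)          # (C1, H1, W1)
```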
In step S2, a feature fusion window of width w is set over the past time period of the current time point t. Let the length of the past time period be q; the window width is set as follows: if q is greater than s, the window width is set to s, and if q is less than s (so that not enough past features are available), the window width is set to q.
The video frame images to be fused in the feature fusion window are $\{I_{t-i}\}_{i\in[1,s]}$, and the corresponding candidate feature maps are $\{F_{t-i}\}_{i\in[1,s]}$. Each video frame image $I_{t-i}$ to be fused is converted into a grayscale map $G_{t-i}$, and the Laplacian variance of the image is computed on the grayscale map. The Laplacian of the grayscale map $G$ at coordinate $(x,y)$ is $\nabla^2 G(x,y) = \frac{\partial^2 G(x,y)}{\partial x^2} + \frac{\partial^2 G(x,y)}{\partial y^2}$, where $G(x,y)$ denotes the pixel value of the grayscale map $G$ at coordinate $(x,y)$. By computing the second derivative of every pixel in each direction, the Laplacian captures regions of the image where pixel values change rapidly and can be used to detect corners; the Laplacian variance of an image reflects how pixel values vary over the whole image, so a large Laplacian variance indicates a sharp image and a small one a blurred image.
First, the Laplacian mean of each grayscale map $G_{t-i}$ is computed, $\mu_{t-i} = \frac{1}{H_I W_I}\sum_{x,y}\nabla^2 G_{t-i}(x,y)$, where $H_I$ and $W_I$ are the height and width of the grayscale map; the Laplacian variance is then $\sigma^2_{t-i} = \frac{1}{H_I W_I}\sum_{x,y}\left(\nabla^2 G_{t-i}(x,y) - \mu_{t-i}\right)^2$.
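A short sketch of this sharpness measure is given below, assuming OpenCV's discrete Laplacian as the concrete operator (the patent does not fix the kernel):

```python
# Sketch of the per-frame sharpness measure of step S2: grayscale conversion followed by
# the variance of the Laplacian response.
import cv2
import numpy as np

def laplacian_variance(frame_bgr: np.ndarray) -> float:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)          # grayscale map G_{t-i}
    lap = cv2.Laplacian(gray.astype(np.float64), cv2.CV_64F)    # second-derivative response
    mu = lap.mean()                                              # Laplacian mean mu_{t-i}
    return float(((lap - mu) ** 2).mean())                       # Laplacian variance sigma^2_{t-i}
```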
If a video frame is sharp, its candidate features help detect objects, whereas frames blurred by moving objects have candidate features that hinder detection. Video frames of different sharpness should therefore be assigned different fusion weights, with sharper frames receiving larger weights, so that the detection model attends more to sharp features than to blurred ones. The fusion weight of every frame to be fused is first computed by normalizing the Laplacian variances over the window: $\alpha_{t-i} = \sigma^2_{t-i}\big/\sum_{j=1}^{s}\sigma^2_{t-j}$.
The candidate features of the frames in the fusion window are fused by weighted summation to obtain the temporal feature of the current time point, $\sum_{i=1}^{s}\alpha_{t-i}F_{t-i}$, and this temporal feature is concatenated with the candidate feature of the current frame along the channel dimension, completing the fusion of temporal information and yielding the first feature map to be detected (with $2C_1$ channels).
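The fusion step, including the window-width rule above, could be sketched as follows. The plain sum-normalization of the variances is an assumption consistent with "normalizing the Laplacian variance" (a softmax would be an alternative reading), and the function name fuse_with_current is illustrative.

```python
# Sketch of the attention-style temporal fusion of step S2.
import torch

def fuse_with_current(f_t, past_feats, past_variances, s=4):
    """
    f_t:            (C1, H1, W1) candidate feature of the current frame
    past_feats:     list of (C1, H1, W1) candidate features F_{t-i}, most recent first
    past_variances: list of Laplacian variances sigma^2_{t-i} of the same frames
    s:              maximum fusion-window width
    """
    w = min(s, len(past_feats))                                # window-width rule of the embodiment
    var = torch.tensor(past_variances[:w], dtype=torch.float32)
    alpha = var / var.sum()                                    # normalized fusion weights alpha_{t-i}
    stacked = torch.stack(past_feats[:w])                      # (w, C1, H1, W1)
    temporal = (alpha.view(w, 1, 1, 1) * stacked).sum(dim=0)   # weighted-sum temporal feature
    return torch.cat([f_t, temporal], dim=0)                   # (2*C1, H1, W1) feature map to be detected
```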
In step S3, after the feature map to be detected fused with the temporal feature has been obtained at the current time point, feature maps to be detected at further scales are produced: the convolutional layer and the pooling layer perform further feature extraction on the feature map to be detected while reducing its size. Large feature maps are rich in local information and are suited to predicting small objects, while small feature maps carry stronger global semantic information and are suited to detecting large objects. After e-1 rounds of feature extraction, e feature maps to be detected are finally obtained.
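A minimal sketch of this multi-scale extraction is given below; the number of scales e and the choice of keeping the channel count constant across rounds are illustrative assumptions, since the patent specifies only the 3×3 convolution and 2×2 pooling per round.

```python
# Sketch of step S3: e-1 rounds of 3x3 convolution + 2x2 max pooling produce e detection maps.
import torch.nn as nn

class ExtraScaleExtractor(nn.Module):
    def __init__(self, in_channels: int, e: int = 4):
        super().__init__()
        self.stages = nn.ModuleList()
        for _ in range(e - 1):                            # e-1 extraction rounds
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2),              # halves the spatial size
            ))

    def forward(self, x):
        # x: (N, 2*C1, H1, W1) fused feature map to be detected
        maps = [x]                                        # the fused map is the first detection map
        for stage in self.stages:
            x = stage(x)
            maps.append(x)
        return maps                                       # e feature maps to be detected
```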
In step S4, anchor boxes with prior positions are set on the multi-scale feature maps to be detected obtained by the additional feature extraction, and two convolutional layers predict, along the channel dimension of each feature map to be detected, the offsets of the target bounding boxes relative to the anchor boxes and the target classes, respectively. Let the number of classes be d (including background); for each feature map to be detected $F_i \in \mathbb{R}^{C_{F_i}\times H_{F_i}\times W_{F_i}}$, where $C_{F_i}$, $H_{F_i}$, and $W_{F_i}$ are the number of channels, height, and width of that feature map and $n_i$ is the number of anchor boxes per pixel position, the convolutional class prediction layer yields a classification prediction of shape $(n_i \cdot d)\times H_{F_i}\times W_{F_i}$ and the convolutional bounding-box prediction layer yields a bounding-box prediction of shape $(n_i \cdot 4)\times H_{F_i}\times W_{F_i}$.
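The prediction heads could be sketched as below; the per-scale channel counts and anchor numbers passed to build_heads are illustrative assumptions, and the function names are hypothetical.

```python
# Sketch of the step S4 heads: per scale, one 3x3 convolution predicts n_i*d class scores
# per position and another predicts n_i*4 bounding-box offsets per position.
import torch.nn as nn

def build_heads(channels_per_scale, anchors_per_scale, num_classes_d):
    cls_heads, box_heads = nn.ModuleList(), nn.ModuleList()
    for c_fi, n_i in zip(channels_per_scale, anchors_per_scale):
        cls_heads.append(nn.Conv2d(c_fi, n_i * num_classes_d, kernel_size=3, padding=1))
        box_heads.append(nn.Conv2d(c_fi, n_i * 4, kernel_size=3, padding=1))
    return cls_heads, box_heads

def predict(feature_maps, cls_heads, box_heads):
    # feature_maps: list of (N, C_Fi, H_Fi, W_Fi) detection maps from step S3
    cls_preds = [head(f) for head, f in zip(cls_heads, feature_maps)]   # (N, n_i*d, H_Fi, W_Fi)
    box_preds = [head(f) for head, f in zip(box_heads, feature_maps)]   # (N, n_i*4, H_Fi, W_Fi)
    return cls_preds, box_preds
```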
Claims (4)
1. A video target detection method based on an attention mechanism is characterized by comprising the following steps:
step S1, inputting the video frame image of the current time point into a MobileNet network to extract a candidate feature map;
step S2, setting a temporal feature fusion window over the past time period adjacent to the current time point; computing the Laplacian variance of each video frame image to be fused in the window; normalizing these variances to obtain the fusion weight of each frame to be fused; forming the temporal feature required by the current frame as the weighted sum of the candidate feature maps of all frames to be fused according to these weights; and concatenating the candidate feature of the current frame with the temporal feature along the channel dimension to obtain the feature map to be detected, fused with temporal information;
step S3, extracting feature maps to be detected at additional scales from the feature map to be detected, using a convolutional feature extraction layer and a max pooling layer;
and step S4, on the feature maps to be detected at different scales, predicting the object class and bounding box coordinates of the current frame using convolutional layers.
2. The attention mechanism-based video object detection method of claim 1,
in step S1, to detect the video frame at the current time point t, the video frame image $I_t$ of the current time point is first input into a MobileNet network for feature extraction to obtain a candidate feature map $F_t$, where $H_I$ and $W_I$ are the height and width of the video frame, $\mathbb{R}$ denotes the real numbers, and the extracted candidate feature map satisfies $F_t \in \mathbb{R}^{C_1 \times H_1 \times W_1}$, with $C_1$, $H_1$, and $W_1$ being the number of feature channels, the height, and the width of the candidate feature map.
3. The attention mechanism-based video object detection method of claim 2,
in step S2, a feature fusion window of width $w = s$ is set over the past time period of the current time point t; the video frame images to be fused in the feature fusion window are $\{I_{t-i}\}_{i\in[1,s]}$, and the corresponding candidate feature maps are $\{F_{t-i}\}_{i\in[1,s]}$; each video frame image $I_{t-i}$ to be fused is converted into a grayscale map $G_{t-i}$;
the Laplacian variance $\sigma^2_{t-i}$ of each grayscale map $G_{t-i}$ is computed, and the fusion weights $\alpha_{t-i}$ of all video frames to be fused are obtained by normalizing the Laplacian variances; the candidate features of the frames in the fusion window are fused by weighted summation to obtain the temporal feature of the current time point, and the temporal feature is concatenated with the candidate feature of the current frame along the channel dimension, completing the fusion of temporal information and yielding the first feature map to be detected.
4. The attention mechanism-based video object detection method of claim 3,
in step S3, after the feature map to be detected fused with the temporal feature has been obtained at the current time point, further feature extraction is performed on it with a 3×3 convolutional layer and a 2×2 pooling layer while its size is reduced; after e-1 rounds of feature extraction, e feature maps to be detected are finally obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910499786.9A CN110287826B (en) | 2019-06-11 | 2019-06-11 | Video target detection method based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910499786.9A CN110287826B (en) | 2019-06-11 | 2019-06-11 | Video target detection method based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110287826A CN110287826A (en) | 2019-09-27 |
CN110287826B true CN110287826B (en) | 2021-09-17 |
Family
ID=68003699
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910499786.9A Active CN110287826B (en) | 2019-06-11 | 2019-06-11 | Video target detection method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287826B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674886B (en) * | 2019-10-08 | 2022-11-25 | 中兴飞流信息科技有限公司 | Video target detection method fusing multi-level features |
CN110751646A (en) * | 2019-10-28 | 2020-02-04 | 支付宝(杭州)信息技术有限公司 | Method and device for identifying damage by using multiple image frames in vehicle video |
CN111310609B (en) * | 2020-01-22 | 2023-04-07 | 西安电子科技大学 | Video target detection method based on time sequence information and local feature similarity |
CN114450720A (en) * | 2020-08-18 | 2022-05-06 | 深圳市大疆创新科技有限公司 | Target detection method and device and vehicle-mounted radar |
CN112016472B (en) * | 2020-08-31 | 2023-08-22 | 山东大学 | Driver attention area prediction method and system based on target dynamic information |
CN112434607B (en) * | 2020-11-24 | 2023-05-26 | 北京奇艺世纪科技有限公司 | Feature processing method, device, electronic equipment and computer readable storage medium |
CN112686913B (en) * | 2021-01-11 | 2022-06-10 | 天津大学 | Object boundary detection and object segmentation model based on boundary attention consistency |
CN112561001A (en) * | 2021-02-22 | 2021-03-26 | 南京智莲森信息技术有限公司 | Video target detection method based on space-time feature deformable convolution fusion |
CN113688801B (en) * | 2021-10-22 | 2022-02-15 | 南京智谱科技有限公司 | Chemical gas leakage detection method and system based on spectrum video |
CN114594770B (en) * | 2022-03-04 | 2024-04-26 | 深圳市千乘机器人有限公司 | Inspection method for inspection robot without stopping |
CN115131710B (en) * | 2022-07-05 | 2024-09-03 | 福州大学 | Real-time action detection method based on multiscale feature fusion attention |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102393958A (en) * | 2011-07-16 | 2012-03-28 | 西安电子科技大学 | Multi-focus image fusion method based on compressive sensing |
CN105913404A (en) * | 2016-07-01 | 2016-08-31 | 湖南源信光电科技有限公司 | Low-illumination imaging method based on frame accumulation |
CN107481238A (en) * | 2017-09-20 | 2017-12-15 | 众安信息技术服务有限公司 | Image quality measure method and device |
CN108921803A (en) * | 2018-06-29 | 2018-11-30 | 华中科技大学 | A kind of defogging method based on millimeter wave and visual image fusion |
CN109104568A (en) * | 2018-07-24 | 2018-12-28 | 苏州佳世达光电有限公司 | The intelligent cleaning driving method and drive system of monitoring camera |
CN109684912A (en) * | 2018-11-09 | 2019-04-26 | 中国科学院计算技术研究所 | A kind of video presentation method and system based on information loss function |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103152513B (en) * | 2011-12-06 | 2016-05-25 | 瑞昱半导体股份有限公司 | Image processing method and relevant image processing apparatus |
CN103702032B (en) * | 2013-12-31 | 2017-04-12 | 华为技术有限公司 | Image processing method, device and terminal equipment |
US10395118B2 (en) * | 2015-10-29 | 2019-08-27 | Baidu Usa Llc | Systems and methods for video paragraph captioning using hierarchical recurrent neural networks |
US10169656B2 (en) * | 2016-08-29 | 2019-01-01 | Nec Corporation | Video system using dual stage attention based recurrent neural network for future event prediction |
CN109829398B (en) * | 2019-01-16 | 2020-03-31 | 北京航空航天大学 | Target detection method in video based on three-dimensional convolution network |
Non-Patent Citations (2)
Title |
---|
Infrared dim target detection based on visual attention;Xin Wang;《Infrared Physics & Technology》;20121130;513-521 * |
Image sharpness evaluation algorithm based on lifting wavelet transform; Wang Xin; Wanfang Data Knowledge Service Platform; 20100322; 52-57 *
Also Published As
Publication number | Publication date |
---|---|
CN110287826A (en) | 2019-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110287826B (en) | Video target detection method based on attention mechanism | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN108830285B (en) | Target detection method for reinforcement learning based on fast-RCNN | |
Zhou et al. | Efficient road detection and tracking for unmanned aerial vehicle | |
US9042648B2 (en) | Salient object segmentation | |
CN110738673A (en) | Visual SLAM method based on example segmentation | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN113591968A (en) | Infrared weak and small target detection method based on asymmetric attention feature fusion | |
CN108564598B (en) | Improved online Boosting target tracking method | |
CN111723693A (en) | Crowd counting method based on small sample learning | |
CN110942471A (en) | Long-term target tracking method based on space-time constraint | |
CN112232371A (en) | American license plate recognition method based on YOLOv3 and text recognition | |
CN112883850A (en) | Multi-view aerospace remote sensing image matching method based on convolutional neural network | |
Lu et al. | Superthermal: Matching thermal as visible through thermal feature exploration | |
CN114299383A (en) | Remote sensing image target detection method based on integration of density map and attention mechanism | |
CN117949942B (en) | Target tracking method and system based on fusion of radar data and video data | |
CN115147418B (en) | Compression training method and device for defect detection model | |
CN111723660A (en) | Detection method for long ground target detection network | |
CN111833353B (en) | Hyperspectral target detection method based on image segmentation | |
CN109377511A (en) | Motion target tracking method based on sample combination and depth detection network | |
CN113496480A (en) | Method for detecting weld image defects | |
CN111414938B (en) | Target detection method for bubbles in plate heat exchanger | |
CN116342894A (en) | GIS infrared feature recognition system and method based on improved YOLOv5 | |
CN116645592A (en) | Crack detection method based on image processing and storage medium | |
CN116758340A (en) | Small target detection method based on super-resolution feature pyramid and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||