CN109993095B - Frame level feature aggregation method for video target detection - Google Patents

Frame level feature aggregation method for video target detection

Info

Publication number
CN109993095B
CN109993095B (application CN201910230227.8A)
Authority
CN
China
Prior art keywords
frame
level
feature
weight
features
Prior art date
Legal status
Active
Application number
CN201910230227.8A
Other languages
Chinese (zh)
Other versions
CN109993095A (en)
Inventor
张斌
柳波
郭军
刘晨
王嘉怡
李薇
张娅杰
王馨悦
刘文凤
陈文博
侯帅
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201910230227.8A
Publication of CN109993095A
Application granted
Publication of CN109993095B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides a frame-level feature aggregation method for video object detection, and relates to the technical field of computer vision. The method first extracts deep features from each single-frame image through a feature network; it then extracts the optical flow between frames with the optical flow network FlowNet and aligns the frame-level features of the adjacent frames to the current frame based on the optical flow, realizing frame-level feature propagation; finally, it computes scaled cosine similarity weights through a mapping network and a weight scaling network and aggregates the multi-frame features with these weights to generate the aggregated features. The method makes the weight assignment more reasonable, and by feeding the aggregated features into a video object detection network it achieves good detection accuracy and robustness in complex scenes such as motion blur, low resolution, camera zoom and occlusion.

Description

Frame level feature aggregation method for video target detection
Technical Field
The invention relates to the technical field of computer vision, and in particular to a frame-level feature aggregation method for video object detection.
Background
In recent years, video object detection has developed rapidly with the rise of deep learning. Since object detection in video has more temporal context and motion information than object detection in a single image, many studies have sought to exploit this information to improve the performance of video object detection. A video object detection method automatically analyzes and processes the video sequence acquired by a camera so as to detect, classify, identify and track the moving objects in the monitored scene. Most existing feature-level video object detection methods use some form of frame-level feature aggregation. The purpose of frame-level feature aggregation is to exploit the temporal and motion information between video frames to improve detection accuracy: the frame-level features of adjacent frames are propagated to the current frame and aggregated with the current-frame features by weighting, giving the current-frame features stronger representation ability. However, whether it is MANet, which directly averages multi-frame features, or the existing FGFA, which uses the cosine similarity between the propagated features and the current-frame features as the aggregation weight, the weight assignment does not take the appearance-quality distribution of the video frames into account, and pixel-level weights with few parameters are lacking. In complex scenes such as motion blur, low resolution, camera zoom and occlusion, these methods cannot accurately detect moving objects, so the false detection rate and missed detection rate of video object detection are high.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a frame-level feature aggregation method for video object detection, which is used for implementing frame-level feature aggregation on a video object.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a frame-level feature aggregation method for video object detection comprises the following steps:
step 1: extracting frame level features;
extracting deep features from the single-frame image by using ResNet-101 as a feature network of the whole detection frame;
Given a current frame i of the video and two adjacent frames i-t and i+t, where t is the frame interval, the image data I_i, I_{i-t} and I_{i+t} of the three frames are input into the feature network
N_feat, and the output frame-level features are:

f_i = N_feat(I_i)    (1)

f_{i-t} = N_feat(I_{i-t})    (2)

f_{i+t} = N_feat(I_{i+t})    (3)

wherein f_i, f_{i-t} and f_{i+t} respectively denote the frame-level features of the current frame i and of the two adjacent frames i-t and i+t;
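As an illustrative sketch of step 1 (an assumption-laden example, not the patent's exact implementation), the following Python/PyTorch snippet extracts frame-level features for the three frames with a ResNet-101 backbone truncated before its classification head; FeatureNet and feat_net are placeholder names, and the modifications to res5 described in the embodiment below (stride 1, dilated convolution down to 1024 channels) are only noted in a comment.

```python
import torch
import torch.nn as nn
import torchvision

class FeatureNet(nn.Module):
    """N_feat: ResNet-101 backbone used as the frame-level feature extractor.

    Sketch only: the patent additionally sets the res5 stride to 1 and appends a
    dilated 3x3 convolution reducing the output to 1024 channels; here the
    standard torchvision ResNet-101 is simply truncated before avgpool/fc.
    """
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # keep conv1..layer4

    def forward(self, image):          # image: (B, 3, H, W)
        return self.body(image)        # features: (B, 2048, H/32, W/32)

feat_net = FeatureNet().eval()
# I_i, I_{i-t}, I_{i+t}: current frame and its two neighbours at frame interval t
I_i, I_im, I_ip = (torch.randn(1, 3, 600, 1000) for _ in range(3))
with torch.no_grad():
    f_i, f_im, f_ip = feat_net(I_i), feat_net(I_im), feat_net(I_ip)   # Eqs. (1)-(3)
```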
and 2, step: extracting an optical flow between frames by using a fully-convoluted optical flow network FlowNet;
Given the input images I_i and I_{i-t} of the current frame i and the adjacent frame i-t, the output optical flow is:
M_{i-t→i} = F(I_i, I_{i-t})    (4)

wherein F denotes the optical flow network FlowNet and M_{i-t→i} denotes the optical flow between frame i and frame i-t;

Given the input images I_i and I_{i+t} of the current frame i and the adjacent frame i+t, the output optical flow is:

M_{i+t→i} = F(I_i, I_{i+t})    (5)

wherein M_{i+t→i} denotes the optical flow between frame i and frame i+t;
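The sketch below only fixes the calling convention of step 2; FlowNetS here is a stand-in module (a single convolution), since the real FlowNet "simple" network is a full encoder-decoder pre-trained on Flying Chairs and is not reproduced in the patent text.

```python
import torch
import torch.nn as nn

class FlowNetS(nn.Module):
    """Placeholder for the FlowNet 'simple' optical-flow network of Eqs. (4)-(5):
    two RGB frames stacked along the channel axis in, a 2-channel (dx, dy)
    flow field out. The single convolution below is NOT the real architecture."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 2, kernel_size=7, padding=3)

    def forward(self, img_ref, img_src):
        return self.net(torch.cat([img_ref, img_src], dim=1))   # (B, 2, H, W)

flow_net = FlowNetS()
I_i, I_im, I_ip = (torch.randn(1, 3, 600, 1000) for _ in range(3))
M_im_to_i = flow_net(I_i, I_im)   # optical flow between frame i and frame i-t, Eq. (4)
M_ip_to_i = flow_net(I_i, I_ip)   # optical flow between frame i and frame i+t, Eq. (5)
```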
and 3, step 3: aligning the frame level features of the adjacent frames to the current frame based on the optical flow, namely realizing the propagation of the frame level features;
Given the frame-level feature f_{i-t} of the adjacent frame i-t and the optical flow M_{i-t→i} between it and the current frame, the features propagated from frame i-t to frame i are:
f_{i-t→i} = Warp(f_{i-t}, M_{i-t→i})    (6)

wherein f_{i-t→i} denotes the frame-level features propagated from frame i-t to the current frame i, and Warp(·,·) denotes the warp mapping function, which maps the value at each position p of the feature f_{i-t} of frame i-t to the corresponding position p + δp of the current frame i, δp denoting the position offset; the warp mapping is implemented by bilinear sampling, and for a given channel c of the feature f_{i-t}, the mapped feature at position p is:

f^c_{i-t→i}(p) = Σ_q G(q, p + δp) · f^c_{i-t}(q)    (7)

wherein p = (p_x, p_y) denotes the coordinates of position p, δp = M_{i-t→i}(p), q enumerates all spatial locations of the feature, and G(·,·) denotes the bilinear interpolation kernel; since the optical flow M_{i-t→i} is a two-channel offset in the x and y directions, G(·,·) is two-dimensional and is split into two corresponding one-dimensional interpolation kernels, as shown in the following equation:

G(q, p + δp) = g(q_x, p_x + δp_x) · g(q_y, p_y + δp_y)    (8)

wherein q = (q_x, q_y) denotes the coordinates of position q and g(·,·) is the one-dimensional bilinear interpolation function;

Similarly, given the frame-level feature f_{i+t} of the adjacent frame i+t and the optical flow M_{i+t→i} between it and the current frame, the features propagated from frame i+t to frame i are:

f_{i+t→i} = Warp(f_{i+t}, M_{i+t→i})    (9)

wherein f_{i+t→i} denotes the frame-level features propagated from frame i+t to the current frame i;
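A minimal sketch of the warp of step 3, assuming the flow has already been resized to the feature resolution and is expressed in feature-grid pixels with (dx, dy) channel order; torch.nn.functional.grid_sample supplies the bilinear kernel of Eqs. (7)-(8) (g(a, b) = max(0, 1 - |a - b|)). warp_features is an illustrative name.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_src, flow):
    """Warp the frame-level features of an adjacent frame onto the current frame,
    Eq. (6)/(9).

    feat_src: (B, C, H, W) features f_{i-t} (or f_{i+t})
    flow:     (B, 2, H, W) offsets (dx, dy) at feature resolution, i.e. delta p
    Returns the propagated features f_{i-t->i}; grid_sample performs the
    bilinear sampling of Eqs. (7)-(8).
    """
    b, _, h, w = feat_src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat_src)   # (2, H, W), p = (x, y)
    coords = base.unsqueeze(0) + flow                          # p + delta p
    # normalize sampling coordinates to [-1, 1] as grid_sample expects
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (B, H, W, 2)
    return F.grid_sample(feat_src, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

f_im = torch.randn(1, 1024, 38, 63)            # stand-in for f_{i-t}
M_im_to_i = torch.randn(1, 2, 38, 63)          # stand-in flow at feature resolution
f_im_to_i = warp_features(f_im, M_im_to_i)     # f_{i-t -> i}
```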
and 4, step 4: aggregating the multi-frame features by using the scaled cosine similarity weight to generate aggregated features;
The frame-level feature f_i of the current frame i, the feature f_{i-t→i} propagated from frame i-t to frame i, and the feature f_{i+t→i} propagated from frame i+t to frame i are aggregated to obtain the aggregated frame-level feature, according to the following formula:
f̄_i = Σ_{j=i-T}^{i+T} w_{j→i} ⊙ f_{j→i}    (10)

wherein f̄_i denotes the aggregated frame-level feature, w_{j→i} denotes the frame-level aggregation weight, ⊙ denotes element-wise multiplication, and T is the maximum frame interval of the aggregation;
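A sketch of the aggregation of formula (10), assuming the 2T+1 per-frame weights have already been computed and SoftMax-normalized at every channel and position (as described next); aggregate_features is an illustrative name.

```python
import torch

def aggregate_features(propagated, weights):
    """Weighted frame-level aggregation, formula (10).

    propagated: list of 2T+1 tensors f_{j->i}, each (B, C, H, W); the current
                frame contributes its own feature f_i (identity propagation).
    weights:    list of 2T+1 tensors w_{j->i}, each (B, C, H, W), normalized
                over the frame dimension so they sum to 1 at every location.
    """
    agg = torch.zeros_like(propagated[0])
    for f_j, w_j in zip(propagated, weights):
        agg = agg + w_j * f_j            # element-wise weighting and accumulation
    return agg
```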
the calculation method of the frame level aggregation weight comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
using a shallow mapping network
E, the features are mapped into a dimension dedicated to computing similarity, as shown in the following equations:

f^e_i = E(f_i)    (11)

f^e_{i-t→i} = E(f_{i-t→i})    (12)

wherein f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and E denotes the mapping network;

Given the current-frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

w_{i-t→i}(p) = ( f^e_{i-t→i}(p) · f^e_i(p) ) / ( |f^e_{i-t→i}(p)| · |f^e_i(p)| )    (13)
the weights output by the formula (13) are summed along the channel, so that the dimensionality of the output weights is changed into a two-dimensional matrix, the dimensionality is W multiplied by H, and W and H are respectively the width and the height of the features, so that the number of weight parameters needing to be learned is reduced, and the network is easier to train;
(2) Directly extracting a scaling factor from the appearance features of the video frames to model the quality distribution of the video frames, obtaining the frame-level scaled cosine similarity weight, which is taken as the frame-level aggregation weight of step 4;
Given the current-frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling network
S outputs the weight scaling factor:

λ_{i-t} = S(f_i, f_{i-t→i})    (14)

Since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a 2-dimensional matrix over the spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the value at each spatial position p is computed as:

w̃^c_{i-t→i}(p) = λ^c_{i-t} ⊗ w_{i-t→i}(p)    (15)

wherein ⊗ denotes multiplication at the channel level; accordingly, the scaled weight of the feature propagated from frame i+t is:

w̃^c_{i+t→i}(p) = λ^c_{i+t} ⊗ w_{i+t→i}(p)    (16)

The scaled cosine similarity weights are thus obtained through formulas (14)-(16); finally, the weights at each position p are normalized across the multiple frames such that Σ_{j=i-T}^{i+T} w_{j→i}(p) = 1, the normalization being performed with a SoftMax function;
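A sketch of formulas (14)-(16) and the SoftMax normalization, assuming the channel-level scaling factor has already been produced by the weight scaling network; scale_weight and normalize_over_frames are illustrative names.

```python
import torch
import torch.nn.functional as F

def scale_weight(cos_weight, scale_vec):
    """Combine the channel-level scaling factor with the 2-D cosine weight,
    formulas (15)-(16).

    cos_weight: (B, 1, H, W) cosine-similarity map w_{j->i}
    scale_vec:  (B, C) channel-level scaling factor lambda_j
    Returns pixel-level weights of shape (B, C, H, W).
    """
    return scale_vec[:, :, None, None] * cos_weight        # channel x spatial broadcast

def normalize_over_frames(scaled_weights):
    """SoftMax-normalize the weights across the 2T+1 frames so that they sum to 1
    at every channel and spatial position."""
    stacked = torch.stack(scaled_weights, dim=0)            # (2T+1, B, C, H, W)
    return list(F.softmax(stacked, dim=0).unbind(dim=0))
```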
the mapping network and the weight scaling network share the first two layers, two continuous convolution layers of 1 multiplied by 1 convolution and 3 multiplied by 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure BDA00020064644100000313
The second branch is also 1 multiplied by 1 convolution, and then is connected with a global average pooling layer to be used as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of ResNet-101 output feature vectors for measuring the importance degree of features and controllingScaling the feature time aggregation weights.
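The following module sketches the shared-trunk mapping and weight-scaling subnets described above. Only the 1024-dimensional input and the 1024-dimensional scaling output are fixed by the text; the intermediate width (512), the embedding width (2048) and the ReLU nonlinearities are assumptions, and how f_i and f_{i-t→i} are combined before the scaling branch (formula (14) takes both) is left open here.

```python
import torch
import torch.nn as nn

class WeightSubnets(nn.Module):
    """Shared trunk (1x1 conv then 3x3 conv) with two branches:
    branch 1: 1x1 conv -> embedded features for the cosine similarity (mapping network E);
    branch 2: 1x1 conv -> global average pooling -> 1024-d channel scaling vector (network S).
    Channel widths other than the 1024-d input/output are assumptions."""
    def __init__(self, in_ch=1024, mid_ch=512, emb_ch=2048):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.embed = nn.Conv2d(mid_ch, emb_ch, kernel_size=1)      # mapping branch
        self.scale = nn.Sequential(                                # weight-scaling branch
            nn.Conv2d(mid_ch, in_ch, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, feat):                  # feat: (B, 1024, H, W)
        x = self.trunk(feat)
        f_emb = self.embed(x)                 # features fed to the cosine similarity
        lam = self.scale(x).flatten(1)        # (B, 1024) channel-level scaling factor
        return f_emb, lam

subnets = WeightSubnets()
f_emb, lam = subnets(torch.randn(1, 1024, 38, 63))
```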
The beneficial effects of the above technical solution are as follows: the frame-level feature aggregation method for video object detection provided by the invention is a scaled cosine-similarity weight aggregation method that models the video frame quality and the optical flow quality simultaneously, so the weight assignment is more reasonable; the pixel-level weights are generated by combining a channel-level weight scaling factor with the 2-dimensional cosine similarity weights, so pixel-level weights are learned without increasing the order of magnitude of the parameters; and by feeding the aggregated features into a video object detection network, good detection accuracy and robustness are obtained in complex scenes such as motion blur, low resolution, camera zoom and occlusion.
Drawings
Fig. 1 is a flowchart of a frame-level feature aggregation method for video object detection according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a frame-level feature aggregation process according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a process of extracting a scaled cosine similarity weight according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a FlowNet optical flow network according to an embodiment of the present invention;
FIG. 5 is a comparison chart of cosine similarity weight distributions of different methods according to an embodiment of the present invention;
fig. 6 is a visualization diagram of a distribution of weight scaling factors according to an embodiment of the present invention;
FIG. 7 is a comparison of normalized FGFA weights and scaled cosine similarity weights provided by an embodiment of the present invention;
fig. 8 is a video example of the method of the present invention for improving the target detection accuracy of video compared to FGFA according to the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.
In this embodiment, taking a video data set as an example, the frame-level feature aggregation method for video object detection is used to aggregate the frame-level features of the video data;
a frame-level feature aggregation method for video object detection, as shown in fig. 1 and fig. 2, includes the following steps:
step 1: extracting frame level features;
extracting deep features from the single-frame image by using ResNet-101 as a feature network of the whole detection frame;
Given a current frame i and two adjacent frames i-t and i+t, where t is the frame interval, the image data I_i, I_{i-t} and I_{i+t} of the three frames are input into the feature network
N_feat, and the output frame-level features are:

f_i = N_feat(I_i)    (1)

f_{i-t} = N_feat(I_{i-t})    (2)

f_{i+t} = N_feat(I_{i+t})    (3)

wherein f_i, f_{i-t} and f_{i+t} respectively denote the frame-level features of the current frame i and of the two adjacent frames i-t and i+t;
step 2: extracting an optical flow between frames by using a fully-convoluted optical flow network FlowNet;
Given the input images I_i and I_{i-t} of the current frame i and the adjacent frame i-t, the output optical flow is:
M_{i-t→i} = F(I_i, I_{i-t})    (4)

wherein F denotes the optical flow network FlowNet and M_{i-t→i} denotes the optical flow between frame i and frame i-t;

Given the input images I_i and I_{i+t} of the current frame i and the adjacent frame i+t, the output optical flow is:

M_{i+t→i} = F(I_i, I_{i+t})    (5)

wherein M_{i+t→i} denotes the optical flow between frame i and frame i+t;
and step 3: aligning the frame level features of the adjacent frames to the current frame based on the optical flow, namely realizing the propagation of the frame level features;
Given the frame-level feature f_{i-t} of the adjacent frame i-t and the optical flow M_{i-t→i} between it and the current frame, the features propagated from frame i-t to frame i are:
f_{i-t→i} = Warp(f_{i-t}, M_{i-t→i})    (6)

wherein f_{i-t→i} denotes the frame-level features propagated from frame i-t to the current frame i, and Warp(·,·) denotes the warp mapping function, which maps the value at each position p of the feature f_{i-t} of frame i-t to the corresponding position p + δp of the current frame i, δp denoting the position offset; the warp mapping is implemented by bilinear sampling, and for a given channel c of the feature f_{i-t}, the mapped feature at position p is:

f^c_{i-t→i}(p) = Σ_q G(q, p + δp) · f^c_{i-t}(q)    (7)

wherein p = (p_x, p_y) denotes the coordinates of position p, δp = M_{i-t→i}(p), q enumerates all spatial locations of the feature, and G(·,·) denotes the bilinear interpolation kernel; since the optical flow M_{i-t→i} is a two-channel offset in the x and y directions, G(·,·) is two-dimensional and is split into two corresponding one-dimensional interpolation kernels, as shown in the following equation:

G(q, p + δp) = g(q_x, p_x + δp_x) · g(q_y, p_y + δp_y)    (8)

wherein q = (q_x, q_y) denotes the coordinates of position q and g(·,·) is the one-dimensional bilinear interpolation function;

Similarly, given the frame-level feature f_{i+t} of the adjacent frame i+t and the optical flow M_{i+t→i} between it and the current frame, the features propagated from frame i+t to frame i are:

f_{i+t→i} = Warp(f_{i+t}, M_{i+t→i})    (9)

wherein f_{i+t→i} denotes the frame-level features propagated from frame i+t to the current frame i;
and 4, step 4: aggregating the multi-frame features by using the scaled cosine similarity weight to generate aggregated features;
The frame-level feature f_i of the current frame i, the feature f_{i-t→i} propagated from frame i-t to frame i, and the feature f_{i+t→i} propagated from frame i+t to frame i are aggregated to obtain the aggregated frame-level feature, according to the following formula:
f̄_i = Σ_{j=i-T}^{i+T} w_{j→i} ⊙ f_{j→i}    (10)

wherein f̄_i denotes the aggregated frame-level feature, w_{j→i} denotes the frame-level aggregation weight, i.e. the scaled cosine similarity weight, ⊙ denotes element-wise multiplication, and T is the maximum frame interval of the aggregation;
the process of extracting the scaled cosine similarity weights is shown in fig. 3, and specifically includes:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
using a shallow mapping network
E, the features are mapped into a dimension dedicated to computing similarity, as shown in the following equations:

f^e_i = E(f_i)    (11)

f^e_{i-t→i} = E(f_{i-t→i})    (12)

wherein f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and E denotes the mapping network;

Given the current-frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

w_{i-t→i}(p) = ( f^e_{i-t→i}(p) · f^e_i(p) ) / ( |f^e_{i-t→i}(p)| · |f^e_i(p)| )    (13)
the weights output by the formula (14) are summed along the channel, so that the dimensionality of the output weights is changed into a two-dimensional matrix, the dimensionality is W multiplied by H, and W and H are respectively the width and the height of the features, so that the number of weight parameters needing to be learned is reduced, and the network is easier to train;
(2) Directly extracting a scaling factor from the appearance features of the video frames to model the quality distribution of the video frames, obtaining the frame-level scaled cosine similarity weight, which is taken as the frame-level aggregation weight;
Given the current-frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling network
S outputs the weight scaling factor:

λ_{i-t} = S(f_i, f_{i-t→i})    (14)

Since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a 2-dimensional matrix over the spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the value at each spatial position p is computed as:

w̃^c_{i-t→i}(p) = λ^c_{i-t} ⊗ w_{i-t→i}(p)    (15)

wherein ⊗ denotes multiplication at the channel level; accordingly, the scaled weight of the feature propagated from frame i+t is:

w̃^c_{i+t→i}(p) = λ^c_{i+t} ⊗ w_{i+t→i}(p)    (16)

The scaled cosine similarity weights are thus obtained through formulas (14)-(16); finally, the weights at each position p are normalized across the multiple frames such that Σ_{j=i-T}^{i+T} w_{j→i}(p) = 1, the normalization being performed with a SoftMax function;
the mapping network and weight scalingThe first two layers are shared by the network, two continuous convolution layers of 1 multiplied by 1 convolution and 3 multiplied by 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure BDA00020064644100000613
The second branch is also 1 × 1 convolution, and then is connected with a global average pooling layer to serve as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector, so as to measure the importance degree of the features and control the scaling of the feature time aggregation weight.
In this embodiment, the data set used is the large-scale video object detection benchmark ImageNet VID. The VID data set includes 3862 training-set videos, on which the model is trained, and 555 validation-set videos, on which the performance of the model is evaluated. The training data set is fully labeled, each video has a frame rate of 25 fps or 30 fps, and the data set contains 30 classes, a subset of the ImageNet DET classes. The VID validation set is divided into three parts according to the motion speed of the objects, measured by the IoU between the ground-truth boxes of adjacent frames: slow (IoU > 0.9), medium (0.7 ≤ IoU ≤ 0.9) and fast (IoU < 0.7); this embodiment is evaluated on all three subsets.
In this embodiment, the optical flow network is the popular FlowNet network (the "simple" version), pre-trained on the Flying Chairs data set. The FlowNet structure is shown in fig. 4, where pooling denotes a pooling layer whose parameters are, from left to right, window size, stride and pooling type, and conv denotes a convolution layer whose parameters are, from left to right, padding size, convolution kernel size, stride and number of convolution kernels. To match the optical flow size with the feature size, this embodiment downsamples the optical flow output by FlowNet with an average pooling layer with a window size of 2 and a stride of 2.
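A sketch of the flow downsampling mentioned above. Only the 2×2 average pooling with stride 2 is stated in the text; halving the flow magnitudes together with the spatial size (the 0.5 factor) is a common convention and is an added assumption here.

```python
import torch
import torch.nn.functional as F

def downsample_flow(flow):
    """Downsample the FlowNet output to the feature resolution with a 2x2
    average pooling of stride 2; the 0.5 factor rescales the (dx, dy) offsets
    to the coarser grid (an assumption, not stated in the text)."""
    return 0.5 * F.avg_pool2d(flow, kernel_size=2, stride=2)

flow = torch.randn(1, 2, 76, 126)          # stand-in FlowNet output
flow_small = downsample_flow(flow)         # (1, 2, 38, 63)
```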
In this embodiment, the feature network is the mainstream deep residual network ResNet-101, pre-trained on the ImageNet 2012 classification data set. The pre-trained ResNet-101 model is transferred to the detection task: the last average pooling layer and the fully connected layer are removed and the convolution layers are retained; to preserve the feature resolution, the stride of the convolution layers of the last module (res5) is set to 1, reducing the output stride of the last residual block from 32 to 16; finally, a dilated convolution (kernel size 3, 1024 convolution kernels) is appended to the output of the network to reduce the number of output feature channels to 1024, and the 1024-dimensional output is split along the channel dimension into two 512-dimensional vectors.
In this embodiment, the candidate region extraction network RPN is used to add two branch convolution layers to the first 512 channels of the 1024 channel feature map output by the ResNet-101, and perform anchor frame classification and bounding box regression, respectively. This embodiment uses 9 anchor frames, corresponding to 3 sizes and 3 aspect ratios, respectively. Non-maximum suppression (NMS) with a IoU threshold of 0.7 is applied during training and testing to select 300 candidate regions for training and testing the R-FCN detector.
During the training phase, this embodiment trains on both the ImageNet DET training set, from which only the images whose classes coincide with the 30 VID classes are taken, and the ImageNet VID training set; the VID validation set is used only for testing. ResNet-101 is pre-trained on the ImageNet CLS data set, FlowNet is pre-trained on the Flying Chairs data set, and both are then fine-tuned on the VID training set.
During training, due to the limitation of memory, the present embodiment only uses three frames for simultaneous training, the frame interval is within 10, the previous frame and the next frame are used as adjacent frames, the intermediate frame is the current frame, and for the DET data set, the current frame and the adjacent frames are the same picture. Only the loss of the current frame is calculated, the classification loss is multi-class cross entropy loss, and the regression loss is smooth L1 loss. In order to improve the convergence speed of the loss function, the embodiment uses an OHEM method to perform back propagation on only the first 128 RoIs with the largest loss when calculating the R-FCN detection loss.
During testing, this embodiment aggregates the features of the 9 frames before and the 9 frames after the current frame to enhance the current-frame features. To increase the detection speed, the features of 18 frames are cached in memory during each detection.
This example was trained on a 1080Ti GPU using stochastic gradient descent with momentum, with a batch size of 1. About 440,000 iterations were performed in total, with an initial learning rate of 0.00025; the learning rate was decayed once, by a factor of 0.1, at iteration 300,000. In both training and testing, the short edge of the input image is resized to 600 pixels and the long edge to at most 1000 pixels.
This embodiment also compares the proposed method with multiple frame-level feature aggregation methods. To ensure a rigorous comparison, all methods use ResNet-101 as the feature network with the same training strategy, and all models are tested after 2 epochs of training. The test results are shown in Table 1, which compares the accuracy of different frame-level feature aggregation methods on the ImageNet VID validation set.
TABLE 1 comparison of accuracy of different methods on ImageNet VID validation set
Method (a): single-frame baseline (no aggregation) — overall mAP 68.5%
Method (b): mean-value aggregation — overall mAP 71.9%
Method (c): cosine similarity weights (FGFA) — overall mAP 72.1%
Method (d): scaled cosine similarity weights (the invention) — overall mAP 72.9%
(overall accuracies as reported in the text; the original table image also reports per-motion-speed results and runtime)
Method (a) in the table represents a single frame detection baseline, no feature aggregation, and only current frame feature detection, corresponding to the R-FCN method on still images. The detection speed of the single-frame detection baseline is high because the features of other frames do not need to be extracted, but the detection accuracy is poor due to unaggregated frame level features, and is only 68.5%;
the method (b) in the table shows that the characteristics are aggregated by using a mean value method, and the pixel level correction method corresponding to the MANet full-motion perception network detection method is simple, the optical flow quality and the video frame quality are not modeled, the error of the propagation characteristics of a fast-moving object is large, and the quality of the characteristics of a current frame is reduced by direct averaging, so that the target detection performance with high motion speed is reduced, but the characteristics of multiple frames are aggregated, and the detection accuracy is improved to 71.9%;
the method (c) in the table represents a method using cosine similarity as the weight of the aggregation feature, and corresponds to the FGFA, because the cosine similarity models the optical flow quality, for an object with a faster motion speed, the worse optical flow quality, the lower weight is allocated, so that the detection performance of the object with a fast motion speed is greatly improved, and meanwhile, the overall detection performance is improved to 72.1%, but because the cosine similarity weight extracted by a mapping network is increased, the running time is increased;
the method (d) in the table represents a frame-level feature aggregation method based on the scaled cosine similarity weight, which is proposed by the present invention, using the scaled cosine similarity as the weight. As can be seen from Table 1, the method of the present invention improves the detection performance of the objects with the moving speeds of medium and slow, and improves the overall performance to 72.9%, which indicates that the detection performance can be improved well when the optical flow quality is good. The weight scaling network of the present invention shares most of the computation and therefore the run-time increase is small.
In order to more intuitively show the effect that the scaling cosine similarity weight makes the weight distribution more reasonable, the weight distribution in multiple frames is visualized by the embodiment.
First, the present embodiment visualizes the cosine similarity weight of the FGFA with the cosine similarity calculated in the weighting method of the present invention, as shown in fig. 5. It can be observed from fig. 5 that the cosine similarity weight remains maximum throughout the current frame and decreases gradually forward and backward along the time dimension, illustrating that the optical flow propagates more errors for frames with longer intervals. However, the cosine similarity weight calculated by the weighting method of the present invention is generally higher than that of the FGFA method (basically all the cosine similarity weights are kept above 0.9), which indicates that the method of the present invention can improve the quality of the optical flow, so that the similarity between the propagation feature and the current frame feature is larger.
Secondly, in order to verify that the weight scaling factor proposed by the present invention indeed models the quality of the video frame, the present embodiment also visualizes the original video frame corresponding to the scaling factor, as shown in fig. 6. As can be seen from the observation of FIG. 6, the quality distribution of the video frame is very uneven, most frames in the 19-frame image have occlusion problems (see two upper right images in FIG. 5), and the occlusion problem of the 1 st frame is lighter, so the quality evaluation is higher, which illustrates that the method of the invention models the quality of the video frame to a certain extent.
Finally, this embodiment visualizes the normalized FGFA weights and the scaled cosine similarity weights as shown in fig. 7. As can be seen from fig. 7, the weight variance of the present invention is smaller, and the distribution is smoother, especially when the quality of the current frame is poor, the weight allocated to the adjacent frame is close to, even higher than, the current frame, so as to increase the detection accuracy of the current frame. From fig. 6, it can be known that the quality of the current frame is poor, but the weight assigned to the current frame by the FGFA is much larger than the weights of other frames, so that the final aggregation effect is not good, the method of the present invention increases the weights of some frames with good quality, and finally successfully detects the object to be detected, as shown in fig. 8.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (6)

1. A frame-level feature aggregation method for video object detection is characterized by comprising the following steps: the method comprises the following steps:
step 1: extracting frame level features; extracting deep features from the single-frame image by using ResNet-101 as a feature network of the whole detection frame;
and 2, step: extracting an optical flow between frames by using a fully-convoluted optical flow network FlowNet;
and step 3: aligning the frame level features of the adjacent frames to the current frame based on the optical flow, namely realizing the propagation of the frame level features;
and 4, step 4: aggregating the multi-frame features by using the scaled cosine similarity weight to generate aggregated features;
the extraction process of the scaled cosine similarity weight comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
(2) Extracting a scaling factor from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain the frame-level scaled cosine similarity weight, which is taken as the frame-level aggregation weight.
2. The method of claim 1, wherein the frame-level feature aggregation method for video object detection comprises: the specific method for extracting deep features from the single-frame image in the step 1 comprises the following steps:
given a current frame i of the video and two adjacent frames i-t and i+t, where t is the frame interval, the image data I_i, I_{i-t} and I_{i+t} of the three frames are input into the feature network
N_feat, and the output frame-level features are:

f_i = N_feat(I_i)    (1)

f_{i-t} = N_feat(I_{i-t})    (2)

f_{i+t} = N_feat(I_{i+t})    (3)

wherein f_i, f_{i-t} and f_{i+t} are respectively the frame-level features of the current frame i and of the two adjacent frames i-t and i+t.
3. The method of claim 2, wherein the frame-level feature aggregation method for video object detection comprises: the specific method for extracting the inter-frame optical flow by using the fully convolutional optical flow network FlowNet in step 2 comprises the following steps:
given the input images I_i and I_{i-t} of the current frame i and the adjacent frame i-t, the output optical flow is:
M_{i-t→i} = F(I_i, I_{i-t})    (4)

wherein F denotes the optical flow network FlowNet and M_{i-t→i} denotes the optical flow between frame i and frame i-t;

given the input images I_i and I_{i+t} of the current frame i and the adjacent frame i+t, the output optical flow is:

M_{i+t→i} = F(I_i, I_{i+t})    (5)

wherein M_{i+t→i} denotes the optical flow between frame i and frame i+t.
4. The method of claim 3, wherein the frame-level feature aggregation method for video object detection comprises: the specific method of the step 3 comprises the following steps:
given the frame-level feature f_{i-t} of the adjacent frame i-t and the optical flow M_{i-t→i} between it and the current frame, the features propagated from frame i-t to frame i are:
f_{i-t→i} = Warp(f_{i-t}, M_{i-t→i})    (6)

wherein f_{i-t→i} denotes the frame-level features propagated from frame i-t to the current frame i, and Warp(·,·) denotes the warp mapping function, which maps the value at each position p of the feature f_{i-t} of frame i-t to the corresponding position p + δp of the current frame i, δp denoting the position offset; the warp mapping is implemented by bilinear sampling, and for a given channel c of the feature f_{i-t}, the mapped feature at position p is:

f^c_{i-t→i}(p) = Σ_q G(q, p + δp) · f^c_{i-t}(q)    (7)

wherein p = (p_x, p_y) denotes the coordinates of position p, δp = M_{i-t→i}(p), q enumerates all spatial locations of the feature, and G(·,·) denotes the bilinear interpolation kernel; since the optical flow M_{i-t→i} is a two-channel offset in the x and y directions, G(·,·) is two-dimensional and is split into two corresponding one-dimensional interpolation kernels, as shown in the following equation:

G(q, p + δp) = g(q_x, p_x + δp_x) · g(q_y, p_y + δp_y)    (8)

wherein q = (q_x, q_y) denotes the coordinates of position q and g(·,·) is the one-dimensional bilinear interpolation function;

similarly, given the frame-level feature f_{i+t} of the adjacent frame i+t and the optical flow M_{i+t→i} between it and the current frame, the features propagated from frame i+t to frame i are:

f_{i+t→i} = Warp(f_{i+t}, M_{i+t→i})    (9)

wherein f_{i+t→i} denotes the frame-level features propagated from frame i+t to the current frame i.
5. The method of claim 4, wherein the frame-level feature aggregation is performed according to a video object detection algorithm, and comprises: the specific method of the step 4 comprises the following steps:
the frame-level feature f_i of the current frame i, the feature f_{i-t→i} propagated from frame i-t to frame i, and the feature f_{i+t→i} propagated from frame i+t to frame i are aggregated to obtain the aggregated frame-level feature, according to the following formula:
f̄_i = Σ_{j=i-T}^{i+T} w_{j→i} ⊙ f_{j→i}    (10)

wherein f̄_i denotes the aggregated frame-level feature, w_{j→i} denotes the frame-level aggregation weight, i.e. the scaled cosine similarity weight, ⊙ denotes element-wise multiplication, and T is the maximum frame interval of the aggregation.
6. The method of claim 5, wherein the frame-level feature aggregation method for video object detection comprises: the specific method for modeling the quality distribution of the optical flow by using the cosine similarity weight in the step 4 comprises the following steps:
using a shallow mapping network
E, the features are mapped into a dimension dedicated to computing similarity, as shown in the following equations:

f^e_i = E(f_i)    (11)

f^e_{i-t→i} = E(f_{i-t→i})    (12)

wherein f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and E denotes the mapping network;
the specific method for extracting the scaling factor from the appearance characteristics of the video frame and modeling the quality distribution of the video frame to obtain the scaling cosine similarity weight at the frame level comprises the following steps:
given the current-frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:
w_{i-t→i}(p) = ( f^e_{i-t→i}(p) · f^e_i(p) ) / ( |f^e_{i-t→i}(p)| · |f^e_i(p)| )    (13)

the channel dimension is summed out in formula (13), so the output weight is a two-dimensional matrix of size W × H, where W and H are the width and height of the features; this reduces the number of weight parameters to be learned and makes the network easier to train;
given the current-frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling network
S outputs the weight scaling factor:

λ_{i-t} = S(f_i, f_{i-t→i})    (14)

since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a 2-dimensional matrix over the spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the value at each spatial position p is computed as:

w̃^c_{i-t→i}(p) = λ^c_{i-t} ⊗ w_{i-t→i}(p)    (15)

wherein ⊗ denotes multiplication at the channel level; accordingly, the scaled weight of the feature propagated from frame i+t is:

w̃^c_{i+t→i}(p) = λ^c_{i+t} ⊗ w_{i+t→i}(p)    (16)

the scaled cosine similarity weights are thus obtained through formulas (14)-(16); finally, the weights at each position p are normalized across the multiple frames such that Σ_{j=i-T}^{i+T} w_{j→i}(p) = 1, the normalization being performed with a SoftMax function;
the mapping network and the weight scaling network share the first two layers, two continuous convolution layers of 1 multiplied by 1 convolution and 3 multiplied by 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure FDA0002006464400000038
The second branch is also 1 × 1 convolution, and then is connected with a global average pooling layer to serve as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector, so as to measure the importance degree of the features and control the scaling of the feature time aggregation weight.
CN201910230227.8A 2019-03-26 2019-03-26 Frame level feature aggregation method for video target detection Active CN109993095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910230227.8A CN109993095B (en) 2019-03-26 2019-03-26 Frame level feature aggregation method for video target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910230227.8A CN109993095B (en) 2019-03-26 2019-03-26 Frame level feature aggregation method for video target detection

Publications (2)

Publication Number Publication Date
CN109993095A CN109993095A (en) 2019-07-09
CN109993095B true CN109993095B (en) 2022-12-20

Family

ID=67131522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910230227.8A Active CN109993095B (en) 2019-03-26 2019-03-26 Frame level feature aggregation method for video target detection

Country Status (1)

Country Link
CN (1) CN109993095B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533068B (en) * 2019-07-22 2020-07-17 杭州电子科技大学 Image object identification method based on classification convolutional neural network
CN110619655B (en) * 2019-08-23 2022-03-29 深圳大学 Target tracking method and device integrating optical flow information and Simese framework
CN110807437B (en) * 2019-11-08 2023-01-03 腾讯科技(深圳)有限公司 Video granularity characteristic determination method and device and computer-readable storage medium
CN111239757B (en) * 2020-03-12 2022-04-19 湖南大学 Automatic extraction method and system for road surface characteristic parameters
CN111583300B (en) * 2020-04-23 2023-04-25 天津大学 Target tracking method based on enrichment target morphological change update template
CN111476314B (en) * 2020-04-27 2023-03-07 中国科学院合肥物质科学研究院 Fuzzy video detection method integrating optical flow algorithm and deep learning
CN112307872A (en) * 2020-06-12 2021-02-02 北京京东尚科信息技术有限公司 Method and device for detecting target object
CN111814922B (en) * 2020-09-07 2020-12-25 成都索贝数码科技股份有限公司 Video clip content matching method based on deep learning
CN112966581B (en) * 2021-02-25 2022-05-27 厦门大学 Video target detection method based on internal and external semantic aggregation
CN113223044A (en) * 2021-04-21 2021-08-06 西北工业大学 Infrared video target detection method combining feature aggregation and attention mechanism
CN113435270A (en) * 2021-06-10 2021-09-24 上海商汤智能科技有限公司 Target detection method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354548A (en) * 2015-10-30 2016-02-24 武汉大学 Surveillance video pedestrian re-recognition method based on ImageNet retrieval
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152627B2 (en) * 2017-03-20 2018-12-11 Microsoft Technology Licensing, Llc Feature flow for video recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354548A (en) * 2015-10-30 2016-02-24 武汉大学 Surveillance video pedestrian re-recognition method based on ImageNet retrieval
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream

Also Published As

Publication number Publication date
CN109993095A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
CN109993095B (en) Frame level feature aggregation method for video target detection
Zhu et al. Towards high performance video object detection
US11610082B2 (en) Method and apparatus for training neural network model used for image processing, and storage medium
CN108986050B (en) Image and video enhancement method based on multi-branch convolutional neural network
Uhrig et al. Sparsity invariant cnns
US20190311202A1 (en) Video object segmentation by reference-guided mask propagation
CN110163246A (en) The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN111968123B (en) Semi-supervised video target segmentation method
CN110852267B (en) Crowd density estimation method and device based on optical flow fusion type deep neural network
CN104867111B (en) A kind of blind deblurring method of non-homogeneous video based on piecemeal fuzzy core collection
CN110120065B (en) Target tracking method and system based on hierarchical convolution characteristics and scale self-adaptive kernel correlation filtering
US20200349391A1 (en) Method for training image generation network, electronic device, and storage medium
Li et al. Data priming network for automatic check-out
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
TW202042176A (en) Method, device and electronic equipment for image generation network training and image processing
CN112819853B (en) Visual odometer method based on semantic priori
CN110610486A (en) Monocular image depth estimation method and device
CN104050685A (en) Moving target detection method based on particle filtering visual attention model
CN107194948B (en) Video significance detection method based on integrated prediction and time-space domain propagation
Zhang et al. Modeling long-and short-term temporal context for video object detection
Liu et al. ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation
Xiang et al. Deep optical flow supervised learning with prior assumptions
CN114170286A (en) Monocular depth estimation method based on unsupervised depth learning
Zhang et al. Multi-frame pyramid refinement network for video frame interpolation
CN114140469A (en) Depth hierarchical image semantic segmentation method based on multilayer attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant