CN109993095B - Frame level feature aggregation method for video target detection - Google Patents

Frame level feature aggregation method for video target detection

Info

Publication number
CN109993095B
CN109993095B (application CN201910230227.8A)
Authority
CN
China
Prior art keywords
frame
level
feature
weight
features
Prior art date
Legal status
Active
Application number
CN201910230227.8A
Other languages
Chinese (zh)
Other versions
CN109993095A (en)
Inventor
张斌
柳波
郭军
刘晨
王嘉怡
李薇
张娅杰
王馨悦
刘文凤
陈文博
侯帅
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201910230227.8A
Publication of CN109993095A
Application granted
Publication of CN109993095B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides a frame-level feature aggregation method for video object detection, and relates to the technical field of computer vision. The method first extracts deep features from each single-frame image through a feature network; it then extracts the optical flow between frames with the optical flow network FlowNet and aligns the frame-level features of the adjacent frames to the current frame based on the optical flow, realizing frame-level feature propagation; finally, it computes scaled cosine similarity weights through a mapping network and a weight scaling network and aggregates the multi-frame features with these weights to generate the aggregated features. The method makes the weight assignment more reasonable, and by feeding the aggregated features into a video object detection network it achieves good detection accuracy and robustness in complex scenes such as motion blur, low resolution, camera zoom and occlusion.

Description

Frame level feature aggregation method for video target detection
Technical Field
The invention relates to the technical field of computer vision, and in particular to a frame-level feature aggregation method for video object detection.
Background
In recent years, video object detection has developed rapidly with the rise of deep learning. Since object detection in video has more temporal context and motion information than object detection in a single image, many studies have sought to exploit this information to improve the performance of video object detection. A video object detection method automatically analyzes and processes the video sequence acquired by a camera so as to detect, classify, identify and track the moving objects in the monitored scene. Most existing feature-level video object detection methods use some form of frame-level feature aggregation. The purpose of frame-level feature aggregation is to exploit the temporal and motion information between video frames to improve detection accuracy: the frame-level features of adjacent frames are propagated to the current frame and aggregated with the current-frame features by weighting, giving the current-frame features stronger representation ability. However, whether it is MANet, which directly averages multi-frame features, or the existing FGFA, which uses the cosine similarity between the propagated features and the current-frame features as the aggregation weight, the weight assignment does not take the appearance-quality distribution of the video frames into account, and pixel-level weights with few parameters are lacking. In complex scenes such as motion blur, low resolution, camera zoom and occlusion, these methods cannot accurately detect moving objects, so the false detection rate and missed detection rate of video object detection are high.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a frame-level feature aggregation method for video object detection, which is used for implementing frame-level feature aggregation on a video object.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a frame-level feature aggregation method for video object detection comprises the following steps:
step 1: extracting frame level features;
extracting deep features from the single-frame image by using ResNet-101 as a feature network of the whole detection frame;
Given a current frame i of the video and two adjacent frames i-t and i+t, where t is the frame interval, the image data I_i, I_{i-t} and I_{i+t} of the three frames are input into the feature network
N_feat, and the output frame-level features are:

f_i = N_feat(I_i)    (1)

f_{i-t} = N_feat(I_{i-t})    (2)

f_{i+t} = N_feat(I_{i+t})    (3)

wherein f_i, f_{i-t} and f_{i+t} respectively denote the frame-level features of the current frame i and of the two adjacent frames i-t and i+t;
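As an illustrative sketch of step 1 (an assumption-laden example, not the patent's exact implementation), the following Python/PyTorch snippet extracts frame-level features for the three frames with a ResNet-101 backbone truncated before its classification head; FeatureNet and feat_net are placeholder names, and the modifications to res5 described in the embodiment below (stride 1, dilated convolution down to 1024 channels) are only noted in a comment.

```python
import torch
import torch.nn as nn
import torchvision

class FeatureNet(nn.Module):
    """N_feat: ResNet-101 backbone used as the frame-level feature extractor.

    Sketch only: the patent additionally sets the res5 stride to 1 and appends a
    dilated 3x3 convolution reducing the output to 1024 channels; here the
    standard torchvision ResNet-101 is simply truncated before avgpool/fc.
    """
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # keep conv1..layer4

    def forward(self, image):          # image: (B, 3, H, W)
        return self.body(image)        # features: (B, 2048, H/32, W/32)

feat_net = FeatureNet().eval()
# I_i, I_{i-t}, I_{i+t}: current frame and its two neighbours at frame interval t
I_i, I_im, I_ip = (torch.randn(1, 3, 600, 1000) for _ in range(3))
with torch.no_grad():
    f_i, f_im, f_ip = feat_net(I_i), feat_net(I_im), feat_net(I_ip)   # Eqs. (1)-(3)
```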
and 2, step: extracting an optical flow between frames by using a fully-convoluted optical flow network FlowNet;
Given the input images I_i and I_{i-t} of the current frame i and the adjacent frame i-t, the output optical flow is:
M_{i-t→i} = F(I_i, I_{i-t})    (4)

wherein F denotes the optical flow network FlowNet and M_{i-t→i} denotes the optical flow between frame i and frame i-t;

Given the input images I_i and I_{i+t} of the current frame i and the adjacent frame i+t, the output optical flow is:

M_{i+t→i} = F(I_i, I_{i+t})    (5)

wherein M_{i+t→i} denotes the optical flow between frame i and frame i+t;
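The sketch below only fixes the calling convention of step 2; FlowNetS here is a stand-in module (a single convolution), since the real FlowNet "simple" network is a full encoder-decoder pre-trained on Flying Chairs and is not reproduced in the patent text.

```python
import torch
import torch.nn as nn

class FlowNetS(nn.Module):
    """Placeholder for the FlowNet 'simple' optical-flow network of Eqs. (4)-(5):
    two RGB frames stacked along the channel axis in, a 2-channel (dx, dy)
    flow field out. The single convolution below is NOT the real architecture."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 2, kernel_size=7, padding=3)

    def forward(self, img_ref, img_src):
        return self.net(torch.cat([img_ref, img_src], dim=1))   # (B, 2, H, W)

flow_net = FlowNetS()
I_i, I_im, I_ip = (torch.randn(1, 3, 600, 1000) for _ in range(3))
M_im_to_i = flow_net(I_i, I_im)   # optical flow between frame i and frame i-t, Eq. (4)
M_ip_to_i = flow_net(I_i, I_ip)   # optical flow between frame i and frame i+t, Eq. (5)
```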
and 3, step 3: aligning the frame level features of the adjacent frames to the current frame based on the optical flow, namely realizing the propagation of the frame level features;
Given the frame-level feature f_{i-t} of the adjacent frame i-t and the optical flow M_{i-t→i} between it and the current frame, the features propagated from frame i-t to frame i are:
f_{i-t→i} = Warp(f_{i-t}, M_{i-t→i})    (6)

wherein f_{i-t→i} denotes the frame-level features propagated from frame i-t to the current frame i, and Warp(·,·) denotes the warp mapping function, which maps the value at each position p of the feature f_{i-t} of frame i-t to the corresponding position p + δp of the current frame i, δp denoting the position offset; the warp mapping is implemented by bilinear sampling, and for a given channel c of the feature f_{i-t}, the mapped feature at position p is:

f^c_{i-t→i}(p) = Σ_q G(q, p + δp) · f^c_{i-t}(q)    (7)

wherein p = (p_x, p_y) denotes the coordinates of position p, δp = M_{i-t→i}(p), q enumerates all spatial locations of the feature, and G(·,·) denotes the bilinear interpolation kernel; since the optical flow M_{i-t→i} is a two-channel offset in the x and y directions, G(·,·) is two-dimensional and is split into two corresponding one-dimensional interpolation kernels, as shown in the following equation:

G(q, p + δp) = g(q_x, p_x + δp_x) · g(q_y, p_y + δp_y)    (8)

wherein q = (q_x, q_y) denotes the coordinates of position q and g(·,·) is the one-dimensional bilinear interpolation function;

Similarly, given the frame-level feature f_{i+t} of the adjacent frame i+t and the optical flow M_{i+t→i} between it and the current frame, the features propagated from frame i+t to frame i are:

f_{i+t→i} = Warp(f_{i+t}, M_{i+t→i})    (9)

wherein f_{i+t→i} denotes the frame-level features propagated from frame i+t to the current frame i;
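A minimal sketch of the warp of step 3, assuming the flow has already been resized to the feature resolution and is expressed in feature-grid pixels with (dx, dy) channel order; torch.nn.functional.grid_sample supplies the bilinear kernel of Eqs. (7)-(8) (g(a, b) = max(0, 1 - |a - b|)). warp_features is an illustrative name.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_src, flow):
    """Warp the frame-level features of an adjacent frame onto the current frame,
    Eq. (6)/(9).

    feat_src: (B, C, H, W) features f_{i-t} (or f_{i+t})
    flow:     (B, 2, H, W) offsets (dx, dy) at feature resolution, i.e. delta p
    Returns the propagated features f_{i-t->i}; grid_sample performs the
    bilinear sampling of Eqs. (7)-(8).
    """
    b, _, h, w = feat_src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat_src)   # (2, H, W), p = (x, y)
    coords = base.unsqueeze(0) + flow                          # p + delta p
    # normalize sampling coordinates to [-1, 1] as grid_sample expects
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (B, H, W, 2)
    return F.grid_sample(feat_src, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

f_im = torch.randn(1, 1024, 38, 63)            # stand-in for f_{i-t}
M_im_to_i = torch.randn(1, 2, 38, 63)          # stand-in flow at feature resolution
f_im_to_i = warp_features(f_im, M_im_to_i)     # f_{i-t -> i}
```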
and 4, step 4: aggregating the multi-frame features by using the scaled cosine similarity weight to generate aggregated features;
The frame-level feature f_i of the current frame i, the feature f_{i-t→i} propagated from frame i-t to frame i, and the feature f_{i+t→i} propagated from frame i+t to frame i are aggregated to obtain the aggregated frame-level feature, according to the following formula:
f̄_i = Σ_{j=i-T}^{i+T} w_{j→i} ⊙ f_{j→i}    (10)

wherein f̄_i denotes the aggregated frame-level feature, w_{j→i} denotes the frame-level aggregation weight, ⊙ denotes element-wise multiplication, and T is the maximum frame interval of the aggregation;
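A sketch of the aggregation of formula (10), assuming the 2T+1 per-frame weights have already been computed and SoftMax-normalized at every channel and position (as described next); aggregate_features is an illustrative name.

```python
import torch

def aggregate_features(propagated, weights):
    """Weighted frame-level aggregation, formula (10).

    propagated: list of 2T+1 tensors f_{j->i}, each (B, C, H, W); the current
                frame contributes its own feature f_i (identity propagation).
    weights:    list of 2T+1 tensors w_{j->i}, each (B, C, H, W), normalized
                over the frame dimension so they sum to 1 at every location.
    """
    agg = torch.zeros_like(propagated[0])
    for f_j, w_j in zip(propagated, weights):
        agg = agg + w_j * f_j            # element-wise weighting and accumulation
    return agg
```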
the calculation method of the frame level aggregation weight comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
using a shallow mapping network
E, the features are mapped into a dimension dedicated to computing similarity, as shown in the following equations:

f^e_i = E(f_i)    (11)

f^e_{i-t→i} = E(f_{i-t→i})    (12)

wherein f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and E denotes the mapping network;

Given the current-frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

w_{i-t→i}(p) = ( f^e_{i-t→i}(p) · f^e_i(p) ) / ( |f^e_{i-t→i}(p)| · |f^e_i(p)| )    (13)
the weights output by the formula (13) are summed along the channel, so that the dimensionality of the output weights is changed into a two-dimensional matrix, the dimensionality is W multiplied by H, and W and H are respectively the width and the height of the features, so that the number of weight parameters needing to be learned is reduced, and the network is easier to train;
(2) Directly extracting a scaling factor from the appearance features of the video frames to model the quality distribution of the video frames, obtaining the frame-level scaled cosine similarity weight, which is taken as the frame-level aggregation weight of step 4;
Given the current-frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling network
S outputs the weight scaling factor:

λ_{i-t} = S(f_i, f_{i-t→i})    (14)

Since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a 2-dimensional matrix over the spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the value at each spatial position p is computed as:

w̃^c_{i-t→i}(p) = λ^c_{i-t} ⊗ w_{i-t→i}(p)    (15)

wherein ⊗ denotes multiplication at the channel level; accordingly, the scaled weight of the feature propagated from frame i+t is:

w̃^c_{i+t→i}(p) = λ^c_{i+t} ⊗ w_{i+t→i}(p)    (16)

The scaled cosine similarity weights are thus obtained through formulas (14)-(16); finally, the weights at each position p are normalized across the multiple frames such that Σ_{j=i-T}^{i+T} w_{j→i}(p) = 1, the normalization being performed with a SoftMax function;
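A sketch of formulas (14)-(16) and the SoftMax normalization, assuming the channel-level scaling factor has already been produced by the weight scaling network; scale_weight and normalize_over_frames are illustrative names.

```python
import torch
import torch.nn.functional as F

def scale_weight(cos_weight, scale_vec):
    """Combine the channel-level scaling factor with the 2-D cosine weight,
    formulas (15)-(16).

    cos_weight: (B, 1, H, W) cosine-similarity map w_{j->i}
    scale_vec:  (B, C) channel-level scaling factor lambda_j
    Returns pixel-level weights of shape (B, C, H, W).
    """
    return scale_vec[:, :, None, None] * cos_weight        # channel x spatial broadcast

def normalize_over_frames(scaled_weights):
    """SoftMax-normalize the weights across the 2T+1 frames so that they sum to 1
    at every channel and spatial position."""
    stacked = torch.stack(scaled_weights, dim=0)            # (2T+1, B, C, H, W)
    return list(F.softmax(stacked, dim=0).unbind(dim=0))
```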
the mapping network and the weight scaling network share the first two layers, two continuous convolution layers of 1 multiplied by 1 convolution and 3 multiplied by 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure BDA00020064644100000313
The second branch is also 1 multiplied by 1 convolution, and then is connected with a global average pooling layer to be used as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of ResNet-101 output feature vectors for measuring the importance degree of features and controllingScaling the feature time aggregation weights.
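The following module sketches the shared-trunk mapping and weight-scaling subnets described above. Only the 1024-dimensional input and the 1024-dimensional scaling output are fixed by the text; the intermediate width (512), the embedding width (2048) and the ReLU nonlinearities are assumptions, and how f_i and f_{i-t→i} are combined before the scaling branch (formula (14) takes both) is left open here.

```python
import torch
import torch.nn as nn

class WeightSubnets(nn.Module):
    """Shared trunk (1x1 conv then 3x3 conv) with two branches:
    branch 1: 1x1 conv -> embedded features for the cosine similarity (mapping network E);
    branch 2: 1x1 conv -> global average pooling -> 1024-d channel scaling vector (network S).
    Channel widths other than the 1024-d input/output are assumptions."""
    def __init__(self, in_ch=1024, mid_ch=512, emb_ch=2048):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.embed = nn.Conv2d(mid_ch, emb_ch, kernel_size=1)      # mapping branch
        self.scale = nn.Sequential(                                # weight-scaling branch
            nn.Conv2d(mid_ch, in_ch, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, feat):                  # feat: (B, 1024, H, W)
        x = self.trunk(feat)
        f_emb = self.embed(x)                 # features fed to the cosine similarity
        lam = self.scale(x).flatten(1)        # (B, 1024) channel-level scaling factor
        return f_emb, lam

subnets = WeightSubnets()
f_emb, lam = subnets(torch.randn(1, 1024, 38, 63))
```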
The beneficial effects of the above technical solution are as follows: the frame-level feature aggregation method for video object detection provided by the invention is a scaled cosine-similarity weight aggregation method that models the video frame quality and the optical flow quality simultaneously, so the weight assignment is more reasonable; the pixel-level weights are generated by combining a channel-level weight scaling factor with the 2-dimensional cosine similarity weights, so pixel-level weights are learned without increasing the order of magnitude of the parameters; and by feeding the aggregated features into a video object detection network, good detection accuracy and robustness are obtained in complex scenes such as motion blur, low resolution, camera zoom and occlusion.
Drawings
Fig. 1 is a flowchart of a frame-level feature aggregation method for video object detection according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a frame-level feature aggregation process according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a process of extracting a scaled cosine similarity weight according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a FlowNet optical flow network according to an embodiment of the present invention;
FIG. 5 is a comparison chart of cosine similarity weight distributions of different methods according to an embodiment of the present invention;
fig. 6 is a visualization diagram of a distribution of weight scaling factors according to an embodiment of the present invention;
FIG. 7 is a comparison of normalized FGFA weights and scaled cosine similarity weights provided by an embodiment of the present invention;
fig. 8 is a video example of the method of the present invention for improving the target detection accuracy of video compared to FGFA according to the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.
In this embodiment, taking a video data set as an example, the frame-level feature aggregation method for video object detection is used to aggregate the frame-level features of the video data;
a frame-level feature aggregation method for video object detection, as shown in fig. 1 and fig. 2, includes the following steps:
step 1: extracting frame level features;
extracting deep features from the single-frame image by using ResNet-101 as a feature network of the whole detection frame;
Given a current frame i and two adjacent frames i-t and i+t, where t is the frame interval, the image data I_i, I_{i-t} and I_{i+t} of the three frames are input into the feature network
N_feat, and the output frame-level features are:

f_i = N_feat(I_i)    (1)

f_{i-t} = N_feat(I_{i-t})    (2)

f_{i+t} = N_feat(I_{i+t})    (3)

wherein f_i, f_{i-t} and f_{i+t} respectively denote the frame-level features of the current frame i and of the two adjacent frames i-t and i+t;
step 2: extracting an optical flow between frames by using a fully-convoluted optical flow network FlowNet;
Given the input images I_i and I_{i-t} of the current frame i and the adjacent frame i-t, the output optical flow is:
M_{i-t→i} = F(I_i, I_{i-t})    (4)

wherein F denotes the optical flow network FlowNet and M_{i-t→i} denotes the optical flow between frame i and frame i-t;

Given the input images I_i and I_{i+t} of the current frame i and the adjacent frame i+t, the output optical flow is:

M_{i+t→i} = F(I_i, I_{i+t})    (5)

wherein M_{i+t→i} denotes the optical flow between frame i and frame i+t;
and step 3: aligning the frame level features of the adjacent frames to the current frame based on the optical flow, namely realizing the propagation of the frame level features;
Given the frame-level feature f_{i-t} of the adjacent frame i-t and the optical flow M_{i-t→i} between it and the current frame, the features propagated from frame i-t to frame i are:
f_{i-t→i} = Warp(f_{i-t}, M_{i-t→i})    (6)

wherein f_{i-t→i} denotes the frame-level features propagated from frame i-t to the current frame i, and Warp(·,·) denotes the warp mapping function, which maps the value at each position p of the feature f_{i-t} of frame i-t to the corresponding position p + δp of the current frame i, δp denoting the position offset; the warp mapping is implemented by bilinear sampling, and for a given channel c of the feature f_{i-t}, the mapped feature at position p is:

f^c_{i-t→i}(p) = Σ_q G(q, p + δp) · f^c_{i-t}(q)    (7)

wherein p = (p_x, p_y) denotes the coordinates of position p, δp = M_{i-t→i}(p), q enumerates all spatial locations of the feature, and G(·,·) denotes the bilinear interpolation kernel; since the optical flow M_{i-t→i} is a two-channel offset in the x and y directions, G(·,·) is two-dimensional and is split into two corresponding one-dimensional interpolation kernels, as shown in the following equation:

G(q, p + δp) = g(q_x, p_x + δp_x) · g(q_y, p_y + δp_y)    (8)

wherein q = (q_x, q_y) denotes the coordinates of position q and g(·,·) is the one-dimensional bilinear interpolation function;

Similarly, given the frame-level feature f_{i+t} of the adjacent frame i+t and the optical flow M_{i+t→i} between it and the current frame, the features propagated from frame i+t to frame i are:

f_{i+t→i} = Warp(f_{i+t}, M_{i+t→i})    (9)

wherein f_{i+t→i} denotes the frame-level features propagated from frame i+t to the current frame i;
and 4, step 4: aggregating the multi-frame features by using the scaled cosine similarity weight to generate aggregated features;
The frame-level feature f_i of the current frame i, the feature f_{i-t→i} propagated from frame i-t to frame i, and the feature f_{i+t→i} propagated from frame i+t to frame i are aggregated to obtain the aggregated frame-level feature, according to the following formula:
f̄_i = Σ_{j=i-T}^{i+T} w_{j→i} ⊙ f_{j→i}    (10)

wherein f̄_i denotes the aggregated frame-level feature, w_{j→i} denotes the frame-level aggregation weight, i.e. the scaled cosine similarity weight, ⊙ denotes element-wise multiplication, and T is the maximum frame interval of the aggregation;
the process of extracting the scaled cosine similarity weights is shown in fig. 3, and specifically includes:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
using a shallow mapping network
E, the features are mapped into a dimension dedicated to computing similarity, as shown in the following equations:

f^e_i = E(f_i)    (11)

f^e_{i-t→i} = E(f_{i-t→i})    (12)

wherein f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and E denotes the mapping network;

Given the current-frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

w_{i-t→i}(p) = ( f^e_{i-t→i}(p) · f^e_i(p) ) / ( |f^e_{i-t→i}(p)| · |f^e_i(p)| )    (13)
the weights output by the formula (14) are summed along the channel, so that the dimensionality of the output weights is changed into a two-dimensional matrix, the dimensionality is W multiplied by H, and W and H are respectively the width and the height of the features, so that the number of weight parameters needing to be learned is reduced, and the network is easier to train;
(2) Directly extracting a scaling factor from the appearance features of the video frames to model the quality distribution of the video frames, obtaining the frame-level scaled cosine similarity weight, which is taken as the frame-level aggregation weight;
Given the current-frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling network
S outputs the weight scaling factor:

λ_{i-t} = S(f_i, f_{i-t→i})    (14)

Since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a 2-dimensional matrix over the spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the value at each spatial position p is computed as:

w̃^c_{i-t→i}(p) = λ^c_{i-t} ⊗ w_{i-t→i}(p)    (15)

wherein ⊗ denotes multiplication at the channel level; accordingly, the scaled weight of the feature propagated from frame i+t is:

w̃^c_{i+t→i}(p) = λ^c_{i+t} ⊗ w_{i+t→i}(p)    (16)

The scaled cosine similarity weights are thus obtained through formulas (14)-(16); finally, the weights at each position p are normalized across the multiple frames such that Σ_{j=i-T}^{i+T} w_{j→i}(p) = 1, the normalization being performed with a SoftMax function;
the mapping network and weight scalingThe first two layers are shared by the network, two continuous convolution layers of 1 multiplied by 1 convolution and 3 multiplied by 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure BDA00020064644100000613
The second branch is also 1 × 1 convolution, and then is connected with a global average pooling layer to serve as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector, so as to measure the importance degree of the features and control the scaling of the feature time aggregation weight.
In this embodiment, the data set used is the large-scale video object detection benchmark ImageNet VID. The VID data set includes 3862 training-set videos, on which the model is trained, and 555 validation-set videos, on which the performance of the model is evaluated. The training data set is fully labeled, each video has a frame rate of 25 fps or 30 fps, and the data set contains 30 classes, a subset of the ImageNet DET classes. The VID validation set is divided into three parts according to the motion speed of the objects, measured by the IoU between the ground-truth boxes of adjacent frames: slow (IoU > 0.9), medium (0.7 ≤ IoU ≤ 0.9) and fast (IoU < 0.7); this embodiment is evaluated on all three subsets.
In this embodiment, the optical flow network is the popular FlowNet network (the "simple" version), pre-trained on the Flying Chairs data set. The FlowNet structure is shown in fig. 4, where pooling denotes a pooling layer whose parameters are, from left to right, window size, stride and pooling type, and conv denotes a convolution layer whose parameters are, from left to right, padding size, convolution kernel size, stride and number of convolution kernels. To match the optical flow size with the feature size, this embodiment downsamples the optical flow output by FlowNet with an average pooling layer with a window size of 2 and a stride of 2.
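A sketch of the flow downsampling mentioned above. Only the 2×2 average pooling with stride 2 is stated in the text; halving the flow magnitudes together with the spatial size (the 0.5 factor) is a common convention and is an added assumption here.

```python
import torch
import torch.nn.functional as F

def downsample_flow(flow):
    """Downsample the FlowNet output to the feature resolution with a 2x2
    average pooling of stride 2; the 0.5 factor rescales the (dx, dy) offsets
    to the coarser grid (an assumption, not stated in the text)."""
    return 0.5 * F.avg_pool2d(flow, kernel_size=2, stride=2)

flow = torch.randn(1, 2, 76, 126)          # stand-in FlowNet output
flow_small = downsample_flow(flow)         # (1, 2, 38, 63)
```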
In this embodiment, the feature network is the mainstream deep residual network ResNet-101, pre-trained on the ImageNet 2012 classification data set. The pre-trained ResNet-101 model is transferred to the detection task: the last average pooling layer and the fully connected layer are removed and the convolution layers are retained; to preserve the feature resolution, the stride of the convolution layers of the last module (res5) is set to 1, reducing the output stride of the last residual block from 32 to 16; finally, a dilated convolution (kernel size 3, 1024 convolution kernels) is appended to the output of the network to reduce the number of output feature channels to 1024, and the 1024-dimensional output is split along the channel dimension into two 512-dimensional vectors.
In this embodiment, the candidate region extraction network RPN is used to add two branch convolution layers to the first 512 channels of the 1024 channel feature map output by the ResNet-101, and perform anchor frame classification and bounding box regression, respectively. This embodiment uses 9 anchor frames, corresponding to 3 sizes and 3 aspect ratios, respectively. Non-maximum suppression (NMS) with a IoU threshold of 0.7 is applied during training and testing to select 300 candidate regions for training and testing the R-FCN detector.
During the training phase, this embodiment trains on both the ImageNet DET training set, from which only the images whose classes coincide with the 30 VID classes are taken, and the ImageNet VID training set; the VID validation set is used only for testing. ResNet-101 is pre-trained on the ImageNet CLS data set, FlowNet is pre-trained on the Flying Chairs data set, and both are then fine-tuned on the VID training set.
During training, due to the limitation of memory, the present embodiment only uses three frames for simultaneous training, the frame interval is within 10, the previous frame and the next frame are used as adjacent frames, the intermediate frame is the current frame, and for the DET data set, the current frame and the adjacent frames are the same picture. Only the loss of the current frame is calculated, the classification loss is multi-class cross entropy loss, and the regression loss is smooth L1 loss. In order to improve the convergence speed of the loss function, the embodiment uses an OHEM method to perform back propagation on only the first 128 RoIs with the largest loss when calculating the R-FCN detection loss.
During testing, this embodiment aggregates the features of the 9 frames before and the 9 frames after the current frame to enhance the current-frame features. To increase the detection speed, the features of 18 frames are cached in memory during each detection.
This example was trained on a 1080Ti GPU using stochastic gradient descent with momentum, with a batch size of 1. About 440,000 iterations were performed in total, with an initial learning rate of 0.00025; the learning rate was decayed once, by a factor of 0.1, at iteration 300,000. In both training and testing, the short edge of the input image is resized to 600 pixels and the long edge to at most 1000 pixels.
This embodiment also compares the proposed method with multiple frame-level feature aggregation methods. To ensure a rigorous comparison, all methods use ResNet-101 as the feature network with the same training strategy, and all models are tested after 2 epochs of training. The test results are shown in Table 1, which compares the accuracy of different frame-level feature aggregation methods on the ImageNet VID validation set.
TABLE 1 comparison of accuracy of different methods on ImageNet VID validation set
Method (a): single-frame baseline (no aggregation) — overall mAP 68.5%
Method (b): mean-value aggregation — overall mAP 71.9%
Method (c): cosine similarity weights (FGFA) — overall mAP 72.1%
Method (d): scaled cosine similarity weights (the invention) — overall mAP 72.9%
(overall accuracies as reported in the text; the original table image also reports per-motion-speed results and runtime)
Method (a) in the table represents a single frame detection baseline, no feature aggregation, and only current frame feature detection, corresponding to the R-FCN method on still images. The detection speed of the single-frame detection baseline is high because the features of other frames do not need to be extracted, but the detection accuracy is poor due to unaggregated frame level features, and is only 68.5%;
the method (b) in the table shows that the characteristics are aggregated by using a mean value method, and the pixel level correction method corresponding to the MANet full-motion perception network detection method is simple, the optical flow quality and the video frame quality are not modeled, the error of the propagation characteristics of a fast-moving object is large, and the quality of the characteristics of a current frame is reduced by direct averaging, so that the target detection performance with high motion speed is reduced, but the characteristics of multiple frames are aggregated, and the detection accuracy is improved to 71.9%;
the method (c) in the table represents a method using cosine similarity as the weight of the aggregation feature, and corresponds to the FGFA, because the cosine similarity models the optical flow quality, for an object with a faster motion speed, the worse optical flow quality, the lower weight is allocated, so that the detection performance of the object with a fast motion speed is greatly improved, and meanwhile, the overall detection performance is improved to 72.1%, but because the cosine similarity weight extracted by a mapping network is increased, the running time is increased;
the method (d) in the table represents a frame-level feature aggregation method based on the scaled cosine similarity weight, which is proposed by the present invention, using the scaled cosine similarity as the weight. As can be seen from Table 1, the method of the present invention improves the detection performance of the objects with the moving speeds of medium and slow, and improves the overall performance to 72.9%, which indicates that the detection performance can be improved well when the optical flow quality is good. The weight scaling network of the present invention shares most of the computation and therefore the run-time increase is small.
In order to more intuitively show the effect that the scaling cosine similarity weight makes the weight distribution more reasonable, the weight distribution in multiple frames is visualized by the embodiment.
First, the present embodiment visualizes the cosine similarity weight of the FGFA with the cosine similarity calculated in the weighting method of the present invention, as shown in fig. 5. It can be observed from fig. 5 that the cosine similarity weight remains maximum throughout the current frame and decreases gradually forward and backward along the time dimension, illustrating that the optical flow propagates more errors for frames with longer intervals. However, the cosine similarity weight calculated by the weighting method of the present invention is generally higher than that of the FGFA method (basically all the cosine similarity weights are kept above 0.9), which indicates that the method of the present invention can improve the quality of the optical flow, so that the similarity between the propagation feature and the current frame feature is larger.
Secondly, in order to verify that the weight scaling factor proposed by the present invention indeed models the quality of the video frame, the present embodiment also visualizes the original video frame corresponding to the scaling factor, as shown in fig. 6. As can be seen from the observation of FIG. 6, the quality distribution of the video frame is very uneven, most frames in the 19-frame image have occlusion problems (see two upper right images in FIG. 5), and the occlusion problem of the 1 st frame is lighter, so the quality evaluation is higher, which illustrates that the method of the invention models the quality of the video frame to a certain extent.
Finally, this embodiment visualizes the normalized FGFA weights and the scaled cosine similarity weights as shown in fig. 7. As can be seen from fig. 7, the weight variance of the present invention is smaller, and the distribution is smoother, especially when the quality of the current frame is poor, the weight allocated to the adjacent frame is close to, even higher than, the current frame, so as to increase the detection accuracy of the current frame. From fig. 6, it can be known that the quality of the current frame is poor, but the weight assigned to the current frame by the FGFA is much larger than the weights of other frames, so that the final aggregation effect is not good, the method of the present invention increases the weights of some frames with good quality, and finally successfully detects the object to be detected, as shown in fig. 8.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (6)

1. A frame-level feature aggregation method for video object detection is characterized by comprising the following steps: the method comprises the following steps:
step 1: extracting frame level features; extracting deep features from the single-frame image by using ResNet-101 as a feature network of the whole detection frame;
and 2, step: extracting an optical flow between frames by using a fully-convoluted optical flow network FlowNet;
and step 3: aligning the frame level features of the adjacent frames to the current frame based on the optical flow, namely realizing the propagation of the frame level features;
and 4, step 4: aggregating the multi-frame features by using the scaled cosine similarity weight to generate aggregated features;
the extraction process of the scaled cosine similarity weight comprises the following steps:
(1) Modeling the quality distribution of the optical flow using cosine similarity weights;
(2) Extracting a scaling factor from the appearance features of the video frames and modeling the quality distribution of the video frames to obtain the frame-level scaled cosine similarity weight, which is taken as the frame-level aggregation weight.
2. The method of claim 1, wherein the frame-level feature aggregation method for video object detection comprises: the specific method for extracting deep features from the single-frame image in the step 1 comprises the following steps:
given a current frame i of the video and two adjacent frames i-t and i+t, where t is the frame interval, the image data I_i, I_{i-t} and I_{i+t} of the three frames are input into the feature network
N_feat, and the output frame-level features are:

f_i = N_feat(I_i)    (1)

f_{i-t} = N_feat(I_{i-t})    (2)

f_{i+t} = N_feat(I_{i+t})    (3)

wherein f_i, f_{i-t} and f_{i+t} are respectively the frame-level features of the current frame i and of the two adjacent frames i-t and i+t.
3. The method of claim 2, wherein the frame-level feature aggregation method for video object detection comprises: the specific method for extracting the inter-frame optical flow by using the fully convolutional optical flow network FlowNet in step 2 comprises the following steps:
given the input images I_i and I_{i-t} of the current frame i and the adjacent frame i-t, the output optical flow is:
M_{i-t→i} = F(I_i, I_{i-t})    (4)

wherein F denotes the optical flow network FlowNet and M_{i-t→i} denotes the optical flow between frame i and frame i-t;

given the input images I_i and I_{i+t} of the current frame i and the adjacent frame i+t, the output optical flow is:

M_{i+t→i} = F(I_i, I_{i+t})    (5)

wherein M_{i+t→i} denotes the optical flow between frame i and frame i+t.
4. The method of claim 3, wherein the frame-level feature aggregation method for video object detection comprises: the specific method of the step 3 comprises the following steps:
given the frame-level feature f_{i-t} of the adjacent frame i-t and the optical flow M_{i-t→i} between it and the current frame, the features propagated from frame i-t to frame i are:
f_{i-t→i} = Warp(f_{i-t}, M_{i-t→i})    (6)

wherein f_{i-t→i} denotes the frame-level features propagated from frame i-t to the current frame i, and Warp(·,·) denotes the warp mapping function, which maps the value at each position p of the feature f_{i-t} of frame i-t to the corresponding position p + δp of the current frame i, δp denoting the position offset; the warp mapping is implemented by bilinear sampling, and for a given channel c of the feature f_{i-t}, the mapped feature at position p is:

f^c_{i-t→i}(p) = Σ_q G(q, p + δp) · f^c_{i-t}(q)    (7)

wherein p = (p_x, p_y) denotes the coordinates of position p, δp = M_{i-t→i}(p), q enumerates all spatial locations of the feature, and G(·,·) denotes the bilinear interpolation kernel; since the optical flow M_{i-t→i} is a two-channel offset in the x and y directions, G(·,·) is two-dimensional and is split into two corresponding one-dimensional interpolation kernels, as shown in the following equation:

G(q, p + δp) = g(q_x, p_x + δp_x) · g(q_y, p_y + δp_y)    (8)

wherein q = (q_x, q_y) denotes the coordinates of position q and g(·,·) is the one-dimensional bilinear interpolation function;

similarly, given the frame-level feature f_{i+t} of the adjacent frame i+t and the optical flow M_{i+t→i} between it and the current frame, the features propagated from frame i+t to frame i are:

f_{i+t→i} = Warp(f_{i+t}, M_{i+t→i})    (9)

wherein f_{i+t→i} denotes the frame-level features propagated from frame i+t to the current frame i.
5. The method of claim 4, wherein the frame-level feature aggregation is performed according to a video object detection algorithm, and comprises: the specific method of the step 4 comprises the following steps:
the frame-level feature f_i of the current frame i, the feature f_{i-t→i} propagated from frame i-t to frame i, and the feature f_{i+t→i} propagated from frame i+t to frame i are aggregated to obtain the aggregated frame-level feature, according to the following formula:
f̄_i = Σ_{j=i-T}^{i+T} w_{j→i} ⊙ f_{j→i}    (10)

wherein f̄_i denotes the aggregated frame-level feature, w_{j→i} denotes the frame-level aggregation weight, i.e. the scaled cosine similarity weight, ⊙ denotes element-wise multiplication, and T is the maximum frame interval of the aggregation.
6. The method of claim 5, wherein the frame-level feature aggregation method for video object detection comprises: the specific method for modeling the quality distribution of the optical flow by using the cosine similarity weight in the step 4 comprises the following steps:
using a shallow mapping network
E, the features are mapped into a dimension dedicated to computing similarity, as shown in the following equations:

f^e_i = E(f_i)    (11)

f^e_{i-t→i} = E(f_{i-t→i})    (12)

wherein f^e_i and f^e_{i-t→i} are the mapped features of f_i and f_{i-t→i}, and E denotes the mapping network;
the specific method for extracting the scaling factor from the appearance characteristics of the video frame and modeling the quality distribution of the video frame to obtain the scaling cosine similarity weight at the frame level comprises the following steps:
given the current-frame feature f_i and the feature f_{i-t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:
w_{i-t→i}(p) = ( f^e_{i-t→i}(p) · f^e_i(p) ) / ( |f^e_{i-t→i}(p)| · |f^e_i(p)| )    (13)

the channel dimension is summed out in formula (13), so the output weight is a two-dimensional matrix of size W × H, where W and H are the width and height of the features; this reduces the number of weight parameters to be learned and makes the network easier to train;
given the current-frame feature f_i and the propagated feature f_{i-t→i} of frame i-t, the weight scaling network
S outputs the weight scaling factor:

λ_{i-t} = S(f_i, f_{i-t→i})    (14)

since λ_{i-t} is a channel-level vector while the cosine similarity weight w_{i-t→i} is a 2-dimensional matrix over the spatial plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the value at each spatial position p is computed as:

w̃^c_{i-t→i}(p) = λ^c_{i-t} ⊗ w_{i-t→i}(p)    (15)

wherein ⊗ denotes multiplication at the channel level; accordingly, the scaled weight of the feature propagated from frame i+t is:

w̃^c_{i+t→i}(p) = λ^c_{i+t} ⊗ w_{i+t→i}(p)    (16)

the scaled cosine similarity weights are thus obtained through formulas (14)-(16); finally, the weights at each position p are normalized across the multiple frames such that Σ_{j=i-T}^{i+T} w_{j→i}(p) = 1, the normalization being performed with a SoftMax function;
the mapping network and the weight scaling network share the first two layers, two continuous convolution layers of 1 multiplied by 1 convolution and 3 multiplied by 3 convolution are used after 1024-dimensional vectors output by ResNet-101, and then two branch subnets are connected; the first branch is a 1 × 1 convolution as a mapping network for outputting the mapped features
Figure FDA0002006464400000038
The second branch is also 1 × 1 convolution, and then is connected with a global average pooling layer to serve as a weight scaling network to generate a 1024-dimensional feature vector corresponding to each channel of the ResNet-101 output feature vector, so as to measure the importance degree of the features and control the scaling of the feature time aggregation weight.
CN201910230227.8A 2019-03-26 2019-03-26 Frame level feature aggregation method for video target detection Active CN109993095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910230227.8A CN109993095B (en) 2019-03-26 2019-03-26 Frame level feature aggregation method for video target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910230227.8A CN109993095B (en) 2019-03-26 2019-03-26 Frame level feature aggregation method for video target detection

Publications (2)

Publication Number Publication Date
CN109993095A CN109993095A (en) 2019-07-09
CN109993095B true CN109993095B (en) 2022-12-20

Family

ID=67131522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910230227.8A Active CN109993095B (en) 2019-03-26 2019-03-26 Frame level feature aggregation method for video target detection

Country Status (1)

Country Link
CN (1) CN109993095B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533068B (en) * 2019-07-22 2020-07-17 杭州电子科技大学 Image object identification method based on classification convolutional neural network
CN110619655B (en) * 2019-08-23 2022-03-29 深圳大学 Target tracking method and device integrating optical flow information and Simese framework
CN110807437B (en) * 2019-11-08 2023-01-03 腾讯科技(深圳)有限公司 Video granularity characteristic determination method and device and computer-readable storage medium
CN111239757B (en) * 2020-03-12 2022-04-19 湖南大学 Automatic extraction method and system for road surface characteristic parameters
CN111583300B (en) * 2020-04-23 2023-04-25 天津大学 Target tracking method based on enrichment target morphological change update template
CN111476314B (en) * 2020-04-27 2023-03-07 中国科学院合肥物质科学研究院 Fuzzy video detection method integrating optical flow algorithm and deep learning
CN112307872A (en) * 2020-06-12 2021-02-02 北京京东尚科信息技术有限公司 Method and device for detecting target object
CN111814922B (en) * 2020-09-07 2020-12-25 成都索贝数码科技股份有限公司 Video clip content matching method based on deep learning
CN112966581B (en) * 2021-02-25 2022-05-27 厦门大学 Video target detection method based on internal and external semantic aggregation
CN113223044A (en) * 2021-04-21 2021-08-06 西北工业大学 Infrared video target detection method combining feature aggregation and attention mechanism
CN113435270A (en) * 2021-06-10 2021-09-24 上海商汤智能科技有限公司 Target detection method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354548A (en) * 2015-10-30 2016-02-24 武汉大学 Surveillance video pedestrian re-recognition method based on ImageNet retrieval
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152627B2 (en) * 2017-03-20 2018-12-11 Microsoft Technology Licensing, Llc Feature flow for video recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354548A (en) * 2015-10-30 2016-02-24 武汉大学 Surveillance video pedestrian re-recognition method based on ImageNet retrieval
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream

Also Published As

Publication number Publication date
CN109993095A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
CN109993095B (en) Frame level feature aggregation method for video target detection
Zhu et al. Towards high performance video object detection
US11610082B2 (en) Method and apparatus for training neural network model used for image processing, and storage medium
CN108986050B (en) Image and video enhancement method based on multi-branch convolutional neural network
Uhrig et al. Sparsity invariant cnns
US20190311202A1 (en) Video object segmentation by reference-guided mask propagation
CN110163246A (en) The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN111968123B (en) Semi-supervised video target segmentation method
CN110852267B (en) Crowd density estimation method and device based on optical flow fusion type deep neural network
CN104867111B (en) A kind of blind deblurring method of non-homogeneous video based on piecemeal fuzzy core collection
CN110120065B (en) Target tracking method and system based on hierarchical convolution characteristics and scale self-adaptive kernel correlation filtering
US20200349391A1 (en) Method for training image generation network, electronic device, and storage medium
Li et al. Data priming network for automatic check-out
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
TW202042176A (en) Method, device and electronic equipment for image generation network training and image processing
CN112819853B (en) Visual odometer method based on semantic priori
CN110610486A (en) Monocular image depth estimation method and device
CN104050685A (en) Moving target detection method based on particle filtering visual attention model
CN107194948B (en) Video significance detection method based on integrated prediction and time-space domain propagation
Zhang et al. Modeling long-and short-term temporal context for video object detection
Liu et al. ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation
Xiang et al. Deep optical flow supervised learning with prior assumptions
CN114170286A (en) Monocular depth estimation method based on unsupervised depth learning
Zhang et al. Multi-frame pyramid refinement network for video frame interpolation
CN114140469A (en) Depth hierarchical image semantic segmentation method based on multilayer attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant