CN109993096A - Optical flow multilayer frame feature propagation and aggregation method for video object detection - Google Patents

Optical flow multilayer frame feature propagation and aggregation method for video object detection

Info

Publication number
CN109993096A
CN109993096A
Authority
CN
China
Prior art keywords
feature
frame
network
optical flow
layer
Prior art date: 2019-03-26
Legal status
Granted
Application number
CN201910230235.2A
Other languages
Chinese (zh)
Other versions
CN109993096B (en)
Inventor
张斌
柳波
郭军
刘晨
张娅杰
刘文凤
王馨悦
王嘉怡
李薇
陈文博
侯帅
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date: 2019-03-26
Filing date: 2019-03-26
Publication date: 2019-07-09
Application filed by Northeastern University China
Priority to CN201910230235.2A
Publication of CN109993096A
Application granted
Publication of CN109993096B
Status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 — Scenes; Scene-specific elements
    • G06V 20/40 — Scenes; Scene-specific elements in video content
    • G06V 20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 — Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 — Target detection


Abstract

The present invention provides an optical flow multilayer frame feature propagation and aggregation method for video object detection, and relates to the technical field of computer vision. The method first extracts the multilayer features of adjacent frames with a feature network and extracts the optical flow with an optical flow network; it then uses the optical flow to propagate the multilayer frame-level features of the frame before and the frame after the current frame to the current frame, upsampling or downsampling the flow for layers with different strides, which yields multilayer propagated features. The propagated features of each layer are then aggregated layer by layer, and the frame-level features of the final multilayer aggregation are used for the final video object detection. With the optical flow multilayer frame feature propagation and aggregation method for video object detection provided by the present invention, the output frame-level aggregated features combine the advantages of the high resolution of shallow network layers and the high-dimensional semantic features of deep network layers, which improves detection performance; in particular, the multilayer feature aggregation method improves the detection performance on small objects.

Description

Optical flow multilayer frame feature propagation and aggregation method for video object detection
Technical field
The present invention relates to the technical field of computer vision, and more particularly to an optical flow multilayer frame feature propagation and aggregation method for video object detection.
Background art
At present, video object detection methods at home and abroad fall mainly into two classes: frame-level methods and optical-flow-based feature-level methods. In recent years, researchers have focused on the high-level semantic features extracted by deep neural networks, modeling the motion information between video frames with optical flow and using the inter-frame flow to propagate the features of adjacent frames to the current frame in order to predict or enhance the current frame's features. The advantages of this approach are a clear idea, simplicity and effectiveness, and end-to-end trainable models. Although optical flow can be used for the spatial transformation of features, propagating inter-frame features with optical flow information introduces errors. For example, DFF and FGFA propagate features extracted by res5, the last residual block of the residual network; because the optical flow network has errors, local features become misaligned, which causes two problems. First, the features extracted by res5 have low resolution and a high semantic level, and each pixel contains very rich semantic information; if detection is performed directly on these erroneous propagated features, or after aggregation, without any method of correcting the erroneous pixels, detection performance is directly affected. Second, each pixel of the res5 features has a large receptive field on the original image; some small objects in videos are below 64 × 64 resolution and correspond to less than 4 × 4 in the res5 feature map, so the error of a single pixel affects the detection of these small objects far more than the detection of large objects above 150 × 150 resolution. In the field of image object detection, features from multiple layers of the feature network are often used for detection simultaneously, known as a feature pyramid, to improve detection accuracy, especially for small objects; typical methods include SSD and FPN. These methods demonstrate that the features of different levels of the feature network have complementary advantages, and that joint multilayer detection effectively improves detection accuracy.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above shortcomings of the prior art, to provide an optical flow multilayer frame feature propagation and aggregation method for video object detection that realizes the propagation and aggregation of optical-flow-warped features.
In order to solve the above technical problem, the technical solution adopted by the present invention is an optical flow multilayer frame feature propagation and aggregation method for video object detection, comprising two parts: an optical-flow-based multilayer frame-level feature extraction and propagation process, and a frame-level feature aggregation process based on multilayer propagated features;
The optical-flow-based multilayer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extract the multilayer features of adjacent video frames;
The residual network ResNet-101 is used as the feature network for extracting frame-level features. ResNet-101 has different strides at different layers; the output stride of the last three layers of residual block res5 is modified to 16, and a dilated convolution layer is appended at the end of the network to reduce the dimensionality of the features output by res5;
Step S2: extract the optical flow of the video with the FlowNet optical flow network, and post-process the flow so that its size matches the differently sized features of each layer of the feature network;
Step S2.1: extract the optical flow of the video with the Simple version of the FlowNet network; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the flow;
Step S2.2: to match the feature sizes, upsample and downsample the optical flow;
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is

$$M^{8}_{i\to i-t}=\mathcal{F}(I_i,\,I_{i-t})$$

where $M^{8}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$, the superscript 8 indicates a stride of 8, and $\mathcal{F}$ denotes the optical flow network FlowNet;
Step S2.2.2: upsample the optical flow to obtain flow corresponding to features with stride 4:

$$M^{4}_{i\to i-t}=\mathrm{upSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{4}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 4 and $\mathrm{upSample}(\cdot)$ denotes the nearest-neighbor upsampling function;
Step S2.2.3: downsample the optical flow to obtain flow corresponding to features with stride 16:

$$M^{16}_{i\to i-t}=\mathrm{downSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{16}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 16 and $\mathrm{downSample}(\cdot)$ denotes average-pooling downsampling;
Step S2.2.4: if $M^{8}_{i\to i-t}\in\mathbb{R}^{C\times H\times W}$, then correspondingly $M^{4}_{i\to i-t}\in\mathbb{R}^{C\times 2H\times 2W}$ and $M^{16}_{i\to i-t}\in\mathbb{R}^{C\times\frac{H}{2}\times\frac{W}{2}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the flow; this yields optical flow suitable for multilayer feature propagation:

$$M^{s}_{i\to i-t},\qquad s\in\{4,8,16\}$$

where s denotes the feature stride;
Step S3: propagate the multilayer frame-level features of frame i−t and frame i+t to frame i using the optical flow, obtaining the multilayer propagated features $f^{\,l}_{i-t\to i}$ and $f^{\,l}_{i+t\to i}$;
Given the multi-stride optical flow $M^{s}_{i\to i-t}$, the number of propagated layers l and the frame image $I_{i-t}$, the final propagated feature is computed as

$$f^{\,l}_{i-t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i-t}),\,M^{s}_{i\to i-t}\right)$$

where l denotes the layer index, $l\in(1,n)$, n is the total number of layers of the feature network, $N^{\,l}_{feat}$ denotes the output of layer l of the feature network, and $\mathcal{W}$ denotes the warp mapping function, which maps the value at position p in the frame feature $f_{i-t}$ to the corresponding position p+δp of the current frame i, δp denoting the positional offset;
The multilayer propagated features of frame i+t are then computed as

$$f^{\,l}_{i+t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i+t}),\,M^{s}_{i\to i+t}\right)$$
The frame-level feature aggregation process based on multilayer propagated features comprises the following steps:
Step C1: from the propagated first-layer features $f^{\,1}_{i-t\to i}$, $f^{\,1}_{i+t\to i}$ and the current-frame feature $f^{\,1}_{i}$, obtain the aggregated feature of the first layer of the feature network:

$$\bar f^{\,1}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,1}_{j\to i}\,f^{\,1}_{j\to i},\qquad f^{\,1}_{i\to i}=f^{\,1}_{i}$$

where $\bar f^{\,1}_{i}$ is the aggregated feature of the first layer of the network and $\hat w^{\,1}_{j\to i}$ is the scaled cosine similarity weight for aggregating first-layer features;
Step C2: feed the aggregated feature $\bar f^{\,1}_{i}$ of step C1 into the second layer of the feature network as the current-frame feature to obtain the feature $f^{\,2}_{i}=N^{\,2}_{feat}(\bar f^{\,1}_{i})$; simultaneously obtain the propagated second-layer features $f^{\,2}_{j\to i}$ of the adjacent frames, and aggregate again to obtain the aggregated feature of the second layer of the feature network:

$$\bar f^{\,2}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,2}_{j\to i}\,f^{\,2}_{j\to i}$$

where $\bar f^{\,2}_{i}$ is the aggregated feature of the second layer of the network and $\hat w^{\,2}_{j\to i}$ is the scaled cosine similarity weight for aggregating second-layer features;
Step C3: repeat the above aggregation process, aggregating the frame-level features of each layer of the network layer by layer and feeding the aggregated feature output by one layer into the next layer as its current-frame feature, until the aggregated feature of the last layer of the feature network is obtained:

$$\bar f^{\,n}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,n}_{j\to i}\,f^{\,n}_{j\to i}$$

where $\bar f^{\,n}_{i}$ is the aggregated feature of layer n of the network, $\hat w^{\,n}_{j\to i}$ is the scaled cosine similarity weight for aggregating layer-n features, and n is the total number of layers of the feature network;
The aggregated feature $\bar f^{\,n}_{i}$ of layer n of the feature network is the feature finally used for video object detection; $\bar f^{\,n}_{i}$ aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network;
The scaled cosine similarity weights for aggregating layer-n features are computed as follows:
(1) model the quality distribution of the optical flow with cosine similarity weights;
A shallow mapping network $\mathcal{E}$ maps the features to a dimension dedicated to computing similarity:

$$f^{e}_{i}=\mathcal{E}(f_{i}),\qquad f^{e}_{i-t\to i}=\mathcal{E}(f_{i-t\to i})\tag{13}$$

where $f^{e}_{i}$ and $f^{e}_{i-t\to i}$ are the features $f_{i}$ and $f_{i-t\to i}$ after mapping and $\mathcal{E}$ is the mapping network;
Given the current-frame feature $f_{i}$ and the feature $f_{i-t\to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is

$$w_{i-t\to i}(p)=\frac{f^{e}_{i-t\to i}(p)\cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t\to i}(p)\right|\left|f^{e}_{i}(p)\right|}\tag{14}$$

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of dimension W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train;
(2) extract a scaling factor directly from the appearance features of the video frames to model the quality distribution of the frames, obtaining the frame-level scaled cosine similarity weights that serve as the frame-level aggregation weights;
Given the current-frame feature $f_{i}$ and the propagated feature $f_{i-t\to i}$ of frame i−t, the weight scaling factor output by the weight scaling network $\mathcal{S}$ is

$$\lambda_{i-t}=\mathcal{S}(f_{i-t\to i})\tag{15}$$

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t\to i}$ is a two-dimensional planar matrix, the two are combined by channel-level multiplication to obtain pixel-level weights; for each channel c of the scaled output weight, the pixel value at each spatial position p is computed as

$$\hat w^{\,c}_{i-t\to i}(p)=\lambda^{c}_{i-t}\otimes w_{i-t\to i}(p)\tag{16}$$

where ⊗ denotes channel-level multiplication;
The scaled cosine similarity weights are obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated features of frame i+t is

$$\hat w^{\,c}_{i+t\to i}(p)=\lambda^{c}_{i+t}\otimes w_{i+t\to i}(p)\tag{17}$$

The weights are normalized across frames at each position p so that $\sum_{j=i-t}^{i+t}\hat w_{j\to i}(p)=1$; the normalization is performed with the SoftMax function;
The mapping network and the weight scaling network share their first two layers: after the 1024-dimensional output of ResNet-101, two consecutive convolution layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied, and two branch subnetworks are then attached. The first branch is a 1 × 1 convolution that serves as the mapping network and outputs the mapped features $f^{e}_{i}$ and $f^{e}_{i-t\to i}$. The second branch is likewise a 1 × 1 convolution, followed by a global average pooling layer, serving as the weight scaling network; it produces a 1024-dimensional feature vector whose entries correspond to the channels of the feature vector output by ResNet-101 and measure the importance of the feature, controlling the scaling scale of the temporal feature aggregation weights.
The beneficial effects of adopting the above technical solution are as follows. In the optical flow multilayer frame feature propagation and aggregation method for video object detection provided by the present invention, features are propagated on the shallow outputs of the feature network (the res3 and res4 layers): on the one hand, the shallow layers have high resolution, which gives greater error tolerance for small objects during feature propagation; on the other hand, propagation errors in the shallow layers can be weakened, and even gradually corrected, by the subsequent network. Features are then propagated simultaneously in the shallow and deep layers of the feature network, and the deep and shallow features are aggregated, which exploits the high-level semantic features of the deep network while retaining the high resolution of the shallow features. The output frame-level aggregated features therefore combine the advantages of the high resolution of shallow layers and the high-dimensional semantic features of deep layers, improving detection performance; in particular, the multilayer feature aggregation method improves the detection performance on small objects.
Brief description of the drawings
Fig. 1 is a flowchart of the optical flow multilayer frame feature propagation and aggregation method for video object detection provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the optical-flow-based multilayer feature propagation and aggregation process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the FlowNet network structure (Simple version) provided by an embodiment of the present invention;
Fig. 4 is a comparison of the detection performance of different network layers provided by an embodiment of the present invention;
Fig. 5 is a histogram of the ground-truth box area distribution of the ImageNet VID validation set and its grouping, provided by an embodiment of the present invention.
Specific embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
This embodiment takes the video dataset ImageNet VID as an example and uses the optical flow multilayer frame feature propagation and aggregation method of the present invention to validate on this video data;
An optical flow multilayer frame feature propagation and aggregation method for video object detection, as shown in Fig. 1 and Fig. 2, comprises two parts: an optical-flow-based multilayer frame-level feature extraction and propagation process, and a frame-level feature aggregation process based on multilayer propagated features;
The optical-flow-based multilayer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extract the multilayer features of adjacent video frames;
The residual network ResNet-101 is used as the feature network for extracting frame-level features. With reference to the R-FCN network, the output stride of the last three layers of residual block res5 is modified to 16, and a dilated convolution layer is appended at the end of the network to reduce the dimensionality of the features output by res5;
In this embodiment, the modified ResNet-101 is used as the feature network for extracting frame-level features; the detailed stride and spatial-scale statistics of each layer are shown in Table 1. ResNet-101 has different strides at different layers of the network; the output stride of the last three layers of res5 is modified to 16, and a dilated convolution layer feat_conv_3×3_relu with dilate = 6, kernel = 3, pad = 6 and num_filters = 1024 is added.
Table 1  Stride statistics of each ResNet-101 layer

Number | Layer of ResNet-101 | Stride | Size
1 | res2a_relu | 4 | 1/4
2 | res2b_relu | 4 | 1/4
3 | res2c_relu | 4 | 1/4
4 | res3a_relu | 8 | 1/8
5 | res3b1_relu | 8 | 1/8
6 | res3b2_relu | 8 | 1/8
7 | res3b3_relu | 8 | 1/8
8 | res4a_relu | 16 | 1/16
9 | res4b1_relu | 16 | 1/16
10 | res4b2_relu | 16 | 1/16
… | … | … | …
30 | res4b22_relu | 16 | 1/16
31 | res5a_relu | 16 | 1/16
32 | res5b_relu | 16 | 1/16
33 | feat_conv_3×3_relu | 16 | 1/16
Owing to the architectural characteristics of residual networks, this embodiment counts only the output layers of the residual modules; internal layers are not counted and are not used for feature propagation. Number denotes the index of the corresponding network layer, Layers lists all network-layer outputs of ResNet-101 except the first two layers, stride denotes the feature stride of the corresponding layer's output, and spatial_scale denotes the ratio of the output scale of the corresponding layer to the original image scale. In this embodiment, the res2b_relu, res3b3_relu, res4b22_relu and feat_conv_3×3_relu layers are used for multilayer feature propagation (a code sketch of such a modified feature network follows).
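By way of illustration only (this sketch is not part of the claimed method), the modified feature network described above can be approximated in PyTorch as follows. It assumes torchvision's ResNet-101 as a stand-in for the original implementation: the torchvision stage names (layer1–layer4) do not match the res-layer names of Table 1, and the exact dilation placement inside res5 is an assumption.

```python
import torch
import torch.nn as nn
import torchvision

class FeatureNet(nn.Module):
    """Sketch of the modified ResNet-101 feature network of step S1."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        # Keep res5 (torchvision's layer4) at stride 16: replace its stride-2
        # convolutions with stride-1 convolutions and dilate the 3x3 kernels.
        backbone.layer4[0].conv2.stride = (1, 1)
        backbone.layer4[0].downsample[0].stride = (1, 1)
        for block in backbone.layer4:
            block.conv2.dilation = (2, 2)
            block.conv2.padding = (2, 2)
        self.stages = nn.ModuleList([
            nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                          backbone.maxpool, backbone.layer1),  # stride 4
            backbone.layer2,                                   # stride 8
            backbone.layer3,                                   # stride 16
            backbone.layer4,                                   # stride 16 (modified)
        ])
        # Dilated convolution appended at the end of the network
        # (dilate=6, kernel=3, pad=6, 1024 filters, as stated above).
        self.feat_conv = nn.Sequential(
            nn.Conv2d(2048, 1024, kernel_size=3, padding=6, dilation=6),
            nn.ReLU(inplace=True))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        feats.append(self.feat_conv(x))
        return feats  # multilayer features used for propagation
```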
Step S2: extract the optical flow of the video with the FlowNet optical flow network, and post-process the flow so that its size matches the differently sized features of each layer of the feature network;
Step S2.1: extract the optical flow of the video with the Simple version of the FlowNet network, as shown in Fig. 3; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the flow;
The FlowNet network extracts features of the two frames containing high-dimensional semantic information through a downsampling CNN;
First, an average pooling layer with window size 2 × 2 and stride 2 halves the size of the input picture; nine consecutive convolution layers then raise the level of feature abstraction, while the feature size shrinks to 1/32 of the original;
The output feature map of the downsampling CNN is highly semantic but has low resolution; relative to the original image, the feature map has lost much of the detailed information between the images, and the optical flow learned from such features is very poor. FlowNet therefore introduces a refinement module after the downsampling CNN to raise the feature resolution and learn high-quality inter-frame optical flow;
The refinement module is based on the FCN idea and uses FCN-like deconvolution operations to raise the feature resolution while the output features of earlier layers supplement the lost detail, finally outputting a two-channel optical flow. The network structure of the refinement module is as follows: a deconvolution first doubles the feature-map size, and the result is concatenated along the channel dimension with the output feature map of the corresponding convolution layer in the downsampling CNN to serve as the next layer's input; the subsequent stages are essentially the same, except that each later stage additionally learns an optical flow of the corresponding size with a flow branch and concatenates this flow along the channel dimension to the output feature map, which continues as the next layer's input (the channel-wise input convention of step S2.1 is sketched below);
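As a small illustrative sketch (an assumption-level example, not the patent's code), the channel-wise concatenation of step S2.1 looks as follows; `flownet` is assumed to be any module mapping a 6-channel frame pair to a 2-channel, stride-8 flow field:

```python
import torch
import torch.nn as nn

def extract_flow(flownet: nn.Module, frame_cur: torch.Tensor,
                 frame_adj: torch.Tensor) -> torch.Tensor:
    # Concatenate the two adjacent frames directly along the channel
    # dimension: (N, 3, H, W) + (N, 3, H, W) -> (N, 6, H, W).
    pair = torch.cat([frame_cur, frame_adj], dim=1)
    return flownet(pair)  # (N, 2, H/8, W/8): stride-8 optical flow
```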
Step S2.2: to match the feature sizes, upsample and downsample the optical flow (a resampling sketch follows after step S2.2.4);
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is

$$M^{8}_{i\to i-t}=\mathcal{F}(I_i,\,I_{i-t})$$

where $M^{8}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$, the superscript 8 indicates a stride of 8, and $\mathcal{F}$ denotes the optical flow network FlowNet;
Step S2.2.2: upsample the optical flow to obtain flow corresponding to features with stride 4:

$$M^{4}_{i\to i-t}=\mathrm{upSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{4}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 4 and $\mathrm{upSample}(\cdot)$ denotes the nearest-neighbor upsampling function;
Step S2.2.3: downsample the optical flow to obtain flow corresponding to features with stride 16:

$$M^{16}_{i\to i-t}=\mathrm{downSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{16}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 16 and $\mathrm{downSample}(\cdot)$ denotes average-pooling downsampling;
Step S2.2.4: if $M^{8}_{i\to i-t}\in\mathbb{R}^{C\times H\times W}$, then correspondingly $M^{4}_{i\to i-t}\in\mathbb{R}^{C\times 2H\times 2W}$ and $M^{16}_{i\to i-t}\in\mathbb{R}^{C\times\frac{H}{2}\times\frac{W}{2}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the flow; this yields optical flow suitable for multilayer feature propagation:

$$M^{s}_{i\to i-t},\qquad s\in\{4,8,16\}$$

where s denotes the feature stride;
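The flow post-processing of step S2.2 can be sketched as below. One explicit assumption is made that the text does not state: when the flow map is resized between strides, the displacement values are rescaled by the same spatial factor so that they remain in units of the target feature grid.

```python
import torch
import torch.nn.functional as F

def flow_for_stride(flow8: torch.Tensor, stride: int) -> torch.Tensor:
    """Convert a stride-8 flow field to the flow of a given feature stride."""
    if stride == 8:
        return flow8
    if stride == 4:
        # Nearest-neighbor upsampling (step S2.2.2); the factor-2 rescaling
        # of the displacement values is an assumption, see the lead-in above.
        return F.interpolate(flow8, scale_factor=2, mode="nearest") * 2.0
    if stride == 16:
        # Average-pooling downsampling (step S2.2.3); the factor-0.5
        # rescaling is the same assumption.
        return F.avg_pool2d(flow8, kernel_size=2, stride=2) * 0.5
    raise ValueError("stride must be one of {4, 8, 16}")
```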
Step S3: propagate the multilayer frame-level features of frame i−t and frame i+t to frame i using the optical flow, obtaining the multilayer propagated features $f^{\,l}_{i-t\to i}$ and $f^{\,l}_{i+t\to i}$;
In this embodiment, in order to propagate multilayer features, the same optical flow is used for every layer with the same stride; for example, the features of all layers from res4a_relu to the dilated convolution layer feat_conv_3×3_relu are propagated with the stride-16 flow.
Given the multi-stride optical flow $M^{s}_{i\to i-t}$, the number of propagated layers l and the frame image $I_{i-t}$, the final propagated feature is computed as

$$f^{\,l}_{i-t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i-t}),\,M^{s}_{i\to i-t}\right)$$

where l denotes the layer index, $l\in(1,n)$, n is the total number of layers of the feature network (corresponding to the Number column of Table 1), $N^{\,l}_{feat}$ denotes the output of layer l of the feature network, and $\mathcal{W}$ denotes the warp mapping function, which maps the value at position p in the frame feature $f_{i-t}$ to the corresponding position p+δp of the current frame i, δp denoting the positional offset;
The multilayer propagated features of frame i+t are then computed as

$$f^{\,l}_{i+t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i+t}),\,M^{s}_{i\to i+t}\right)$$

(a bilinear warp sketch follows);
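A common realization of the warp mapping function $\mathcal{W}$ is bilinear sampling; the sketch below uses grid_sample and assumes the two flow channels are ordered (horizontal, vertical), which the text does not specify:

```python
import torch
import torch.nn.functional as F

def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Map feat values at positions p to positions p + delta_p given by flow."""
    n, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(flow.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # p + delta_p
    # Normalize the sampling coordinates to [-1, 1] for grid_sample.
    cx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    cy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack([cx, cy], dim=-1)                   # (N, H, W, 2)
    return F.grid_sample(feat, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```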
The frame-level feature aggregation process based on multilayer propagated features comprises the following steps:
Step C1: from the propagated first-layer features $f^{\,1}_{i-t\to i}$, $f^{\,1}_{i+t\to i}$ and the current-frame feature $f^{\,1}_{i}$, obtain the aggregated feature of the first layer of the feature network:

$$\bar f^{\,1}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,1}_{j\to i}\,f^{\,1}_{j\to i},\qquad f^{\,1}_{i\to i}=f^{\,1}_{i}$$

where $\bar f^{\,1}_{i}$ is the aggregated feature of the first layer of the network and $\hat w^{\,1}_{j\to i}$ is the scaled cosine similarity weight for aggregating first-layer features;
Step C2: feed the aggregated feature $\bar f^{\,1}_{i}$ of step C1 into the second layer of the feature network as the current-frame feature to obtain the feature $f^{\,2}_{i}=N^{\,2}_{feat}(\bar f^{\,1}_{i})$; simultaneously obtain the propagated second-layer features $f^{\,2}_{j\to i}$ of the adjacent frames, and aggregate again to obtain the aggregated feature of the second layer of the feature network:

$$\bar f^{\,2}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,2}_{j\to i}\,f^{\,2}_{j\to i}$$

where $\bar f^{\,2}_{i}$ is the aggregated feature of the second layer of the network and $\hat w^{\,2}_{j\to i}$ is the scaled cosine similarity weight for aggregating second-layer features;
Step C3: repeat the above aggregation process, aggregating the frame-level features of each layer of the network layer by layer and feeding the aggregated feature output by one layer into the next layer as its current-frame feature, until the aggregated feature of the last layer of the feature network is obtained:

$$\bar f^{\,n}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,n}_{j\to i}\,f^{\,n}_{j\to i}$$

where $\bar f^{\,n}_{i}$ is the aggregated feature of layer n of the network, $\hat w^{\,n}_{j\to i}$ is the scaled cosine similarity weight for aggregating layer-n features, and n is the total number of layers of the feature network;
The aggregated feature $\bar f^{\,n}_{i}$ of layer n of the feature network is the feature finally used for video object detection; $\bar f^{\,n}_{i}$ aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network, significantly enhancing the representation ability of the current-frame feature (the layer-by-layer loop is sketched below).
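The layer-by-layer aggregation of steps C1 to C3 reduces to a short loop; the sketch below assumes `stages` are the feature-network layers (for example, the FeatureNet.stages above), `props[l]` holds the propagated layer-l features of the adjacent frames, and `weights[l]` the matching SoftMax-normalized aggregation weights:

```python
def aggregate_layers(stages, frame_cur, props, weights):
    x = frame_cur
    for l, stage in enumerate(stages):
        f_cur = stage(x)                    # current-frame feature of layer l
        agg = weights[l]["cur"] * f_cur     # weighted current-frame term
        for f_prop, w in zip(props[l], weights[l]["adj"]):
            agg = agg + w * f_prop          # add weighted propagated terms
        x = agg                             # feed the aggregate into the next layer
    return x                                # multilayer-aggregated frame feature
```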
The scaled cosine similarity weights for aggregating layer-n features are computed as follows:
(1) model the quality distribution of the optical flow with cosine similarity weights;
A shallow mapping network $\mathcal{E}$ maps the features to a dimension dedicated to computing similarity:

$$f^{e}_{i}=\mathcal{E}(f_{i}),\qquad f^{e}_{i-t\to i}=\mathcal{E}(f_{i-t\to i})\tag{13}$$

where $f^{e}_{i}$ and $f^{e}_{i-t\to i}$ are the features $f_{i}$ and $f_{i-t\to i}$ after mapping and $\mathcal{E}$ is the mapping network;
Given the current-frame feature $f_{i}$ and the feature $f_{i-t\to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is

$$w_{i-t\to i}(p)=\frac{f^{e}_{i-t\to i}(p)\cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t\to i}(p)\right|\left|f^{e}_{i}(p)\right|}\tag{14}$$

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of dimension W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train (a sketch of formula (14) follows);
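Formula (14) maps directly onto a per-position cosine similarity over the channel dimension, which is what collapses the weight into the W × H matrix described above; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def cosine_weight(fe_cur: torch.Tensor, fe_prop: torch.Tensor) -> torch.Tensor:
    """Position-wise cosine similarity of two mapped features (N, C, H, W)."""
    return F.cosine_similarity(fe_prop, fe_cur, dim=1)  # (N, H, W)
```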
(2) extract a scaling factor directly from the appearance features of the video frames to model the quality distribution of the frames, obtaining the frame-level scaled cosine similarity weights that serve as the frame-level aggregation weights;
Given the current-frame feature $f_{i}$ and the propagated feature $f_{i-t\to i}$ of frame i−t, the weight scaling factor output by the weight scaling network $\mathcal{S}$ is

$$\lambda_{i-t}=\mathcal{S}(f_{i-t\to i})\tag{15}$$

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t\to i}$ is a two-dimensional planar matrix, the two are combined by channel-level multiplication to obtain pixel-level weights; for each channel c of the scaled output weight, the pixel value at each spatial position p is computed as

$$\hat w^{\,c}_{i-t\to i}(p)=\lambda^{c}_{i-t}\otimes w_{i-t\to i}(p)\tag{16}$$

where ⊗ denotes channel-level multiplication;
The scaled cosine similarity weights are obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated features of frame i+t is

$$\hat w^{\,c}_{i+t\to i}(p)=\lambda^{c}_{i+t}\otimes w_{i+t\to i}(p)\tag{17}$$

The weights are normalized across frames at each position p so that $\sum_{j=i-t}^{i+t}\hat w_{j\to i}(p)=1$; the normalization is performed with the SoftMax function (the scaled weights and normalization are sketched below);
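Formulas (15) to (17) and the SoftMax normalization can be sketched together; the tensor shapes here are assumptions for illustration (T adjacent frames plus the current frame stacked along the first axis):

```python
import torch

def scaled_normalized_weights(cos_w: torch.Tensor, lam: torch.Tensor):
    """cos_w: (T, N, H, W) cosine weights; lam: (T, N, C) scaling factors."""
    # Channel-level multiplication of formula (16):
    # (T, N, C, 1, 1) * (T, N, 1, H, W) -> (T, N, C, H, W).
    w = lam[..., None, None] * cos_w.unsqueeze(2)
    # SoftMax across frames so the weights at each position p sum to 1.
    return torch.softmax(w, dim=0)
```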
The mapping network and the weight scaling network share their first two layers: after the 1024-dimensional output of ResNet-101, two consecutive convolution layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied, and two branch subnetworks are then attached. The first branch is a 1 × 1 convolution that serves as the mapping network and outputs the mapped features $f^{e}_{i}$ and $f^{e}_{i-t\to i}$. The second branch is likewise a 1 × 1 convolution, followed by a global average pooling layer, serving as the weight scaling network; it produces a 1024-dimensional feature vector whose entries correspond to the channels of the feature vector output by ResNet-101 and measure the importance of the feature, controlling the scaling scale of the temporal feature aggregation weights (a sketch of this two-branch subnetwork follows).
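A sketch of this shared two-branch subnetwork follows; the 1024-dimensional input and output widths are stated above, while the intermediate channel widths (512, 256) are assumptions:

```python
import torch
import torch.nn as nn

class EmbedScaleNet(nn.Module):
    """Shared trunk with a mapping branch E and a weight scaling branch S."""
    def __init__(self, in_ch: int = 1024):
        super().__init__()
        self.shared = nn.Sequential(            # two shared convolution layers
            nn.Conv2d(in_ch, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.embed = nn.Conv2d(512, 256, kernel_size=1)   # mapping network
        self.scale = nn.Sequential(                       # weight scaling network
            nn.Conv2d(512, in_ch, kernel_size=1),
            nn.AdaptiveAvgPool2d(1))                      # global average pooling

    def forward(self, feat):
        shared = self.shared(feat)
        f_embed = self.embed(shared)           # mapped features for formula (14)
        lam = self.scale(shared).flatten(1)    # (N, 1024) channel scaling factors
        return f_embed, lam
```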
This embodiment tests the outputs of three residual blocks of ResNet-101, i.e., the output res3c_relu of the res3 block, the output res4b22_relu of the res4 block and the output conv_3×3_feat of the res5 block. Layers are sampled once every 5 layers near res3c_relu and once every 3 layers in the res4 block, finally yielding 9 sampled layers for testing, with layer numbers (2, 7, 12, 19, 21, 24, 27, 30, 33); the comparison of mean average precision is shown in Fig. 4. As can be seen from Fig. 4, res4b22_relu achieves the best accuracy, conv_3×3_feat is second, and res3c_relu performs worst. Up to the 17th layer the performance of the earlier layers declines rapidly, the mean-average-precision gap between the later layers narrows, and detection accuracy reaches its peak at layer 30. This confirms that propagating features at shallower layers outperforms propagation at the deepest layers, but that as the layers become shallower this gain saturates; indeed, as the resolution increases, optical flow prediction becomes harder and overall detection performance declines.
This embodiment is evaluated on the ImageNet VID validation set. The feature propagation layer of FGFA is adjusted so that it serves as the baseline for each level; the test results are shown in Table 2.
Table 2  Accuracy comparison between multilayer aggregation and single-layer feature propagation

The experimental results of Table 2 show that aggregating features propagated from the last layer of res4 (res4b22_relu) outperforms using the last layer of res5 (FGFA); hence propagating features at a shallower layer of the network performs better than at a deeper layer. The same results also show that propagating the features of both res4 and res5 and aggregating them further improves detection performance (72.1 → 73.6, ↑1.5), confirming the improvement in detection accuracy brought by multilayer feature aggregation.
To further demonstrate the improvement of the multilayer feature aggregation method on small-object detection performance, the VID validation set is divided into three groups, small, medium and large, by ground-truth box area, as shown in Fig. 5. The grouping criterion for object size is: an area in (0, 64²) is classed as small, an area in (64², 150²) as medium, and an area greater than 150² as large. This embodiment counts the proportion of each group in the validation set, as shown in Fig. 5. As can be seen from Fig. 5, large objects are the majority (60.0%) in the VID validation set, while small objects are few (13.5%). On these three groups of the ImageNet VID validation set, this embodiment separately tests single deep-layer (last layer of res5) feature propagation, single shallow-layer (last layer of res4) feature propagation, and fused multilayer (last layers of res4 + res5) feature propagation; the test results are shown in Table 3.
Table 3  Detection accuracy of different methods on different object sizes, ImageNet VID validation set

Method | mAP (%) (small) | mAP (%) (medium) | mAP (%) (large)
FGFA (res5) | 26.9 | 51.4 | 83.0
FGFA (res4) | 29.5 | 50.8 | 84.1
FGFA (res4+res5) | 30.1 | 51.9 | 84.5
As Table 3 shows, shallow-layer feature aggregation achieves higher detection performance on small objects than deep-feature aggregation (26.9% → 29.5%, ↑2.6%), indicating that for small-object detection the error of shallow-layer feature propagation has a smaller influence than the error of deep-feature propagation. Aggregating the shallow and deep features simultaneously achieves the best detection performance on all subdivisions of the validation set, showing that fusing deep and shallow features improves detection performance more comprehensively and demonstrating that the multilayer feature aggregation algorithm of the present invention can effectively fuse the respective advantages of multilayer features.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope defined by the claims of the present invention.

Claims (4)

1. An optical flow multilayer frame feature propagation and aggregation method for video object detection, characterized by comprising two parts: an optical-flow-based multilayer frame-level feature extraction and propagation process, and a frame-level feature aggregation process based on multilayer propagated features;
The optical-flow-based multilayer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extract the multilayer features of adjacent video frames;
The residual network ResNet-101 is used as the feature network for extracting frame-level features; ResNet-101 has different strides at different layers; the output stride of the last three layers of residual block res5 is modified to 16, and a dilated convolution layer is appended at the end of the network to reduce the dimensionality of the features output by res5;
Step S2: extract the optical flow of the video with the FlowNet optical flow network, and post-process the flow so that its size matches the differently sized features of each layer of the feature network;
Step S2.1: extract the optical flow of the video with the Simple version of the FlowNet network; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the flow;
Step S2.2: to match the feature sizes, upsample and downsample the optical flow to obtain optical flow suitable for multilayer feature propagation;
Step S3: propagate the multilayer frame-level features of frame i−t and frame i+t to frame i using the optical flow, obtaining the multilayer propagated features $f^{\,l}_{i-t\to i}$ and $f^{\,l}_{i+t\to i}$;
The frame-level feature aggregation process based on multilayer propagated features comprises the following steps:
Step C1: from the propagated first-layer features $f^{\,1}_{i-t\to i}$, $f^{\,1}_{i+t\to i}$ and the current-frame feature $f^{\,1}_{i}$, obtain the aggregated feature of the first layer of the feature network:

$$\bar f^{\,1}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,1}_{j\to i}\,f^{\,1}_{j\to i},\qquad f^{\,1}_{i\to i}=f^{\,1}_{i}$$

where $\bar f^{\,1}_{i}$ is the aggregated feature of the first layer of the network and $\hat w^{\,1}_{j\to i}$ is the scaled cosine similarity weight for aggregating first-layer features;
Step C2: feed the aggregated feature $\bar f^{\,1}_{i}$ of step C1 into the second layer of the feature network as the current-frame feature to obtain the feature $f^{\,2}_{i}=N^{\,2}_{feat}(\bar f^{\,1}_{i})$; simultaneously obtain the propagated second-layer features $f^{\,2}_{j\to i}$ of the adjacent frames, and aggregate again to obtain the aggregated feature of the second layer of the feature network:

$$\bar f^{\,2}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,2}_{j\to i}\,f^{\,2}_{j\to i}$$

where $\bar f^{\,2}_{i}$ is the aggregated feature of the second layer of the network and $\hat w^{\,2}_{j\to i}$ is the scaled cosine similarity weight for aggregating second-layer features;
Step C3: repeat the above aggregation process, aggregating the frame-level features of each layer of the network layer by layer and feeding the aggregated feature output by one layer into the next layer as its current-frame feature, until the aggregated feature of the last layer of the feature network is obtained:

$$\bar f^{\,n}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,n}_{j\to i}\,f^{\,n}_{j\to i}$$

where $\bar f^{\,n}_{i}$ is the aggregated feature of layer n of the network, $\hat w^{\,n}_{j\to i}$ is the scaled cosine similarity weight for aggregating layer-n features, and n is the total number of layers of the feature network;
The aggregated feature $\bar f^{\,n}_{i}$ of layer n of the feature network is the feature finally used for video object detection; $\bar f^{\,n}_{i}$ aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network, significantly enhancing the representation ability of the current-frame feature;
The scaled cosine similarity weights for aggregating layer-n features are computed as follows:
(1) model the quality distribution of the optical flow with cosine similarity weights;
(2) extract a scaling factor from the appearance features of the video frames to model the quality distribution of the frames, obtaining the frame-level scaled cosine similarity weights that serve as the frame-level aggregation weights.
2. The optical flow multilayer frame feature propagation and aggregation method for video object detection according to claim 1, characterized in that the specific method of step S2.2 is:
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is

$$M^{8}_{i\to i-t}=\mathcal{F}(I_i,\,I_{i-t})$$

where $M^{8}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$, the superscript 8 indicates a stride of 8, and $\mathcal{F}$ denotes the optical flow network FlowNet;
Step S2.2.2: upsample the optical flow to obtain flow corresponding to features with stride 4:

$$M^{4}_{i\to i-t}=\mathrm{upSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{4}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 4 and $\mathrm{upSample}(\cdot)$ denotes the nearest-neighbor upsampling function;
Step S2.2.3: downsample the optical flow to obtain flow corresponding to features with stride 16:

$$M^{16}_{i\to i-t}=\mathrm{downSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{16}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 16 and $\mathrm{downSample}(\cdot)$ denotes average-pooling downsampling;
Step S2.2.4: if $M^{8}_{i\to i-t}\in\mathbb{R}^{C\times H\times W}$, then correspondingly $M^{4}_{i\to i-t}\in\mathbb{R}^{C\times 2H\times 2W}$ and $M^{16}_{i\to i-t}\in\mathbb{R}^{C\times\frac{H}{2}\times\frac{W}{2}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the flow; this yields optical flow suitable for multilayer feature propagation:

$$M^{s}_{i\to i-t},\qquad s\in\{4,8,16\}$$

where s denotes the feature stride.
3. The optical flow multilayer frame feature propagation and aggregation method for video object detection according to claim 1, characterized in that the specific method of step S3 is:
Given the multi-stride optical flow $M^{s}_{i\to i-t}$, the number of propagated layers l and the frame image $I_{i-t}$, the final propagated feature is computed as

$$f^{\,l}_{i-t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i-t}),\,M^{s}_{i\to i-t}\right)$$

where l denotes the layer index, $l\in(1,n)$, n is the total number of layers of the feature network, $N^{\,l}_{feat}$ denotes the output of layer l of the feature network, and $\mathcal{W}$ denotes the warp mapping function, which maps the value at position p in the frame feature $f_{i-t}$ to the corresponding position p+δp of the current frame i, δp denoting the positional offset;
The multilayer propagated features of frame i+t are then computed as

$$f^{\,l}_{i+t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i+t}),\,M^{s}_{i\to i+t}\right)$$
4. The optical flow multilayer frame feature propagation and aggregation method for video object detection according to claim 1, characterized in that the specific method of modeling the quality distribution of the optical flow with cosine similarity weights, described in step C3, is:
A shallow mapping network $\mathcal{E}$ maps the features to a dimension dedicated to computing similarity:

$$f^{e}_{i}=\mathcal{E}(f_{i}),\qquad f^{e}_{i-t\to i}=\mathcal{E}(f_{i-t\to i})\tag{13}$$

where $f^{e}_{i}$ and $f^{e}_{i-t\to i}$ are the features $f_{i}$ and $f_{i-t\to i}$ after mapping and $\mathcal{E}$ is the mapping network;
The specific method of extracting a scaling factor from the appearance features of the video frames, modeling the quality distribution of the frames and obtaining the frame-level scaled cosine similarity weights is:
Given the current-frame feature $f_{i}$ and the feature $f_{i-t\to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is

$$w_{i-t\to i}(p)=\frac{f^{e}_{i-t\to i}(p)\cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t\to i}(p)\right|\left|f^{e}_{i}(p)\right|}\tag{14}$$

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of dimension W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train.
Given the current-frame feature $f_{i}$ and the propagated feature $f_{i-t\to i}$ of frame i−t, the weight scaling factor output by the weight scaling network $\mathcal{S}$ is

$$\lambda_{i-t}=\mathcal{S}(f_{i-t\to i})\tag{15}$$

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t\to i}$ is a two-dimensional planar matrix, the two are combined by channel-level multiplication to obtain pixel-level weights; for each channel c of the scaled output weight, the pixel value at each spatial position p is computed as

$$\hat w^{\,c}_{i-t\to i}(p)=\lambda^{c}_{i-t}\otimes w_{i-t\to i}(p)\tag{16}$$

where ⊗ denotes channel-level multiplication;
The scaled cosine similarity weights are obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated features of frame i+t is

$$\hat w^{\,c}_{i+t\to i}(p)=\lambda^{c}_{i+t}\otimes w_{i+t\to i}(p)\tag{17}$$

The weights are normalized across frames at each position p so that $\sum_{j=i-t}^{i+t}\hat w_{j\to i}(p)=1$; the normalization is performed with the SoftMax function;
The mapping network and the weight scaling network share their first two layers: after the 1024-dimensional output of ResNet-101, two consecutive convolution layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied, and two branch subnetworks are then attached. The first branch is a 1 × 1 convolution that serves as the mapping network and outputs the mapped features $f^{e}_{i}$ and $f^{e}_{i-t\to i}$. The second branch is likewise a 1 × 1 convolution, followed by a global average pooling layer, serving as the weight scaling network; it produces a 1024-dimensional feature vector whose entries correspond to the channels of the feature vector output by ResNet-101 and measure the importance of the feature, controlling the scaling scale of the temporal feature aggregation weights.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910230235.2A | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection


Publications (2)

Publication Number | Publication Date
CN109993096A | 2019-07-09
CN109993096B | 2022-12-20

Family

ID=67131468

Family Applications (1)

Application Number | Priority Date | Filing Date | Title | Status
CN201910230235.2A | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection | Active

Country Status (1)

Country | Publication
CN | CN109993096B



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20180268208A1 * | 2017-03-20 | 2018-09-20 | Microsoft Technology Licensing, LLC | Feature flow for video recognition
CN108242062A * | 2017-12-27 | 2018-07-03 | 北京纵目安驰智能科技有限公司 | Target tracking method, system, terminal and medium based on deep feature flow
CN109376611A * | 2018-09-27 | 2019-02-22 | 方玉明 | Video saliency detection method based on 3D convolutional neural networks

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110400305A * | 2019-07-26 | 2019-11-01 | 哈尔滨理工大学 | Object detection method based on deep learning
CN110852199A * | 2019-10-28 | 2020-02-28 | 中国石化销售股份有限公司华南分公司 | Foreground extraction method based on a double-frame encoding-decoding model
JP2022551396A | 2019-11-20 | 2022-12-09 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Action recognition method, apparatus, computer program and computer device
JP7274048B2 | 2019-11-20 | 2023-05-15 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Action recognition method, apparatus, computer program and computer device
US11928893B2 | 2019-11-20 | 2024-03-12 | Tencent Technology (Shenzhen) Company Limited | Action recognition method and apparatus, computer storage medium, and computer device
CN111144376A * | 2019-12-31 | 2020-05-12 | 华南理工大学 | Video object detection feature extraction method
CN111144376B * | 2019-12-31 | 2023-12-05 | 华南理工大学 | Video object detection feature extraction method
CN113673545A * | 2020-05-13 | 2021-11-19 | 华为技术有限公司 | Optical flow estimation method, related apparatus, device and computer-readable storage medium
CN112307872A * | 2020-06-12 | 2021-02-02 | 北京京东尚科信息技术有限公司 | Method and device for detecting target objects
CN111860293B * | 2020-07-16 | 2023-12-22 | 中南民族大学 | Remote sensing scene classification method, device, terminal equipment and storage medium
CN111860293A * | 2020-07-16 | 2020-10-30 | 中南民族大学 | Remote sensing scene classification method, device, terminal equipment and storage medium
CN111950612A * | 2020-07-30 | 2020-11-17 | 中国科学院大学 | FPN-based weak and small object detection method with fusion factor
CN112307889A * | 2020-09-22 | 2021-02-02 | 北京航空航天大学 | Face detection algorithm based on a small auxiliary network
CN112307889B * | 2020-09-22 | 2022-07-26 | 北京航空航天大学 | Face detection algorithm based on a small auxiliary network
CN112394356A * | 2020-09-30 | 2021-02-23 | 桂林电子科技大学 | U-Net-based small-target unmanned aerial vehicle detection system and method
CN112394356B * | 2020-09-30 | 2024-04-02 | 桂林电子科技大学 | U-Net-based small-target unmanned aerial vehicle detection system and method
CN111968064A * | 2020-10-22 | 2020-11-20 | 成都睿沿科技有限公司 | Image processing method and device, electronic equipment and storage medium
CN111968064B * | 2020-10-22 | 2021-01-15 | 成都睿沿科技有限公司 | Image processing method and device, electronic equipment and storage medium
CN112966581B * | 2021-02-25 | 2022-05-27 | 厦门大学 | Video object detection method based on internal and external semantic aggregation
CN112966581A * | 2021-02-25 | 2021-06-15 | 厦门大学 | Video object detection method based on internal and external semantic aggregation
CN113223044A * | 2021-04-21 | 2021-08-06 | 西北工业大学 | Infrared video object detection method combining feature aggregation and an attention mechanism
CN113570608B * | 2021-06-30 | 2023-07-21 | 北京百度网讯科技有限公司 | Object segmentation method and device, and electronic equipment
CN113570608A * | 2021-06-30 | 2021-10-29 | 北京百度网讯科技有限公司 | Object segmentation method and device, and electronic equipment

Also Published As

Publication number | Publication date
CN109993096B | 2022-12-20


Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant