CN109993096A - Optical flow multilayer frame feature propagation and aggregation method for video object detection - Google Patents

Optical flow multilayer frame feature propagation and aggregation method for video object detection

Info

Publication number
CN109993096A
CN109993096A
Authority
CN
China
Prior art keywords
feature
frame
network
optical flow
layer
Prior art date: 2019-03-26
Legal status
Granted
Application number
CN201910230235.2A
Other languages
Chinese (zh)
Other versions
CN109993096B (en)
Inventor
张斌
柳波
郭军
刘晨
张娅杰
刘文凤
王馨悦
王嘉怡
李薇
陈文博
侯帅
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date: 2019-03-26
Filing date: 2019-03-26
Publication date: 2019-07-09
Application filed by Northeastern University China
Priority to CN201910230235.2A
Publication of CN109993096A
Application granted
Publication of CN109993096B
Status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 — Scenes; Scene-specific elements
    • G06V 20/40 — Scenes; Scene-specific elements in video content
    • G06V 20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 — Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 — Target detection


Abstract

The present invention provides an optical flow multilayer frame feature propagation and aggregation method for video object detection, and relates to the technical field of computer vision. The method first extracts the multilayer features of adjacent frames with a feature network and extracts the optical flow with an optical flow network; it then uses the optical flow to propagate the multilayer frame-level features of the frame before and the frame after the current frame to the current frame, upsampling or downsampling the flow for layers with different strides, which yields multilayer propagated features. The propagated features of each layer are then aggregated layer by layer, and the frame-level features of the final multilayer aggregation are used for the final video object detection. With the optical flow multilayer frame feature propagation and aggregation method for video object detection provided by the present invention, the output frame-level aggregated features combine the advantages of the high resolution of shallow network layers and the high-dimensional semantic features of deep network layers, which improves detection performance; in particular, the multilayer feature aggregation method improves the detection performance on small objects.

Description

Optical flow multilayer frame feature propagation and aggregation method for video object detection
Technical field
The present invention relates to the technical field of computer vision, and more particularly to an optical flow multilayer frame feature propagation and aggregation method for video object detection.
Background art
At present, video object detection methods at home and abroad fall mainly into two classes: frame-level methods and optical-flow-based feature-level methods. In recent years, researchers have focused on the high-level semantic features extracted by deep neural networks, modeling the motion information between video frames with optical flow and using the inter-frame flow to propagate the features of adjacent frames to the current frame in order to predict or enhance the current frame's features. The advantages of this approach are a clear idea, simplicity and effectiveness, and end-to-end trainable models. Although optical flow can be used for the spatial transformation of features, propagating inter-frame features with optical flow information introduces errors. For example, DFF and FGFA propagate features extracted by res5, the last residual block of the residual network; because the optical flow network has errors, local features become misaligned, which causes two problems. First, the features extracted by res5 have low resolution and a high semantic level, and each pixel contains very rich semantic information; if detection is performed directly on these erroneous propagated features, or after aggregation, without any method of correcting the erroneous pixels, detection performance is directly affected. Second, each pixel of the res5 features has a large receptive field on the original image; some small objects in videos are below 64 × 64 resolution and correspond to less than 4 × 4 in the res5 feature map, so the error of a single pixel affects the detection of these small objects far more than the detection of large objects above 150 × 150 resolution. In the field of image object detection, features from multiple layers of the feature network are often used for detection simultaneously, known as a feature pyramid, to improve detection accuracy, especially for small objects; typical methods include SSD and FPN. These methods demonstrate that the features of different levels of the feature network have complementary advantages, and that joint multilayer detection effectively improves detection accuracy.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above shortcomings of the prior art, to provide an optical flow multilayer frame feature propagation and aggregation method for video object detection that realizes the propagation and aggregation of optical-flow-warped features.
In order to solve the above technical problem, the technical solution adopted by the present invention is an optical flow multilayer frame feature propagation and aggregation method for video object detection, comprising two parts: an optical-flow-based multilayer frame-level feature extraction and propagation process, and a frame-level feature aggregation process based on multilayer propagated features;
The optical-flow-based multilayer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extract the multilayer features of adjacent video frames;
The residual network ResNet-101 is used as the feature network for extracting frame-level features. ResNet-101 has different strides at different layers; the output stride of the last three layers of residual block res5 is modified to 16, and a dilated convolution layer is appended at the end of the network to reduce the dimensionality of the features output by res5;
Step S2: extract the optical flow of the video with the FlowNet optical flow network, and post-process the flow so that its size matches the differently sized features of each layer of the feature network;
Step S2.1: extract the optical flow of the video with the Simple version of the FlowNet network; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the flow;
Step S2.2: to match the feature sizes, upsample and downsample the optical flow;
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is

$$M^{8}_{i\to i-t}=\mathcal{F}(I_i,\,I_{i-t})$$

where $M^{8}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$, the superscript 8 indicates a stride of 8, and $\mathcal{F}$ denotes the optical flow network FlowNet;
Step S2.2.2: upsample the optical flow to obtain flow corresponding to features with stride 4:

$$M^{4}_{i\to i-t}=\mathrm{upSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{4}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 4 and $\mathrm{upSample}(\cdot)$ denotes the nearest-neighbor upsampling function;
Step S2.2.3: downsample the optical flow to obtain flow corresponding to features with stride 16:

$$M^{16}_{i\to i-t}=\mathrm{downSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{16}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 16 and $\mathrm{downSample}(\cdot)$ denotes average-pooling downsampling;
Step S2.2.4: if $M^{8}_{i\to i-t}\in\mathbb{R}^{C\times H\times W}$, then correspondingly $M^{4}_{i\to i-t}\in\mathbb{R}^{C\times 2H\times 2W}$ and $M^{16}_{i\to i-t}\in\mathbb{R}^{C\times\frac{H}{2}\times\frac{W}{2}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the flow; this yields optical flow suitable for multilayer feature propagation:

$$M^{s}_{i\to i-t},\qquad s\in\{4,8,16\}$$

where s denotes the feature stride;
Step S3: propagate the multilayer frame-level features of frame i−t and frame i+t to frame i using the optical flow, obtaining the multilayer propagated features $f^{\,l}_{i-t\to i}$ and $f^{\,l}_{i+t\to i}$;
Given the multi-stride optical flow $M^{s}_{i\to i-t}$, the number of propagated layers l and the frame image $I_{i-t}$, the final propagated feature is computed as

$$f^{\,l}_{i-t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i-t}),\,M^{s}_{i\to i-t}\right)$$

where l denotes the layer index, $l\in(1,n)$, n is the total number of layers of the feature network, $N^{\,l}_{feat}$ denotes the output of layer l of the feature network, and $\mathcal{W}$ denotes the warp mapping function, which maps the value at position p in the frame feature $f_{i-t}$ to the corresponding position p+δp of the current frame i, δp denoting the positional offset;
The multilayer propagated features of frame i+t are then computed as

$$f^{\,l}_{i+t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i+t}),\,M^{s}_{i\to i+t}\right)$$
The frame-level feature aggregation process based on multilayer propagated features comprises the following steps:
Step C1: from the propagated first-layer features $f^{\,1}_{i-t\to i}$, $f^{\,1}_{i+t\to i}$ and the current-frame feature $f^{\,1}_{i}$, obtain the aggregated feature of the first layer of the feature network:

$$\bar f^{\,1}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,1}_{j\to i}\,f^{\,1}_{j\to i},\qquad f^{\,1}_{i\to i}=f^{\,1}_{i}$$

where $\bar f^{\,1}_{i}$ is the aggregated feature of the first layer of the network and $\hat w^{\,1}_{j\to i}$ is the scaled cosine similarity weight for aggregating first-layer features;
Step C2: feed the aggregated feature $\bar f^{\,1}_{i}$ of step C1 into the second layer of the feature network as the current-frame feature to obtain the feature $f^{\,2}_{i}=N^{\,2}_{feat}(\bar f^{\,1}_{i})$; simultaneously obtain the propagated second-layer features $f^{\,2}_{j\to i}$ of the adjacent frames, and aggregate again to obtain the aggregated feature of the second layer of the feature network:

$$\bar f^{\,2}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,2}_{j\to i}\,f^{\,2}_{j\to i}$$

where $\bar f^{\,2}_{i}$ is the aggregated feature of the second layer of the network and $\hat w^{\,2}_{j\to i}$ is the scaled cosine similarity weight for aggregating second-layer features;
Step C3: repeat the above aggregation process, aggregating the frame-level features of each layer of the network layer by layer and feeding the aggregated feature output by one layer into the next layer as its current-frame feature, until the aggregated feature of the last layer of the feature network is obtained:

$$\bar f^{\,n}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,n}_{j\to i}\,f^{\,n}_{j\to i}$$

where $\bar f^{\,n}_{i}$ is the aggregated feature of layer n of the network, $\hat w^{\,n}_{j\to i}$ is the scaled cosine similarity weight for aggregating layer-n features, and n is the total number of layers of the feature network;
The aggregated feature $\bar f^{\,n}_{i}$ of layer n of the feature network is the feature finally used for video object detection; $\bar f^{\,n}_{i}$ aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network;
The scaled cosine similarity weights for aggregating layer-n features are computed as follows:
(1) model the quality distribution of the optical flow with cosine similarity weights;
A shallow mapping network $\mathcal{E}$ maps the features to a dimension dedicated to computing similarity:

$$f^{e}_{i}=\mathcal{E}(f_{i}),\qquad f^{e}_{i-t\to i}=\mathcal{E}(f_{i-t\to i})\tag{13}$$

where $f^{e}_{i}$ and $f^{e}_{i-t\to i}$ are the features $f_{i}$ and $f_{i-t\to i}$ after mapping and $\mathcal{E}$ is the mapping network;
Given the current-frame feature $f_{i}$ and the feature $f_{i-t\to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is

$$w_{i-t\to i}(p)=\frac{f^{e}_{i-t\to i}(p)\cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t\to i}(p)\right|\left|f^{e}_{i}(p)\right|}\tag{14}$$

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of dimension W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train;
(2) extract a scaling factor directly from the appearance features of the video frames to model the quality distribution of the frames, obtaining the frame-level scaled cosine similarity weights that serve as the frame-level aggregation weights;
Given the current-frame feature $f_{i}$ and the propagated feature $f_{i-t\to i}$ of frame i−t, the weight scaling factor output by the weight scaling network $\mathcal{S}$ is

$$\lambda_{i-t}=\mathcal{S}(f_{i-t\to i})\tag{15}$$

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t\to i}$ is a two-dimensional planar matrix, the two are combined by channel-level multiplication to obtain pixel-level weights; for each channel c of the scaled output weight, the pixel value at each spatial position p is computed as

$$\hat w^{\,c}_{i-t\to i}(p)=\lambda^{c}_{i-t}\otimes w_{i-t\to i}(p)\tag{16}$$

where ⊗ denotes channel-level multiplication;
The scaled cosine similarity weights are obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated features of frame i+t is

$$\hat w^{\,c}_{i+t\to i}(p)=\lambda^{c}_{i+t}\otimes w_{i+t\to i}(p)\tag{17}$$

The weights are normalized across frames at each position p so that $\sum_{j=i-t}^{i+t}\hat w_{j\to i}(p)=1$; the normalization is performed with the SoftMax function;
The mapping network and the weight scaling network share their first two layers: after the 1024-dimensional output of ResNet-101, two consecutive convolution layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied, and two branch subnetworks are then attached. The first branch is a 1 × 1 convolution that serves as the mapping network and outputs the mapped features $f^{e}_{i}$ and $f^{e}_{i-t\to i}$. The second branch is likewise a 1 × 1 convolution, followed by a global average pooling layer, serving as the weight scaling network; it produces a 1024-dimensional feature vector whose entries correspond to the channels of the feature vector output by ResNet-101 and measure the importance of the feature, controlling the scaling scale of the temporal feature aggregation weights.
The beneficial effects of adopting the above technical solution are as follows. In the optical flow multilayer frame feature propagation and aggregation method for video object detection provided by the present invention, features are propagated on the shallow outputs of the feature network (the res3 and res4 layers): on the one hand, the shallow layers have high resolution, which gives greater error tolerance for small objects during feature propagation; on the other hand, propagation errors in the shallow layers can be weakened, and even gradually corrected, by the subsequent network. Features are then propagated simultaneously in the shallow and deep layers of the feature network, and the deep and shallow features are aggregated, which exploits the high-level semantic features of the deep network while retaining the high resolution of the shallow features. The output frame-level aggregated features therefore combine the advantages of the high resolution of shallow layers and the high-dimensional semantic features of deep layers, improving detection performance; in particular, the multilayer feature aggregation method improves the detection performance on small objects.
Brief description of the drawings
Fig. 1 is a flowchart of the optical flow multilayer frame feature propagation and aggregation method for video object detection provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the optical-flow-based multilayer feature propagation and aggregation process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the FlowNet network structure (Simple version) provided by an embodiment of the present invention;
Fig. 4 is a comparison of the detection performance of different network layers provided by an embodiment of the present invention;
Fig. 5 is a histogram of the ground-truth box area distribution of the ImageNet VID validation set and its grouping, provided by an embodiment of the present invention.
Specific embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
This embodiment takes the video dataset ImageNet VID as an example and uses the optical flow multilayer frame feature propagation and aggregation method of the present invention to validate on this video data;
An optical flow multilayer frame feature propagation and aggregation method for video object detection, as shown in Fig. 1 and Fig. 2, comprises two parts: an optical-flow-based multilayer frame-level feature extraction and propagation process, and a frame-level feature aggregation process based on multilayer propagated features;
The optical-flow-based multilayer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extract the multilayer features of adjacent video frames;
The residual network ResNet-101 is used as the feature network for extracting frame-level features. With reference to the R-FCN network, the output stride of the last three layers of residual block res5 is modified to 16, and a dilated convolution layer is appended at the end of the network to reduce the dimensionality of the features output by res5;
In this embodiment, the modified ResNet-101 is used as the feature network for extracting frame-level features; the detailed stride and spatial-scale statistics of each layer are shown in Table 1. ResNet-101 has different strides at different layers of the network; the output stride of the last three layers of res5 is modified to 16, and a dilated convolution layer feat_conv_3×3_relu with dilate = 6, kernel = 3, pad = 6 and num_filters = 1024 is added.
Table 1  Stride statistics of each ResNet-101 layer

Number | Layer of ResNet-101 | Stride | Size
1 | res2a_relu | 4 | 1/4
2 | res2b_relu | 4 | 1/4
3 | res2c_relu | 4 | 1/4
4 | res3a_relu | 8 | 1/8
5 | res3b1_relu | 8 | 1/8
6 | res3b2_relu | 8 | 1/8
7 | res3b3_relu | 8 | 1/8
8 | res4a_relu | 16 | 1/16
9 | res4b1_relu | 16 | 1/16
10 | res4b2_relu | 16 | 1/16
… | … | … | …
30 | res4b22_relu | 16 | 1/16
31 | res5a_relu | 16 | 1/16
32 | res5b_relu | 16 | 1/16
33 | feat_conv_3×3_relu | 16 | 1/16
Owing to the architectural characteristics of residual networks, this embodiment counts only the output layers of the residual modules; internal layers are not counted and are not used for feature propagation. Number denotes the index of the corresponding network layer, Layers lists all network-layer outputs of ResNet-101 except the first two layers, stride denotes the feature stride of the corresponding layer's output, and spatial_scale denotes the ratio of the output scale of the corresponding layer to the original image scale. In this embodiment, the res2b_relu, res3b3_relu, res4b22_relu and feat_conv_3×3_relu layers are used for multilayer feature propagation (a code sketch of such a modified feature network follows).
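By way of illustration only (this sketch is not part of the claimed method), the modified feature network described above can be approximated in PyTorch as follows. It assumes torchvision's ResNet-101 as a stand-in for the original implementation: the torchvision stage names (layer1–layer4) do not match the res-layer names of Table 1, and the exact dilation placement inside res5 is an assumption.

```python
import torch
import torch.nn as nn
import torchvision

class FeatureNet(nn.Module):
    """Sketch of the modified ResNet-101 feature network of step S1."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        # Keep res5 (torchvision's layer4) at stride 16: replace its stride-2
        # convolutions with stride-1 convolutions and dilate the 3x3 kernels.
        backbone.layer4[0].conv2.stride = (1, 1)
        backbone.layer4[0].downsample[0].stride = (1, 1)
        for block in backbone.layer4:
            block.conv2.dilation = (2, 2)
            block.conv2.padding = (2, 2)
        self.stages = nn.ModuleList([
            nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                          backbone.maxpool, backbone.layer1),  # stride 4
            backbone.layer2,                                   # stride 8
            backbone.layer3,                                   # stride 16
            backbone.layer4,                                   # stride 16 (modified)
        ])
        # Dilated convolution appended at the end of the network
        # (dilate=6, kernel=3, pad=6, 1024 filters, as stated above).
        self.feat_conv = nn.Sequential(
            nn.Conv2d(2048, 1024, kernel_size=3, padding=6, dilation=6),
            nn.ReLU(inplace=True))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        feats.append(self.feat_conv(x))
        return feats  # multilayer features used for propagation
```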
Step S2: extract the optical flow of the video with the FlowNet optical flow network, and post-process the flow so that its size matches the differently sized features of each layer of the feature network;
Step S2.1: extract the optical flow of the video with the Simple version of the FlowNet network, as shown in Fig. 3; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the flow;
The FlowNet network extracts features of the two frames containing high-dimensional semantic information through a downsampling CNN;
First, an average pooling layer with window size 2 × 2 and stride 2 halves the size of the input picture; nine consecutive convolution layers then raise the level of feature abstraction, while the feature size shrinks to 1/32 of the original;
The output feature map of the downsampling CNN is highly semantic but has low resolution; relative to the original image, the feature map has lost much of the detailed information between the images, and the optical flow learned from such features is very poor. FlowNet therefore introduces a refinement module after the downsampling CNN to raise the feature resolution and learn high-quality inter-frame optical flow;
The refinement module is based on the FCN idea and uses FCN-like deconvolution operations to raise the feature resolution while the output features of earlier layers supplement the lost detail, finally outputting a two-channel optical flow. The network structure of the refinement module is as follows: a deconvolution first doubles the feature-map size, and the result is concatenated along the channel dimension with the output feature map of the corresponding convolution layer in the downsampling CNN to serve as the next layer's input; the subsequent stages are essentially the same, except that each later stage additionally learns an optical flow of the corresponding size with a flow branch and concatenates this flow along the channel dimension to the output feature map, which continues as the next layer's input (the channel-wise input convention of step S2.1 is sketched below);
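As a small illustrative sketch (an assumption-level example, not the patent's code), the channel-wise concatenation of step S2.1 looks as follows; `flownet` is assumed to be any module mapping a 6-channel frame pair to a 2-channel, stride-8 flow field:

```python
import torch
import torch.nn as nn

def extract_flow(flownet: nn.Module, frame_cur: torch.Tensor,
                 frame_adj: torch.Tensor) -> torch.Tensor:
    # Concatenate the two adjacent frames directly along the channel
    # dimension: (N, 3, H, W) + (N, 3, H, W) -> (N, 6, H, W).
    pair = torch.cat([frame_cur, frame_adj], dim=1)
    return flownet(pair)  # (N, 2, H/8, W/8): stride-8 optical flow
```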
Step S2.2: to match the feature sizes, upsample and downsample the optical flow (a resampling sketch follows after step S2.2.4);
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is

$$M^{8}_{i\to i-t}=\mathcal{F}(I_i,\,I_{i-t})$$

where $M^{8}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$, the superscript 8 indicates a stride of 8, and $\mathcal{F}$ denotes the optical flow network FlowNet;
Step S2.2.2: upsample the optical flow to obtain flow corresponding to features with stride 4:

$$M^{4}_{i\to i-t}=\mathrm{upSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{4}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 4 and $\mathrm{upSample}(\cdot)$ denotes the nearest-neighbor upsampling function;
Step S2.2.3: downsample the optical flow to obtain flow corresponding to features with stride 16:

$$M^{16}_{i\to i-t}=\mathrm{downSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{16}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 16 and $\mathrm{downSample}(\cdot)$ denotes average-pooling downsampling;
Step S2.2.4: if $M^{8}_{i\to i-t}\in\mathbb{R}^{C\times H\times W}$, then correspondingly $M^{4}_{i\to i-t}\in\mathbb{R}^{C\times 2H\times 2W}$ and $M^{16}_{i\to i-t}\in\mathbb{R}^{C\times\frac{H}{2}\times\frac{W}{2}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the flow; this yields optical flow suitable for multilayer feature propagation:

$$M^{s}_{i\to i-t},\qquad s\in\{4,8,16\}$$

where s denotes the feature stride;
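The flow post-processing of step S2.2 can be sketched as below. One explicit assumption is made that the text does not state: when the flow map is resized between strides, the displacement values are rescaled by the same spatial factor so that they remain in units of the target feature grid.

```python
import torch
import torch.nn.functional as F

def flow_for_stride(flow8: torch.Tensor, stride: int) -> torch.Tensor:
    """Convert a stride-8 flow field to the flow of a given feature stride."""
    if stride == 8:
        return flow8
    if stride == 4:
        # Nearest-neighbor upsampling (step S2.2.2); the factor-2 rescaling
        # of the displacement values is an assumption, see the lead-in above.
        return F.interpolate(flow8, scale_factor=2, mode="nearest") * 2.0
    if stride == 16:
        # Average-pooling downsampling (step S2.2.3); the factor-0.5
        # rescaling is the same assumption.
        return F.avg_pool2d(flow8, kernel_size=2, stride=2) * 0.5
    raise ValueError("stride must be one of {4, 8, 16}")
```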
Step S3: propagate the multilayer frame-level features of frame i−t and frame i+t to frame i using the optical flow, obtaining the multilayer propagated features $f^{\,l}_{i-t\to i}$ and $f^{\,l}_{i+t\to i}$;
In this embodiment, in order to propagate multilayer features, the same optical flow is used for every layer with the same stride; for example, the features of all layers from res4a_relu to the dilated convolution layer feat_conv_3×3_relu are propagated with the stride-16 flow.
Given the multi-stride optical flow $M^{s}_{i\to i-t}$, the number of propagated layers l and the frame image $I_{i-t}$, the final propagated feature is computed as

$$f^{\,l}_{i-t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i-t}),\,M^{s}_{i\to i-t}\right)$$

where l denotes the layer index, $l\in(1,n)$, n is the total number of layers of the feature network (corresponding to the Number column of Table 1), $N^{\,l}_{feat}$ denotes the output of layer l of the feature network, and $\mathcal{W}$ denotes the warp mapping function, which maps the value at position p in the frame feature $f_{i-t}$ to the corresponding position p+δp of the current frame i, δp denoting the positional offset;
The multilayer propagated features of frame i+t are then computed as

$$f^{\,l}_{i+t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i+t}),\,M^{s}_{i\to i+t}\right)$$

(a bilinear warp sketch follows);
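A common realization of the warp mapping function $\mathcal{W}$ is bilinear sampling; the sketch below uses grid_sample and assumes the two flow channels are ordered (horizontal, vertical), which the text does not specify:

```python
import torch
import torch.nn.functional as F

def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Map feat values at positions p to positions p + delta_p given by flow."""
    n, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(flow.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # p + delta_p
    # Normalize the sampling coordinates to [-1, 1] for grid_sample.
    cx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    cy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack([cx, cy], dim=-1)                   # (N, H, W, 2)
    return F.grid_sample(feat, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```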
The frame-level feature aggregation process based on multilayer propagated features comprises the following steps:
Step C1: from the propagated first-layer features $f^{\,1}_{i-t\to i}$, $f^{\,1}_{i+t\to i}$ and the current-frame feature $f^{\,1}_{i}$, obtain the aggregated feature of the first layer of the feature network:

$$\bar f^{\,1}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,1}_{j\to i}\,f^{\,1}_{j\to i},\qquad f^{\,1}_{i\to i}=f^{\,1}_{i}$$

where $\bar f^{\,1}_{i}$ is the aggregated feature of the first layer of the network and $\hat w^{\,1}_{j\to i}$ is the scaled cosine similarity weight for aggregating first-layer features;
Step C2: feed the aggregated feature $\bar f^{\,1}_{i}$ of step C1 into the second layer of the feature network as the current-frame feature to obtain the feature $f^{\,2}_{i}=N^{\,2}_{feat}(\bar f^{\,1}_{i})$; simultaneously obtain the propagated second-layer features $f^{\,2}_{j\to i}$ of the adjacent frames, and aggregate again to obtain the aggregated feature of the second layer of the feature network:

$$\bar f^{\,2}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,2}_{j\to i}\,f^{\,2}_{j\to i}$$

where $\bar f^{\,2}_{i}$ is the aggregated feature of the second layer of the network and $\hat w^{\,2}_{j\to i}$ is the scaled cosine similarity weight for aggregating second-layer features;
Step C3: repeat the above aggregation process, aggregating the frame-level features of each layer of the network layer by layer and feeding the aggregated feature output by one layer into the next layer as its current-frame feature, until the aggregated feature of the last layer of the feature network is obtained:

$$\bar f^{\,n}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,n}_{j\to i}\,f^{\,n}_{j\to i}$$

where $\bar f^{\,n}_{i}$ is the aggregated feature of layer n of the network, $\hat w^{\,n}_{j\to i}$ is the scaled cosine similarity weight for aggregating layer-n features, and n is the total number of layers of the feature network;
The aggregated feature $\bar f^{\,n}_{i}$ of layer n of the feature network is the feature finally used for video object detection; $\bar f^{\,n}_{i}$ aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network, significantly enhancing the representation ability of the current-frame feature (the layer-by-layer loop is sketched below).
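The layer-by-layer aggregation of steps C1 to C3 reduces to a short loop; the sketch below assumes `stages` are the feature-network layers (for example, the FeatureNet.stages above), `props[l]` holds the propagated layer-l features of the adjacent frames, and `weights[l]` the matching SoftMax-normalized aggregation weights:

```python
def aggregate_layers(stages, frame_cur, props, weights):
    x = frame_cur
    for l, stage in enumerate(stages):
        f_cur = stage(x)                    # current-frame feature of layer l
        agg = weights[l]["cur"] * f_cur     # weighted current-frame term
        for f_prop, w in zip(props[l], weights[l]["adj"]):
            agg = agg + w * f_prop          # add weighted propagated terms
        x = agg                             # feed the aggregate into the next layer
    return x                                # multilayer-aggregated frame feature
```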
The scaled cosine similarity weights for aggregating layer-n features are computed as follows:
(1) model the quality distribution of the optical flow with cosine similarity weights;
A shallow mapping network $\mathcal{E}$ maps the features to a dimension dedicated to computing similarity:

$$f^{e}_{i}=\mathcal{E}(f_{i}),\qquad f^{e}_{i-t\to i}=\mathcal{E}(f_{i-t\to i})\tag{13}$$

where $f^{e}_{i}$ and $f^{e}_{i-t\to i}$ are the features $f_{i}$ and $f_{i-t\to i}$ after mapping and $\mathcal{E}$ is the mapping network;
Given the current-frame feature $f_{i}$ and the feature $f_{i-t\to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is

$$w_{i-t\to i}(p)=\frac{f^{e}_{i-t\to i}(p)\cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t\to i}(p)\right|\left|f^{e}_{i}(p)\right|}\tag{14}$$

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of dimension W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train (a sketch of formula (14) follows);
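Formula (14) maps directly onto a per-position cosine similarity over the channel dimension, which is what collapses the weight into the W × H matrix described above; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def cosine_weight(fe_cur: torch.Tensor, fe_prop: torch.Tensor) -> torch.Tensor:
    """Position-wise cosine similarity of two mapped features (N, C, H, W)."""
    return F.cosine_similarity(fe_prop, fe_cur, dim=1)  # (N, H, W)
```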
(2) extract a scaling factor directly from the appearance features of the video frames to model the quality distribution of the frames, obtaining the frame-level scaled cosine similarity weights that serve as the frame-level aggregation weights;
Given the current-frame feature $f_{i}$ and the propagated feature $f_{i-t\to i}$ of frame i−t, the weight scaling factor output by the weight scaling network $\mathcal{S}$ is

$$\lambda_{i-t}=\mathcal{S}(f_{i-t\to i})\tag{15}$$

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t\to i}$ is a two-dimensional planar matrix, the two are combined by channel-level multiplication to obtain pixel-level weights; for each channel c of the scaled output weight, the pixel value at each spatial position p is computed as

$$\hat w^{\,c}_{i-t\to i}(p)=\lambda^{c}_{i-t}\otimes w_{i-t\to i}(p)\tag{16}$$

where ⊗ denotes channel-level multiplication;
The scaled cosine similarity weights are obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated features of frame i+t is

$$\hat w^{\,c}_{i+t\to i}(p)=\lambda^{c}_{i+t}\otimes w_{i+t\to i}(p)\tag{17}$$

The weights are normalized across frames at each position p so that $\sum_{j=i-t}^{i+t}\hat w_{j\to i}(p)=1$; the normalization is performed with the SoftMax function (the scaled weights and normalization are sketched below);
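Formulas (15) to (17) and the SoftMax normalization can be sketched together; the tensor shapes here are assumptions for illustration (T adjacent frames plus the current frame stacked along the first axis):

```python
import torch

def scaled_normalized_weights(cos_w: torch.Tensor, lam: torch.Tensor):
    """cos_w: (T, N, H, W) cosine weights; lam: (T, N, C) scaling factors."""
    # Channel-level multiplication of formula (16):
    # (T, N, C, 1, 1) * (T, N, 1, H, W) -> (T, N, C, H, W).
    w = lam[..., None, None] * cos_w.unsqueeze(2)
    # SoftMax across frames so the weights at each position p sum to 1.
    return torch.softmax(w, dim=0)
```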
The mapping network and the weight scaling network share their first two layers: after the 1024-dimensional output of ResNet-101, two consecutive convolution layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied, and two branch subnetworks are then attached. The first branch is a 1 × 1 convolution that serves as the mapping network and outputs the mapped features $f^{e}_{i}$ and $f^{e}_{i-t\to i}$. The second branch is likewise a 1 × 1 convolution, followed by a global average pooling layer, serving as the weight scaling network; it produces a 1024-dimensional feature vector whose entries correspond to the channels of the feature vector output by ResNet-101 and measure the importance of the feature, controlling the scaling scale of the temporal feature aggregation weights (a sketch of this two-branch subnetwork follows).
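A sketch of this shared two-branch subnetwork follows; the 1024-dimensional input and output widths are stated above, while the intermediate channel widths (512, 256) are assumptions:

```python
import torch
import torch.nn as nn

class EmbedScaleNet(nn.Module):
    """Shared trunk with a mapping branch E and a weight scaling branch S."""
    def __init__(self, in_ch: int = 1024):
        super().__init__()
        self.shared = nn.Sequential(            # two shared convolution layers
            nn.Conv2d(in_ch, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.embed = nn.Conv2d(512, 256, kernel_size=1)   # mapping network
        self.scale = nn.Sequential(                       # weight scaling network
            nn.Conv2d(512, in_ch, kernel_size=1),
            nn.AdaptiveAvgPool2d(1))                      # global average pooling

    def forward(self, feat):
        shared = self.shared(feat)
        f_embed = self.embed(shared)           # mapped features for formula (14)
        lam = self.scale(shared).flatten(1)    # (N, 1024) channel scaling factors
        return f_embed, lam
```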
This embodiment tests the outputs of three residual blocks of ResNet-101, i.e., the output res3c_relu of the res3 block, the output res4b22_relu of the res4 block and the output conv_3×3_feat of the res5 block. Layers are sampled once every 5 layers near res3c_relu and once every 3 layers in the res4 block, finally yielding 9 sampled layers for testing, with layer numbers (2, 7, 12, 19, 21, 24, 27, 30, 33); the comparison of mean average precision is shown in Fig. 4. As can be seen from Fig. 4, res4b22_relu achieves the best accuracy, conv_3×3_feat is second, and res3c_relu performs worst. Up to the 17th layer the performance of the earlier layers declines rapidly, the mean-average-precision gap between the later layers narrows, and detection accuracy reaches its peak at layer 30. This confirms that propagating features at shallower layers outperforms propagation at the deepest layers, but that as the layers become shallower this gain saturates; indeed, as the resolution increases, optical flow prediction becomes harder and overall detection performance declines.
This embodiment is evaluated on the ImageNet VID validation set. The feature propagation layer of FGFA is adjusted so that it serves as the baseline for each level; the test results are shown in Table 2.
Table 2  Accuracy comparison between multilayer aggregation and single-layer feature propagation

The experimental results of Table 2 show that aggregating features propagated from the last layer of res4 (res4b22_relu) outperforms using the last layer of res5 (FGFA); hence propagating features at a shallower layer of the network performs better than at a deeper layer. The same results also show that propagating the features of both res4 and res5 and aggregating them further improves detection performance (72.1 → 73.6, ↑1.5), confirming the improvement in detection accuracy brought by multilayer feature aggregation.
To further demonstrate the improvement of the multilayer feature aggregation method on small-object detection performance, the VID validation set is divided into three groups, small, medium and large, by ground-truth box area, as shown in Fig. 5. The grouping criterion for object size is: an area in (0, 64²) is classed as small, an area in (64², 150²) as medium, and an area greater than 150² as large. This embodiment counts the proportion of each group in the validation set, as shown in Fig. 5. As can be seen from Fig. 5, large objects are the majority (60.0%) in the VID validation set, while small objects are few (13.5%). On these three groups of the ImageNet VID validation set, this embodiment separately tests single deep-layer (last layer of res5) feature propagation, single shallow-layer (last layer of res4) feature propagation, and fused multilayer (last layers of res4 + res5) feature propagation; the test results are shown in Table 3.
Table 3  Detection accuracy of different methods on different object sizes, ImageNet VID validation set

Method | mAP (%) (small) | mAP (%) (medium) | mAP (%) (large)
FGFA (res5) | 26.9 | 51.4 | 83.0
FGFA (res4) | 29.5 | 50.8 | 84.1
FGFA (res4+res5) | 30.1 | 51.9 | 84.5
As Table 3 shows, shallow-layer feature aggregation achieves higher detection performance on small objects than deep-feature aggregation (26.9% → 29.5%, ↑2.6%), indicating that for small-object detection the error of shallow-layer feature propagation has a smaller influence than the error of deep-feature propagation. Aggregating the shallow and deep features simultaneously achieves the best detection performance on all subdivisions of the validation set, showing that fusing deep and shallow features improves detection performance more comprehensively and demonstrating that the multilayer feature aggregation algorithm of the present invention can effectively fuse the respective advantages of multilayer features.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope defined by the claims of the present invention.

Claims (4)

1. An optical flow multilayer frame feature propagation and aggregation method for video object detection, characterized by comprising two parts: an optical-flow-based multilayer frame-level feature extraction and propagation process, and a frame-level feature aggregation process based on multilayer propagated features;
The optical-flow-based multilayer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extract the multilayer features of adjacent video frames;
The residual network ResNet-101 is used as the feature network for extracting frame-level features; ResNet-101 has different strides at different layers; the output stride of the last three layers of residual block res5 is modified to 16, and a dilated convolution layer is appended at the end of the network to reduce the dimensionality of the features output by res5;
Step S2: extract the optical flow of the video with the FlowNet optical flow network, and post-process the flow so that its size matches the differently sized features of each layer of the feature network;
Step S2.1: extract the optical flow of the video with the Simple version of the FlowNet network; the two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is fed into the FlowNet network to extract the flow;
Step S2.2: to match the feature sizes, upsample and downsample the optical flow to obtain optical flow suitable for multilayer feature propagation;
Step S3: propagate the multilayer frame-level features of frame i−t and frame i+t to frame i using the optical flow, obtaining the multilayer propagated features $f^{\,l}_{i-t\to i}$ and $f^{\,l}_{i+t\to i}$;
The frame-level feature aggregation process based on multilayer propagated features comprises the following steps:
Step C1: from the propagated first-layer features $f^{\,1}_{i-t\to i}$, $f^{\,1}_{i+t\to i}$ and the current-frame feature $f^{\,1}_{i}$, obtain the aggregated feature of the first layer of the feature network:

$$\bar f^{\,1}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,1}_{j\to i}\,f^{\,1}_{j\to i},\qquad f^{\,1}_{i\to i}=f^{\,1}_{i}$$

where $\bar f^{\,1}_{i}$ is the aggregated feature of the first layer of the network and $\hat w^{\,1}_{j\to i}$ is the scaled cosine similarity weight for aggregating first-layer features;
Step C2: feed the aggregated feature $\bar f^{\,1}_{i}$ of step C1 into the second layer of the feature network as the current-frame feature to obtain the feature $f^{\,2}_{i}=N^{\,2}_{feat}(\bar f^{\,1}_{i})$; simultaneously obtain the propagated second-layer features $f^{\,2}_{j\to i}$ of the adjacent frames, and aggregate again to obtain the aggregated feature of the second layer of the feature network:

$$\bar f^{\,2}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,2}_{j\to i}\,f^{\,2}_{j\to i}$$

where $\bar f^{\,2}_{i}$ is the aggregated feature of the second layer of the network and $\hat w^{\,2}_{j\to i}$ is the scaled cosine similarity weight for aggregating second-layer features;
Step C3: repeat the above aggregation process, aggregating the frame-level features of each layer of the network layer by layer and feeding the aggregated feature output by one layer into the next layer as its current-frame feature, until the aggregated feature of the last layer of the feature network is obtained:

$$\bar f^{\,n}_{i}=\sum_{j=i-t}^{i+t}\hat w^{\,n}_{j\to i}\,f^{\,n}_{j\to i}$$

where $\bar f^{\,n}_{i}$ is the aggregated feature of layer n of the network, $\hat w^{\,n}_{j\to i}$ is the scaled cosine similarity weight for aggregating layer-n features, and n is the total number of layers of the feature network;
The aggregated feature $\bar f^{\,n}_{i}$ of layer n of the feature network is the feature finally used for video object detection; $\bar f^{\,n}_{i}$ aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network, significantly enhancing the representation ability of the current-frame feature;
The scaled cosine similarity weights for aggregating layer-n features are computed as follows:
(1) model the quality distribution of the optical flow with cosine similarity weights;
(2) extract a scaling factor from the appearance features of the video frames to model the quality distribution of the frames, obtaining the frame-level scaled cosine similarity weights that serve as the frame-level aggregation weights.
2. The optical flow multilayer frame feature propagation and aggregation method for video object detection according to claim 1, characterized in that the specific method of step S2.2 is:
Step S2.2.1: given the current frame image $I_i$ of the video and its adjacent frame image $I_{i-t}$, the optical flow output by the FlowNet network is

$$M^{8}_{i\to i-t}=\mathcal{F}(I_i,\,I_{i-t})$$

where $M^{8}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$, the superscript 8 indicates a stride of 8, and $\mathcal{F}$ denotes the optical flow network FlowNet;
Step S2.2.2: upsample the optical flow to obtain flow corresponding to features with stride 4:

$$M^{4}_{i\to i-t}=\mathrm{upSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{4}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 4 and $\mathrm{upSample}(\cdot)$ denotes the nearest-neighbor upsampling function;
Step S2.2.3: downsample the optical flow to obtain flow corresponding to features with stride 16:

$$M^{16}_{i\to i-t}=\mathrm{downSample}\!\left(M^{8}_{i\to i-t}\right)$$

where $M^{16}_{i\to i-t}$ denotes the optical flow from the current frame $I_i$ to its adjacent frame $I_{i-t}$ at stride 16 and $\mathrm{downSample}(\cdot)$ denotes average-pooling downsampling;
Step S2.2.4: if $M^{8}_{i\to i-t}\in\mathbb{R}^{C\times H\times W}$, then correspondingly $M^{4}_{i\to i-t}\in\mathbb{R}^{C\times 2H\times 2W}$ and $M^{16}_{i\to i-t}\in\mathbb{R}^{C\times\frac{H}{2}\times\frac{W}{2}}$, where C is the number of channels (2 by default) and H and W are respectively the height and width of the flow; this yields optical flow suitable for multilayer feature propagation:

$$M^{s}_{i\to i-t},\qquad s\in\{4,8,16\}$$

where s denotes the feature stride.
3. The optical flow multilayer frame feature propagation and aggregation method for video object detection according to claim 1, characterized in that the specific method of step S3 is:
Given the multi-stride optical flow $M^{s}_{i\to i-t}$, the number of propagated layers l and the frame image $I_{i-t}$, the final propagated feature is computed as

$$f^{\,l}_{i-t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i-t}),\,M^{s}_{i\to i-t}\right)$$

where l denotes the layer index, $l\in(1,n)$, n is the total number of layers of the feature network, $N^{\,l}_{feat}$ denotes the output of layer l of the feature network, and $\mathcal{W}$ denotes the warp mapping function, which maps the value at position p in the frame feature $f_{i-t}$ to the corresponding position p+δp of the current frame i, δp denoting the positional offset;
The multilayer propagated features of frame i+t are then computed as

$$f^{\,l}_{i+t\to i}=\mathcal{W}\!\left(N^{\,l}_{feat}(I_{i+t}),\,M^{s}_{i\to i+t}\right)$$
4. The optical flow multilayer frame feature propagation and aggregation method for video object detection according to claim 1, characterized in that the specific method of modeling the quality distribution of the optical flow with cosine similarity weights, described in step C3, is:
A shallow mapping network $\mathcal{E}$ maps the features to a dimension dedicated to computing similarity:

$$f^{e}_{i}=\mathcal{E}(f_{i}),\qquad f^{e}_{i-t\to i}=\mathcal{E}(f_{i-t\to i})\tag{13}$$

where $f^{e}_{i}$ and $f^{e}_{i-t\to i}$ are the features $f_{i}$ and $f_{i-t\to i}$ after mapping and $\mathcal{E}$ is the mapping network;
The specific method of extracting a scaling factor from the appearance features of the video frames, modeling the quality distribution of the frames and obtaining the frame-level scaled cosine similarity weights is:
Given the current-frame feature $f_{i}$ and the feature $f_{i-t\to i}$ propagated from the adjacent frame, the cosine similarity between them at spatial position p is

$$w_{i-t\to i}(p)=\frac{f^{e}_{i-t\to i}(p)\cdot f^{e}_{i}(p)}{\left|f^{e}_{i-t\to i}(p)\right|\left|f^{e}_{i}(p)\right|}\tag{14}$$

The weights output by formula (14) are summed along the channel dimension, so that the output weight becomes a two-dimensional matrix of dimension W × H, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to be learned and makes the network easier to train.
Given the current-frame feature $f_{i}$ and the propagated feature $f_{i-t\to i}$ of frame i−t, the weight scaling factor output by the weight scaling network $\mathcal{S}$ is

$$\lambda_{i-t}=\mathcal{S}(f_{i-t\to i})\tag{15}$$

Since $\lambda_{i-t}$ is a channel-level vector while the cosine similarity weight $w_{i-t\to i}$ is a two-dimensional planar matrix, the two are combined by channel-level multiplication to obtain pixel-level weights; for each channel c of the scaled output weight, the pixel value at each spatial position p is computed as

$$\hat w^{\,c}_{i-t\to i}(p)=\lambda^{c}_{i-t}\otimes w_{i-t\to i}(p)\tag{16}$$

where ⊗ denotes channel-level multiplication;
The scaled cosine similarity weights are obtained through formulas (14), (15) and (16);
Correspondingly, the weight of the propagated features of frame i+t is

$$\hat w^{\,c}_{i+t\to i}(p)=\lambda^{c}_{i+t}\otimes w_{i+t\to i}(p)\tag{17}$$

The weights are normalized across frames at each position p so that $\sum_{j=i-t}^{i+t}\hat w_{j\to i}(p)=1$; the normalization is performed with the SoftMax function;
The mapping network and the weight scaling network share their first two layers: after the 1024-dimensional output of ResNet-101, two consecutive convolution layers, a 1 × 1 convolution and a 3 × 3 convolution, are applied, and two branch subnetworks are then attached. The first branch is a 1 × 1 convolution that serves as the mapping network and outputs the mapped features $f^{e}_{i}$ and $f^{e}_{i-t\to i}$. The second branch is likewise a 1 × 1 convolution, followed by a global average pooling layer, serving as the weight scaling network; it produces a 1024-dimensional feature vector whose entries correspond to the channels of the feature vector output by ResNet-101 and measure the importance of the feature, controlling the scaling scale of the temporal feature aggregation weights.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910230235.2A | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection


Publications (2)

Publication Number | Publication Date
CN109993096A | 2019-07-09
CN109993096B | 2022-12-20

Family

ID=67131468

Family Applications (1)

Application Number | Priority Date | Filing Date | Title | Status
CN201910230235.2A | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection | Active

Country Status (1)

Country | Publication
CN | CN109993096B



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20180268208A1 * | 2017-03-20 | 2018-09-20 | Microsoft Technology Licensing, LLC | Feature flow for video recognition
CN108242062A * | 2017-12-27 | 2018-07-03 | 北京纵目安驰智能科技有限公司 | Target tracking method, system, terminal and medium based on deep feature flow
CN109376611A * | 2018-09-27 | 2019-02-22 | 方玉明 | Video saliency detection method based on 3D convolutional neural networks

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110400305A * | 2019-07-26 | 2019-11-01 | 哈尔滨理工大学 | Object detection method based on deep learning
CN110852199A * | 2019-10-28 | 2020-02-28 | 中国石化销售股份有限公司华南分公司 | Foreground extraction method based on a double-frame encoding-decoding model
JP2022551396A | 2019-11-20 | 2022-12-09 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Action recognition method, apparatus, computer program and computer device
JP7274048B2 | 2019-11-20 | 2023-05-15 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Action recognition method, apparatus, computer program and computer device
US11928893B2 | 2019-11-20 | 2024-03-12 | Tencent Technology (Shenzhen) Company Limited | Action recognition method and apparatus, computer storage medium, and computer device
CN111144376A * | 2019-12-31 | 2020-05-12 | 华南理工大学 | Video object detection feature extraction method
CN111144376B * | 2019-12-31 | 2023-12-05 | 华南理工大学 | Video object detection feature extraction method
CN113673545A * | 2020-05-13 | 2021-11-19 | 华为技术有限公司 | Optical flow estimation method, related apparatus, device and computer-readable storage medium
CN112307872A * | 2020-06-12 | 2021-02-02 | 北京京东尚科信息技术有限公司 | Method and device for detecting target objects
CN111860293B * | 2020-07-16 | 2023-12-22 | 中南民族大学 | Remote sensing scene classification method, device, terminal equipment and storage medium
CN111860293A * | 2020-07-16 | 2020-10-30 | 中南民族大学 | Remote sensing scene classification method, device, terminal equipment and storage medium
CN111950612A * | 2020-07-30 | 2020-11-17 | 中国科学院大学 | FPN-based weak and small object detection method with fusion factor
CN112307889A * | 2020-09-22 | 2021-02-02 | 北京航空航天大学 | Face detection algorithm based on a small auxiliary network
CN112307889B * | 2020-09-22 | 2022-07-26 | 北京航空航天大学 | Face detection algorithm based on a small auxiliary network
CN112394356A * | 2020-09-30 | 2021-02-23 | 桂林电子科技大学 | U-Net-based small-target unmanned aerial vehicle detection system and method
CN112394356B * | 2020-09-30 | 2024-04-02 | 桂林电子科技大学 | U-Net-based small-target unmanned aerial vehicle detection system and method
CN111968064A * | 2020-10-22 | 2020-11-20 | 成都睿沿科技有限公司 | Image processing method and device, electronic equipment and storage medium
CN111968064B * | 2020-10-22 | 2021-01-15 | 成都睿沿科技有限公司 | Image processing method and device, electronic equipment and storage medium
CN112966581B * | 2021-02-25 | 2022-05-27 | 厦门大学 | Video object detection method based on internal and external semantic aggregation
CN112966581A * | 2021-02-25 | 2021-06-15 | 厦门大学 | Video object detection method based on internal and external semantic aggregation
CN113223044A * | 2021-04-21 | 2021-08-06 | 西北工业大学 | Infrared video object detection method combining feature aggregation and an attention mechanism
CN113570608B * | 2021-06-30 | 2023-07-21 | 北京百度网讯科技有限公司 | Object segmentation method and device, and electronic equipment
CN113570608A * | 2021-06-30 | 2021-10-29 | 北京百度网讯科技有限公司 | Object segmentation method and device, and electronic equipment

Also Published As

Publication number | Publication date
CN109993096B | 2022-12-20


Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant