CN109993096A - Optical-flow-based multilayer frame feature propagation and aggregation method for video object detection - Google Patents
Optical-flow-based multilayer frame feature propagation and aggregation method for video object detection
- Publication number
- CN109993096A (publication number); CN201910230235.2A (application number)
- Authority
- CN
- China
- Prior art keywords
- feature
- frame
- network
- optical flow
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The present invention provides an optical-flow-based multilayer frame feature propagation and aggregation method for video object detection, and relates to the technical field of computer vision. The method first extracts the multilayer features of adjacent frames with a feature network and extracts optical flow with a flow network; it then uses the flow to propagate the multilayer frame-level features of the frames before and after the current frame to the current frame, upsampling or downsampling the flow for layers of different strides to obtain the multilayer propagated features. The propagated features of each layer are then aggregated layer by layer, and the resulting multilayer-aggregated frame-level features are used for the final video object detection. In the optical-flow-based multilayer frame feature propagation and aggregation method provided by the present invention, the output aggregated frame-level features combine the high resolution of shallow layers with the high-dimensional semantic features of deep layers, which improves detection performance; the multilayer feature aggregation in particular improves the detection of small objects.
Description
Technical field
The present invention relates to the technical field of computer vision, and more particularly to an optical-flow-based multilayer frame feature propagation and aggregation method for video object detection.
Background art
Current video object detection methods, at home and abroad, fall mainly into two classes: box-level methods and flow-based feature-level methods. In recent years, researchers have focused on the high-level semantic features extracted by deep neural networks, modeling the motion between video frames with optical flow and using the inter-frame flow to propagate the features of adjacent frames to the current frame in order to predict or enhance the current frame's features. The advantages of this approach are that it is conceptually clear, simple, and effective, and the model can be trained end to end. Although optical flow can be used for spatial transformation at the feature level, propagating inter-frame features with flow information introduces errors. For example, when propagating features between frames, DFF and FGFA use the features extracted by res5, the last residual block of the residual network; because the flow network has errors, local features become misaligned, which causes two problems. First, the features extracted by res5 have low resolution and a high semantic level, so the semantic information contained in each pixel is very rich; detecting directly on these error-laden propagated features, or detecting after aggregating them, without any method of correcting the erroneous pixels, directly hurts detection performance. Second, each pixel of the feature map extracted by residual block res5 has a large receptive field on the original image. Some small objects in video are below 64 × 64 pixels in resolution, which corresponds to less than 4 × 4 in the res5 feature map, so the error of a single pixel affects the detection of these small objects far more than it affects larger objects above 150 × 150 pixels. In the field of image object detection, the features of multiple layers of the feature network are usually used simultaneously for detection to improve detection accuracy, especially for small objects; this is known as a feature pyramid, with typical methods such as SSD and FPN. These methods demonstrate that the features of different levels of the feature network have complementary strengths, and that jointly detecting over multiple layers can effectively improve detection accuracy.
Summary of the invention
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide an optical-flow-based multilayer frame feature propagation and aggregation method for video object detection that realizes flow-guided feature propagation and aggregation.
To solve this technical problem, the technical solution adopted by the present invention is an optical-flow-based multilayer frame feature propagation and aggregation method for video object detection, comprising two parts: a flow-based multilayer frame-level feature extraction and propagation process, and a frame-level feature aggregation process based on the multilayer propagated features.
The flow-based multilayer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extract the multilayer features of adjacent video frames.
The residual network ResNet-101 is used as the feature network for extracting frame-level features. ResNet-101 has different strides at different layers; the output stride of the last three layers of residual block res5 is modified to 16, and a dilated convolutional layer is appended to the end of the network to reduce the dimensionality of the features output by res5.
Step S2: extract the optical flow of the video with the FlowNet flow network, and post-process the flow, resizing it to match the differently sized features of each layer of the feature network.
Step S2.1: extract the optical flow of the video with the Simple version of the FlowNet network. The two adjacent video frames are concatenated directly along the channel dimension, and the resulting 6-channel image is input to the FlowNet network to extract the flow.
Step S2.2: upsample and downsample the flow to match the feature sizes.
Step S2.2.1: given the current frame image I_i of the video and its adjacent frame image I_{i−t}, the flow output by the FlowNet network is:

M^8_{i→i−t} = F(I_i, I_{i−t})

where M^8_{i→i−t} denotes the flow between the current frame I_i and its adjacent frame I_{i−t}, the superscript 8 indicates a stride of 8, and F denotes the flow network FlowNet.
Step S2.2.2: upsample the flow to obtain the flow corresponding to feature stride 4:

M^4_{i→i−t} = upSample(M^8_{i→i−t})

where M^4_{i→i−t} denotes the flow between the current frame I_i and its adjacent frame I_{i−t} at stride 4, and upSample(·) denotes the nearest-neighbor upsampling function.
Step S2.2.3: downsample the flow to obtain the flow corresponding to feature stride 16:

M^16_{i→i−t} = downSample(M^8_{i→i−t})

where M^16_{i→i−t} denotes the flow between the current frame I_i and its adjacent frame I_{i−t} at stride 16, and downSample(·) denotes average-pooling downsampling.
Step S2.2.4: if M^8_{i→i−t} ∈ R^{C×H×W}, then correspondingly M^4_{i→i−t} ∈ R^{C×2H×2W} and M^16_{i→i−t} ∈ R^{C×(H/2)×(W/2)}, where C is the number of channels (2 by default) and H and W are respectively the height and width of the flow. Because flow displacements are measured in pixels of their own grid, the resized flow values are rescaled accordingly, yielding the flow suited to multilayer feature propagation:

M^s_{i→i−t} = (8/s) · resize_s(M^8_{i→i−t}),  s ∈ {4, 8, 16}

where s denotes the feature stride and resize_s is the upsampling, identity, or downsampling operation that brings the stride-8 flow to stride s.
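Steps S2.2.1 through S2.2.4 can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the function name `resize_flow` and the use of NumPy are assumptions, and the 8/s rescaling of displacement values follows the stride relation described above.

```python
import numpy as np

def resize_flow(flow8, target_stride):
    """Resize a stride-8 flow field (2, H, W) to another feature stride.

    Displacement values are scaled by the same factor as the spatial
    resolution, since flow is measured in pixels of its own grid.
    """
    scale = 8.0 / target_stride
    c, h, w = flow8.shape
    if scale > 1.0:                       # nearest-neighbour upsampling
        r = int(scale)
        out = flow8.repeat(r, axis=1).repeat(r, axis=2)
    elif scale < 1.0:                     # average-pool downsampling
        k = int(1.0 / scale)
        out = flow8.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))
    else:
        out = flow8
    return out * scale                    # rescale displacement magnitudes
```

A stride-8 flow of shape (2, 32, 32) thus becomes (2, 64, 64) for stride-4 features (values doubled) and (2, 16, 16) for stride-16 features (values halved).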
Step S3: propagate the multilayer frame-level features of frames i−t and i+t to frame i using the flow, obtaining the multilayer propagated features.
Given the multi-stride flow M^s_{i→i−t}, the number of propagated layers l, and the (i−t)-th frame image I_{i−t}, the final propagated features are computed as:

f^l_{i−t→i} = Warp(N^l(I_{i−t}), M^s_{i→i−t})

where l denotes the layer index, l ∈ (1, n), n is the total number of layers of the feature network, N^l(·) denotes the output of layer l of the feature network, and Warp denotes the mapping function, which maps the value at position p of the (i−t)-th frame feature f_{i−t} to the corresponding position p + δp of the current frame i, with δp denoting the positional offset.
The multilayer propagated features of the (i+t)-th frame are then computed analogously:

f^l_{i+t→i} = Warp(N^l(I_{i+t}), M^s_{i→i+t})
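The Warp mapping can be illustrated with a standalone bilinear warp. In practice such warping is usually done with a framework primitive (e.g. a grid-sample operation); the NumPy version below is a simplified stand-in, and the name `warp_feature` is illustrative, not from the patent.

```python
import numpy as np

def warp_feature(feat, flow):
    """Warp a feature map (C, H, W) toward the current frame using flow (2, H, W).

    Each output position p takes the value of feat at p + flow(p), with
    bilinear interpolation and border clamping.
    """
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # sampling coordinates: position p shifted by the flow offset delta-p
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = sx - x0, sy - y0
    # bilinear blend of the four neighbouring feature values
    return (feat[:, y0, x0] * (1 - wx) * (1 - wy)
          + feat[:, y0, x1] * wx * (1 - wy)
          + feat[:, y1, x0] * (1 - wx) * wy
          + feat[:, y1, x1] * wx * wy)
```

With zero flow the feature map is returned unchanged; a constant horizontal flow of 1 shifts every column by one pixel.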
The frame-level feature aggregation process based on the multilayer propagated features comprises the following steps:
Step C1: aggregate the propagated features f^1_{i−t→i} and f^1_{i+t→i} of the first layer of the feature network with the current frame feature f^1_i to obtain the aggregated feature of the first layer of the feature network:

f̄^1_i = w^1_{i−t→i} f^1_{i−t→i} + w^1_i f^1_i + w^1_{i+t→i} f^1_{i+t→i}

where f̄^1_i is the aggregated feature of the first layer of the network and the w^1 terms are the scaled cosine similarity weights for aggregating the first-layer features.
Step C2: feed the aggregated feature f̄^1_i of step C1 into the second layer of the feature network as the current frame feature, obtaining the feature f^2_i; at the same time obtain the propagated features f^2_{i−t→i} and f^2_{i+t→i} of the second layer of the adjacent frames, and aggregate again to obtain the aggregated feature of the second layer of the feature network:

f̄^2_i = w^2_{i−t→i} f^2_{i−t→i} + w^2_i f^2_i + w^2_{i+t→i} f^2_{i+t→i}

where f̄^2_i is the aggregated feature of the second layer of the network and the w^2 terms are the scaled cosine similarity weights for aggregating the second-layer features.
Step C3: repeat the above aggregation process, aggregating the frame-level features of each layer one by one and feeding the aggregated feature output by each layer into the next layer as the current frame feature, until the aggregated feature of the last layer of the feature network is obtained:

f̄^n_i = w^n_{i−t→i} f^n_{i−t→i} + w^n_i f^n_i + w^n_{i+t→i} f^n_{i+t→i}

where f̄^n_i is the aggregated feature of the n-th layer of the network, the w^n terms are the scaled cosine similarity weights for aggregating the n-th-layer features, and n is the total number of layers of the feature network.
The aggregated feature f̄^n_i of the n-th layer of the feature network is the feature finally used for video object detection; f̄^n_i aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network.
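The layer-by-layer aggregation of steps C1 to C3 can be sketched as a simple loop. This is an illustrative NumPy sketch under assumptions: the names (`aggregate_layers`, `layers`, etc.) are not from the patent, each callable in `layers` stands in for one stage of the feature network, and the weights are assumed to be pre-normalized so that they sum to 1 at every position.

```python
import numpy as np

def aggregate_layers(x, layers, prop_prev, prop_next, w_prev, w_cur, w_next):
    """Recursive multilayer aggregation: each layer's aggregated feature
    is fed into the next layer as the current-frame feature.

    layers[l]    : maps the previous aggregated feature to layer l's
                   current-frame feature
    prop_prev[l] : feature propagated from frame i-t at layer l
    prop_next[l] : feature propagated from frame i+t at layer l
    w_*[l]       : per-layer aggregation weights (assumed normalized)
    """
    agg = x
    for l, layer in enumerate(layers):
        f_cur = layer(agg)                      # current-frame feature at layer l
        agg = (w_prev[l] * prop_prev[l]         # weighted sum over frames
             + w_cur[l] * f_cur
             + w_next[l] * prop_next[l])
    return agg                                  # aggregated feature of the last layer
```

With all weight on the current frame, the loop reduces to an ordinary forward pass through the stacked layers.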
The scaled cosine similarity weights for aggregating the n-th-layer features are computed as follows:
(1) Model the quality distribution of the flow with cosine similarity weights.
A shallow embedding network E maps the features to a dimension dedicated to computing similarity:

f^e_i = E(f_i),  f^e_{i−t→i} = E(f_{i−t→i})

where f^e_i and f^e_{i−t→i} are the mapped versions of the features f_i and f_{i−t→i}, and E is the embedding network.
Given the current frame feature f_i and the feature f_{i−t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

w_{i−t→i}(p) = f^e_{i−t→i}(p) · f^e_i(p) / (|f^e_{i−t→i}(p)| |f^e_i(p)|)   (14)

The weights output by formula (14) are summed along the channel dimension so that the output weight becomes a two-dimensional W × H matrix, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to learn and makes the network easier to train.
(2) Extract a scaling factor directly from the appearance features of the video frames to model the quality distribution of each frame, obtaining frame-level scaled cosine similarity weights that serve as the frame-level aggregation weights of steps C1 to C3.
Given the current frame feature f_i and the propagated feature f_{i−t→i} of the (i−t)-th frame, the weight scaling factor output by the weight scaling network S is:

λ_{i−t} = S(f_i, f_{i−t→i})   (15)

Since λ_{i−t} is a channel-level vector while the cosine similarity weight w_{i−t→i} is a two-dimensional matrix, the two are combined by channel-level multiplication to obtain pixel-level weights. For each channel c of the output scaled weight, the value at each spatial position p is computed as:

ŵ_{i−t→i}(c, p) = λ_{i−t}(c) ⊗ w_{i−t→i}(p)   (16)

where ⊗ denotes channel-level multiplication.
The scaled cosine similarity weights are thus obtained through formulas (14), (15), and (16). Correspondingly, the weight of the propagated features of the (i+t)-th frame is:

ŵ_{i+t→i}(c, p) = λ_{i+t}(c) ⊗ w_{i+t→i}(p)

The weights at each position p are normalized along the frame dimension so that ŵ_{i−t→i}(p) + ŵ_i(p) + ŵ_{i+t→i}(p) = 1; the normalization is performed by the SoftMax function.
The embedding network and the weight scaling network share their first two layers: after the 1024-dimensional output of ResNet-101, two consecutive convolutional layers, a 1 × 1 convolution followed by a 3 × 3 convolution, are applied, and then two branch subnetworks are attached. The first branch is a 1 × 1 convolution serving as the embedding network, which outputs the mapped features. The second branch is likewise a 1 × 1 convolution, followed by a global average pooling layer, serving as the weight scaling network; it generates a 1024-dimensional feature vector, one element per channel of the ResNet-101 output feature, which measures the importance of the feature and controls the scale of the temporal aggregation weights.
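The scaled cosine similarity weight and the SoftMax normalization across frames can be sketched as follows. This NumPy sketch is illustrative only: the function names are assumptions, `lam` stands in for the per-channel output of the weight-scaling branch, and the small `eps` term is added here for numerical stability (it is not part of the patent's formulas).

```python
import numpy as np

def scaled_cosine_weight(f_cur, f_prop, lam, eps=1e-8):
    """Scaled cosine similarity weight at every spatial position.

    f_cur, f_prop : embedded features (C, H, W) of the current frame and
                    a propagated frame
    lam           : per-channel scaling vector (C,) from the weight-scaling
                    subnetwork
    Returns a per-channel weight map (C, H, W): the channel-level scaling
    factor multiplied onto the 2-D cosine similarity map (formula 16).
    """
    num = (f_cur * f_prop).sum(axis=0)                       # (H, W)
    den = (np.linalg.norm(f_cur, axis=0)
         * np.linalg.norm(f_prop, axis=0) + eps)
    cos = num / den                                          # cosine map, formula (14)
    return lam[:, None, None] * cos[None]                    # channel-level product

def normalize_over_frames(weights):
    """SoftMax over the frame axis so the weights at each position sum to 1."""
    w = np.stack(weights)                                    # (T, C, H, W)
    e = np.exp(w - w.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)
```

For identical embedded features the cosine map is 1 everywhere, and the normalized weights of three frames each become 1/3 at every position.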
The beneficial effects of the above technical solution are as follows. In the optical-flow-based multilayer frame feature propagation and aggregation method for video object detection provided by the present invention, features are propagated on the shallow outputs of the feature network (the res3 and res4 layers). On the one hand, the shallow layers have high resolution and are more tolerant of errors on small objects during feature propagation; on the other hand, propagation errors in the shallow layers can be attenuated, and even gradually corrected, by the subsequent layers of the network. Features are then propagated simultaneously at the shallow and deep layers of the feature network, and the deep and shallow features are aggregated; this exploits the high-level semantic features of the deep layers while retaining the high resolution of the shallow features. The output aggregated frame-level features thus combine the high resolution of shallow layers with the high-dimensional semantic features of deep layers, improving detection performance, and the multilayer feature aggregation in particular improves the detection of small objects.
Brief description of the drawings
Fig. 1 is a flow chart of the optical-flow-based multilayer frame feature propagation and aggregation method for video object detection provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the flow-based multilayer feature propagation and aggregation process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the FlowNet network structure (Simple version) provided by an embodiment of the present invention;
Fig. 4 is a comparison of the detection performance of different network layers provided by an embodiment of the present invention;
Fig. 5 is a histogram of the ground-truth box area distribution of the ImageNet VID validation set and its grouping, provided by an embodiment of the present invention.
Specific embodiment
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
This embodiment takes the video dataset ImageNet VID as an example and applies the optical-flow-based multilayer frame feature propagation and aggregation method of the present invention to this video data.
As shown in Figs. 1 and 2, the optical-flow-based multilayer frame feature propagation and aggregation method for video object detection comprises two parts: a flow-based multilayer frame-level feature extraction and propagation process, and a frame-level feature aggregation process based on the multilayer propagated features.
The flow-based multilayer frame-level feature extraction and propagation process comprises the following steps:
Step S1: extract the multilayer features of adjacent video frames.
The residual network ResNet-101 is used as the feature network for extracting frame-level features. ResNet-101 has different strides at different layers; following the R-FCN network, the output stride of the last three layers of residual block res5 is modified to 16, and a dilated convolutional layer is appended to the end of the network to reduce the dimensionality of the features output by res5.
In this embodiment, the modified ResNet-101 is used as the feature network for extracting frame-level features; the detailed stride and spatial-scale statistics of each layer are shown in Table 1. ResNet-101 has different strides at different layers of the network; the output strides of the last three layers res5a_relu, res5b_relu, and res5c_relu are modified to 16, and a dilated convolutional layer feat_conv_3×3_relu with dilate = 6, kernel = 3, pad = 6, and num_filters = 1024 is appended.
Table 1. Stride statistics for each layer of ResNet-101
Number | ResNet-101 layer | Stride | Scale |
1 | res2a_relu | 4 | 1/4 |
2 | res2b_relu | 4 | 1/4 |
3 | res2c_relu | 4 | 1/4 |
4 | res3a_relu | 8 | 1/8 |
5 | res3b1_relu | 8 | 1/8 |
6 | res3b2_relu | 8 | 1/8 |
7 | res3b3_relu | 8 | 1/8 |
8 | res4a_relu | 16 | 1/16 |
9 | res4b1_relu | 16 | 1/16 |
10 | res4b2_relu | 16 | 1/16 |
… | … | … | … |
30 | res4b22_relu | 16 | 1/16 |
31 | res5a_relu | 16 | 1/16 |
32 | res5b_relu | 16 | 1/16 |
33 | feat_conv_3×3_relu | 16 | 1/16 |
Owing to the architecture of the residual network, this embodiment counts only the output layers of the residual modules; the internal layers are not counted and are not used for feature propagation. Number denotes the index of the corresponding network layer, the second column lists all network-layer outputs of ResNet-101 except the first two layers, Stride denotes the feature stride of the corresponding layer's output, and Scale denotes the ratio of the output scale of the corresponding layer to the original image scale. In this embodiment, the res2b_relu, res3b3_relu, res4b22_relu, and feat_conv_3×3_relu layers are used for multilayer feature propagation.
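The appended dilated convolution (dilate = 6, kernel = 3, pad = 6) preserves the spatial size of the res5 output, since the effective kernel size 6 × (3 − 1) + 1 = 13 is exactly offset by the padding. A quick arithmetic check (the function name is illustrative):

```python
def dilated_conv_out(h, kernel=3, pad=6, dilation=6, stride=1):
    """Spatial output size of a dilated convolution: the effective kernel
    size is dilation*(kernel-1)+1, here 6*2+1 = 13, so pad=6 on each side
    leaves the spatial size unchanged at stride 1."""
    eff = dilation * (kernel - 1) + 1
    return (h + 2 * pad - eff) // stride + 1
```

For any input size h the output size equals h, so feat_conv_3×3_relu keeps the stride-16 resolution of res5.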
Step S2: extract the optical flow of the video with the FlowNet flow network, and post-process the flow, resizing it to match the differently sized features of each layer of the feature network.
Step S2.1: extract the optical flow of the video with the Simple version of the FlowNet network shown in Fig. 3. The two adjacent video frames are concatenated directly along the channel dimension, and the resulting 6-channel image is input to the FlowNet network to extract the flow.
The FlowNet network extracts features containing high-dimensional semantic information from the two frames through a downsampling CNN. First, an average pooling layer with window size 2 × 2 and stride 2 halves the size of the input image; then nine consecutive convolutional layers raise the level of feature abstraction while the feature size shrinks to 1/32 of the original.
The output feature map of the downsampling CNN is highly semantic but low in resolution; relative to the original images, much of the inter-image detail has been lost, so flow learned from these features alone is of poor quality. FlowNet therefore introduces a refinement module after the downsampling CNN to raise the feature resolution and learn high-quality flow between the images.
The refinement module is based on the FCN idea: it uses FCN-like deconvolution operations to raise the feature resolution while supplementing the lost detail with the output features of earlier layers, finally outputting a two-channel flow. The network structure of the refinement module is as follows: first a deconvolution doubles the feature-map size, then the result is concatenated along the channel dimension with the output feature map of the corresponding convolutional layer in the downsampling CNN and used as the input of the next layer. Subsequent stages proceed in essentially the same way, except that after each stage a flow branch learns a flow of the corresponding size, and this flow is concatenated along the channel dimension to the output feature map, which continues on as the next layer's input.
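One refinement stage can be illustrated at the level of tensor shapes. This is a shape-only sketch under assumptions: nearest-neighbour upsampling stands in for the learned deconvolution, and the function and argument names are illustrative, not from FlowNet.

```python
import numpy as np

def refinement_step(x, skip, flow_prev=None):
    """One refinement stage of FlowNet-Simple, shape-level sketch.

    x         : current feature map (C, H, W); its spatial size is doubled
                (nearest-neighbour upsampling stands in for deconvolution)
    skip      : feature map (C', 2H, 2W) from the corresponding
                downsampling-CNN layer
    flow_prev : optional 2-channel flow (2, H, W) predicted at the previous
                stage, upsampled and concatenated as well
    Returns the channel-wise concatenation used as the next stage's input.
    """
    up = x.repeat(2, axis=1).repeat(2, axis=2)          # double H and W
    parts = [up, skip]
    if flow_prev is not None:
        parts.append(flow_prev.repeat(2, axis=1).repeat(2, axis=2))
    return np.concatenate(parts, axis=0)                # concat along channels
```

For example, an 8-channel 4 × 4 feature, a 6-channel 8 × 8 skip feature, and a 2-channel 4 × 4 previous flow combine into a 16-channel 8 × 8 input for the next stage.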
Step S2.2: upsample and downsample the flow to match the feature sizes.
Step S2.2.1: given the current frame image I_i of the video and its adjacent frame image I_{i−t}, the flow output by the FlowNet network is:

M^8_{i→i−t} = F(I_i, I_{i−t})

where M^8_{i→i−t} denotes the flow between the current frame I_i and its adjacent frame I_{i−t}, the superscript 8 indicates a stride of 8, and F denotes the flow network FlowNet.
Step S2.2.2: upsample the flow to obtain the flow corresponding to feature stride 4:

M^4_{i→i−t} = upSample(M^8_{i→i−t})

where M^4_{i→i−t} denotes the flow between the current frame I_i and its adjacent frame I_{i−t} at stride 4, and upSample(·) denotes the nearest-neighbor upsampling function.
Step S2.2.3: downsample the flow to obtain the flow corresponding to feature stride 16:

M^16_{i→i−t} = downSample(M^8_{i→i−t})

where M^16_{i→i−t} denotes the flow between the current frame I_i and its adjacent frame I_{i−t} at stride 16, and downSample(·) denotes average-pooling downsampling.
Step S2.2.4: if M^8_{i→i−t} ∈ R^{C×H×W}, then correspondingly M^4_{i→i−t} ∈ R^{C×2H×2W} and M^16_{i→i−t} ∈ R^{C×(H/2)×(W/2)}, where C is the number of channels (2 by default) and H and W are respectively the height and width of the flow. Because flow displacements are measured in pixels of their own grid, the resized flow values are rescaled accordingly, yielding the flow suited to multilayer feature propagation:

M^s_{i→i−t} = (8/s) · resize_s(M^8_{i→i−t}),  s ∈ {4, 8, 16}

where s denotes the feature stride and resize_s is the upsampling, identity, or downsampling operation that brings the stride-8 flow to stride s.
Step S3: propagate the multilayer frame-level features of frames i−t and i+t to frame i using the flow, obtaining the multilayer propagated features.
In this embodiment, to propagate multilayer features, the same flow is used for every layer of the same stride; for example, all layers from res4a_relu up to the dilated convolutional layer feat_conv_3×3_relu propagate features with the stride-16 flow.
Given the multi-stride flow M^s_{i→i−t}, the number of propagated layers l, and the (i−t)-th frame image I_{i−t}, the final propagated features are computed as:

f^l_{i−t→i} = Warp(N^l(I_{i−t}), M^s_{i→i−t})

where l denotes the layer index, l ∈ (1, n), n is the total number of layers of the feature network, corresponding to the Number column of Table 1; N^l(·) denotes the output of layer l of the feature network; and Warp denotes the mapping function, which maps the value at position p of the (i−t)-th frame feature f_{i−t} to the corresponding position p + δp of the current frame i, with δp denoting the positional offset.
The multilayer propagated features of the (i+t)-th frame are then computed analogously:

f^l_{i+t→i} = Warp(N^l(I_{i+t}), M^s_{i→i+t})
The frame-level feature aggregation process based on the multilayer propagated features comprises the following steps:
Step C1: aggregate the propagated features f^1_{i−t→i} and f^1_{i+t→i} of the first layer of the feature network with the current frame feature f^1_i to obtain the aggregated feature of the first layer of the feature network:

f̄^1_i = w^1_{i−t→i} f^1_{i−t→i} + w^1_i f^1_i + w^1_{i+t→i} f^1_{i+t→i}

where f̄^1_i is the aggregated feature of the first layer of the network and the w^1 terms are the scaled cosine similarity weights for aggregating the first-layer features.
Step C2: feed the aggregated feature f̄^1_i of step C1 into the second layer of the feature network as the current frame feature, obtaining the feature f^2_i; at the same time obtain the propagated features f^2_{i−t→i} and f^2_{i+t→i} of the second layer of the adjacent frames, and aggregate again to obtain the aggregated feature of the second layer of the feature network:

f̄^2_i = w^2_{i−t→i} f^2_{i−t→i} + w^2_i f^2_i + w^2_{i+t→i} f^2_{i+t→i}

where f̄^2_i is the aggregated feature of the second layer of the network and the w^2 terms are the scaled cosine similarity weights for aggregating the second-layer features.
Step C3: repeat the above aggregation process, aggregating the frame-level features of each layer one by one and feeding the aggregated feature output by each layer into the next layer as the current frame feature, until the aggregated feature of the last layer of the feature network is obtained:

f̄^n_i = w^n_{i−t→i} f^n_{i−t→i} + w^n_i f^n_i + w^n_{i+t→i} f^n_{i+t→i}

where f̄^n_i is the aggregated feature of the n-th layer of the network, the w^n terms are the scaled cosine similarity weights for aggregating the n-th-layer features, and n is the total number of layers of the feature network.
The aggregated feature f̄^n_i of the n-th layer of the feature network is the feature finally used for video object detection; f̄^n_i aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network, significantly enhancing the representational ability of the current frame's features.
The scaled cosine similarity weights for aggregating the n-th-layer features are computed as follows:
(1) Model the quality distribution of the flow with cosine similarity weights.
A shallow embedding network E maps the features to a dimension dedicated to computing similarity:

f^e_i = E(f_i),  f^e_{i−t→i} = E(f_{i−t→i})

where f^e_i and f^e_{i−t→i} are the mapped versions of the features f_i and f_{i−t→i}, and E is the embedding network.
Given the current frame feature f_i and the feature f_{i−t→i} propagated from the adjacent frame, the cosine similarity between them at spatial position p is:

w_{i−t→i}(p) = f^e_{i−t→i}(p) · f^e_i(p) / (|f^e_{i−t→i}(p)| |f^e_i(p)|)   (14)

The weights output by formula (14) are summed along the channel dimension so that the output weight becomes a two-dimensional W × H matrix, where W and H are respectively the width and height of the feature; this reduces the number of weight parameters to learn and makes the network easier to train.
(2) Extract a scaling factor directly from the appearance features of the video frames to model the quality distribution of each frame, obtaining frame-level scaled cosine similarity weights that serve as the frame-level aggregation weights of steps C1 to C3.
Given the current frame feature f_i and the propagated feature f_{i−t→i} of the (i−t)-th frame, the weight scaling factor output by the weight scaling network S is:

λ_{i−t} = S(f_i, f_{i−t→i})   (15)

Since λ_{i−t} is a channel-level vector while the cosine similarity weight w_{i−t→i} is a two-dimensional matrix, the two are combined by channel-level multiplication to obtain pixel-level weights. For each channel c of the output scaled weight, the value at each spatial position p is computed as:

ŵ_{i−t→i}(c, p) = λ_{i−t}(c) ⊗ w_{i−t→i}(p)   (16)

where ⊗ denotes channel-level multiplication.
The scaled cosine similarity weights are thus obtained through formulas (14), (15), and (16). Correspondingly, the weight of the propagated features of the (i+t)-th frame is:

ŵ_{i+t→i}(c, p) = λ_{i+t}(c) ⊗ w_{i+t→i}(p)

The weights at each position p are normalized along the frame dimension so that ŵ_{i−t→i}(p) + ŵ_i(p) + ŵ_{i+t→i}(p) = 1; the normalization is performed by the SoftMax function.
The embedding network and the weight scaling network share their first two layers: after the 1024-dimensional output of ResNet-101, two consecutive convolutional layers, a 1 × 1 convolution followed by a 3 × 3 convolution, are applied, and then two branch subnetworks are attached. The first branch is a 1 × 1 convolution serving as the embedding network, which outputs the mapped features. The second branch is likewise a 1 × 1 convolution, followed by a global average pooling layer, serving as the weight scaling network; it generates a 1024-dimensional feature vector, one element per channel of the ResNet-101 output feature, which measures the importance of the feature and controls the scale of the temporal aggregation weights.
This embodiment tests the outputs of three calibrated blocks of ResNet-101, i.e., the output res3c_relu of block res3, the output res4b22_relu of block res4, and the output conv_3×3_feat of block res5. This embodiment samples roughly once every 5 layers around res3c_relu and once every 3 layers within block res4, finally obtaining 9 layers for testing, with layer numbers (2, 7, 12, 19, 21, 24, 27, 30, 33); the mean average precision comparison of the detections is shown in Fig. 4. As can be seen from Fig. 4, the accuracy of res4b22_relu is the best, the performance of conv_3×3_feat is second, and that of res3c_relu is the worst. Below layer 17 the performance of the earlier layers falls off quickly, while the mean-average-precision gap among the later layers narrows, with detection accuracy peaking at layer 30. This demonstrates that propagating features at a shallower layer of the network performs better than at the deepest layers, but that as the layers become shallower this gain saturates, and the increased resolution even makes flow prediction harder, degrading overall detection performance.
The present embodiment is tested on ImageNet VID verifying collection.The feature propagation number of plies for adjusting FGFA, makes it
As the baseline of each level, test result is as shown in table 2.
2 multilayer of table polymerize accuracy comparison with single layer propagation characteristic
From the experimental results in Table 2, it can be seen that aggregating features propagated from the last layer of res4 (res4b22_relu) outperforms using the last layer of res5 (FGFA); hence propagating features from a shallower layer of the network performs better than from a deeper layer. The same results also show that propagating and aggregating the features of both res4 and res5 further improves detection performance (72.1 → 73.6, ↑1.5), demonstrating the improvement that multilayer feature aggregation brings to detection accuracy.
To further demonstrate the improvement that multilayer feature aggregation brings to the detection of small objects, the VID validation set is divided into three groups, small, medium, and large, according to ground-truth box area, as shown in Fig. 5. The size criterion is: an area in (0, 64²) is classified as small, an area in (64², 150²) as medium, and an area greater than 150² as large. The present embodiment counts the proportion of each group in the validation set, as shown in Fig. 5. As can be seen from Fig. 5, large objects dominate the VID validation set (60.0%) while small objects are scarce (13.5%). On each of these three groups of the ImageNet VID validation set, the present embodiment compares single deep-layer feature propagation (last layer of res5), single shallow-layer feature propagation (last layer of res4), and fused multilayer feature propagation (last layers of res4 + res5); the test results are shown in Table 3.
Table 3. Detection accuracy of the different methods for objects of different sizes on the ImageNet VID validation set
Method | mAP (%) (small) | mAP (%) (medium) | mAP (%) (large)
---|---|---|---
FGFA (res5) | 26.9 | 51.4 | 83.0
FGFA (res4) | 29.5 | 50.8 | 84.1
FGFA (res4+res5) | 30.1 | 51.9 | 84.5
As shown in Table 3, shallow-layer feature aggregation achieves higher small-object detection performance than deep-layer feature aggregation (26.9% → 29.5%, ↑2.6%), indicating that for small-object detection the error of shallow-layer feature propagation has less impact than the error of deep-layer feature propagation. Aggregating shallow and deep features simultaneously achieves the best detection performance on every subdivision of the validation set, showing that fusing deep and shallow features improves detection performance more comprehensively and demonstrating that the multilayer feature aggregation algorithm of the present invention effectively combines the respective advantages of the multilayer features.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.
Claims (4)
1. An optical flow multilayer frame feature propagation and aggregation method for video object detection, characterized by comprising two parts: a multilayer frame-level feature extraction and propagation process based on optical flow, and a frame-level feature aggregation process based on the multilayer propagated features;
The multilayer frame-level feature extraction and propagation process based on optical flow comprises the following steps:
Step S1: extracting the multilayer features of consecutive video frames;
The residual network ResNet-101 is used as the feature network for extracting frame-level features; different layers of the ResNet-101 network have different strides, the stride of the last three layers of residual block res5 is modified to 16, and a dilated convolution layer is added at the end of the network to reduce the dimensionality of the features output by residual block res5;
Step S2: extracting the optical flow of the video using the FlowNet optical flow network, and post-processing the flow so that it can perform size conversion for the differently sized features of each layer of the feature network;
Step S2.1: extracting the optical flow of the video using the Simple version of the FlowNet network; two adjacent video frames are concatenated directly along the channel dimension, and the concatenated 6-channel image is input into the FlowNet network to extract the optical flow;
Step S2.2: to match the feature sizes, up-sampling and down-sampling the optical flow to obtain optical flows suitable for multilayer feature propagation;
Step S3: using the optical flow, propagating the multilayer frame-level features of the (i-t)-th frame and the (i+t)-th frame to the i-th frame to obtain the multilayer propagated features;
The frame-level feature aggregation process based on the multilayer propagated features comprises the following steps:
Step C1: aggregating the propagated feature of the first layer of the feature network with the current frame feature to obtain the aggregated feature of the first layer of the feature network, as shown in the following formula:
wherein … denotes the aggregated feature of the first layer of the network, and … denotes the scaled cosine similarity weight for aggregating the first-layer features;
Step C2: inputting the aggregated feature from step C1 into the second layer of the feature network as the current frame feature to obtain the feature …, while simultaneously obtaining the propagated feature … of the second layer of the adjacent frame; aggregating again yields the aggregated feature of the second layer of the feature network, as shown in the following formula:
wherein … denotes the aggregated feature of the second layer of the network, and … denotes the scaled cosine similarity weight for aggregating the second-layer features;
Step C3: repeating the above aggregation process, aggregating the frame-level features layer by layer through the network, with the aggregated feature output by each layer used as the current frame feature of the next layer, until the aggregated feature of the last layer of the feature network is obtained, as shown in the following formula:
wherein … denotes the aggregated feature of the n-th layer of the network, … denotes the scaled cosine similarity weight for aggregating the n-th-layer features, and n is the total number of layers of the feature network;
The aggregated feature of the n-th layer of the feature network is the feature ultimately used for video object detection; it aggregates both the temporal information of multiple frames and the spatial information of multiple layers of the feature network, significantly enhancing the representational power of the current frame feature;
The method for calculating the scaled cosine similarity weight for aggregating the n-th-layer features is as follows:
(1) modeling the quality distribution of the optical flow using cosine similarity weights;
(2) extracting a scaling factor from the appearance features of the video frames to model the quality distribution of the video frames, obtaining frame-level scaled cosine similarity weights, which are used as the frame-level aggregation weights.
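One aggregation step (C1/C2) can be sketched in numpy. This is a minimal illustration under the assumption, consistent with the description above, that the aggregated map is a per-position weighted sum of the current-frame feature and the propagated neighbor features, with weights SoftMax-normalized across frames; shapes and the bare (unscaled) cosine weight are illustrative.

```python
import numpy as np

def softmax_over_frames(ws):
    # normalize the per-position weights across frames (SoftMax of step C3)
    ws = np.stack(ws)                                # (num_frames, H, W)
    e = np.exp(ws - ws.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def cosine_weight(f_cur, f_prop, eps=1e-8):
    # per-position cosine similarity, summed over channels -> (H, W)
    num = (f_cur * f_prop).sum(axis=0)
    den = np.linalg.norm(f_cur, axis=0) * np.linalg.norm(f_prop, axis=0) + eps
    return num / den

def aggregate_one_layer(f_cur, f_props):
    # steps C1/C2: weighted sum of current-frame and propagated features
    feats = [f_cur] + f_props
    ws = softmax_over_frames([cosine_weight(f_cur, f) for f in feats])
    return sum(w[None, :, :] * f for w, f in zip(ws, feats))

rng = np.random.default_rng(0)
f_cur = rng.standard_normal((8, 5, 5))    # current-frame feature, layer l
f_prev = rng.standard_normal((8, 5, 5))   # propagated from frame i-t
f_next = rng.standard_normal((8, 5, 5))   # propagated from frame i+t
agg = aggregate_one_layer(f_cur, [f_prev, f_next])
```

For step C3, `agg` would be fed into the next layer of the feature network as the current-frame input, repeating until the last layer.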
2. The optical flow multilayer frame feature propagation and aggregation method for video object detection according to claim 1, characterized in that the specific method of step S2.2 is as follows:
Step S2.2.1: given the current frame image Ii of the video and its adjacent frame image Ii-t, the optical flow output by the FlowNet network is as shown in the following formula:
wherein … denotes the optical flow between the current frame Ii and its adjacent frame Ii-t, the subscript 8 indicates a stride of 8, and … denotes the optical flow network FlowNet;
Step S2.2.2: up-sampling the optical flow to obtain the optical flow corresponding to a feature stride of 4, as shown in the following formula:
wherein … denotes the optical flow between the current frame Ii and its adjacent frame Ii-t, the subscript 4 indicates a stride of 4, and upSample() denotes the nearest-neighbor up-sampling function;
Step S2.2.3: down-sampling the optical flow to obtain the optical flow corresponding to a feature stride of 16, as shown in the following formula:
wherein … denotes the optical flow between the current frame Ii and its adjacent frame Ii-t, the subscript 16 indicates a stride of 16, and downSample() denotes average-pooling down-sampling;
Step S2.2.4: if …, then correspondingly …, where C is the number of channels, which defaults to 2, and H and W are respectively the height and width of the optical flow; the optical flow suitable for multilayer feature propagation is obtained, as shown in the following formula:
wherein s denotes the feature stride.
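The flow resizing of claim 2 can be sketched in numpy. One subtlety that the elided formulas presumably encode (this is an assumption, since flow-resizing implementations commonly require it): flow displacements are measured in pixels at the flow map's own resolution, so when the map is spatially resized the displacement values must be rescaled by the same factor.

```python
import numpy as np

def upsample_flow_x2(flow):
    # nearest-neighbor upsample (stride 8 -> stride 4); displacement values
    # double because each output pixel covers half the physical distance
    up = flow.repeat(2, axis=1).repeat(2, axis=2)
    return up * 2.0

def downsample_flow_x2(flow):
    # 2x2 average-pool downsample (stride 8 -> stride 16); displacements halved
    c, h, w = flow.shape
    pooled = flow.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return pooled * 0.5

flow8 = np.ones((2, 4, 4))           # C=2 channels (dx, dy), stride-8 flow
flow4 = upsample_flow_x2(flow8)      # stride-4 flow, shape (2, 8, 8)
flow16 = downsample_flow_x2(flow8)   # stride-16 flow, shape (2, 2, 2)
```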
3. The optical flow multilayer frame feature propagation and aggregation method for video object detection according to claim 1, characterized in that the specific method of step S3 is as follows:
Given the multi-step optical flow …, the propagation layer number l, and the (i-t)-th frame image Ii-t, the final propagated feature is calculated by the following formula:
wherein l denotes the layer index, l ∈ (1, n), n is the total number of layers of the feature network, and … denotes the output of the l-th layer of the feature network; … denotes the warp mapping function, which maps the value at position p in the (i-t)-th frame feature fi-t to the corresponding position p+δp of the current frame i, where δp denotes the positional offset;
The multilayer propagated feature of the (i+t)-th frame is then calculated by the following formula:
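A minimal numpy sketch of the warp mapping in claim 3. Assumptions beyond the text: the warp is implemented as backward sampling (each current-frame position p reads the neighbor feature at p + δp), nearest-neighbor interpolation stands in for the bilinear interpolation such warps typically use, and the (dx, dy) channel layout of the flow is illustrative.

```python
import numpy as np

def warp_nearest(feat, flow):
    """Sample the neighbor-frame feature at p + δp for every current-frame
    position p (nearest-neighbor stand-in for a bilinear warp)."""
    c, h, w = feat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # flow[0]/flow[1]: x/y displacement δp at each position (assumed layout)
    src_x = np.clip(np.round(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[1]).astype(int), 0, h - 1)
    return feat[:, src_y, src_x]

feat = np.arange(16, dtype=float).reshape(1, 4, 4)  # toy neighbor feature
flow = np.zeros((2, 4, 4))
flow[0] += 1.0                      # δp: sample one pixel to the right
warped = warp_nearest(feat, flow)
```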
4. A frame-level feature aggregation method for video object detection according to claim 1, characterized in that:
the specific method of modeling the quality distribution of the optical flow using cosine similarity weights described in step C3 is as follows:
a shallow mapping network … is used to map the features into a dimension dedicated to computing similarity, as shown in the following formula:
wherein … are the features fi and fi-t→i after mapping, and … is the mapping network;
the specific method of extracting a scaling factor from the appearance features of the video frames to model the quality distribution of the video frames and obtain the frame-level scaled cosine similarity weights is as follows:
given the current frame feature fi and the feature fi-t→i propagated from the adjacent frame, the cosine similarity between them at spatial position p is:
the weights output by formula (14) are summed along the channel dimension so that the output weight becomes a two-dimensional matrix of dimension W × H, where W and H are respectively the width and height of the features; this reduces the number of weight parameters to be learned and makes the network easier to train;
given the current frame feature fi and the propagated feature fi-t→i of the (i-t)-th frame, the weight scaling factor output by the weight scaling network … is:
since λi-t is a channel-level vector while the cosine similarity weight wi-t→i is a matrix in the two-dimensional plane, the two are combined by channel-level multiplication in order to obtain pixel-level weights; for each channel c of the output scaled weight, the pixel value at each spatial position p is calculated by the following formula:
wherein … denotes channel-level multiplication;
the scaled cosine similarity weights are obtained from formulas (14), (15), and (16);
correspondingly, the weight of the propagated feature of the (i+t)-th frame is:
the weights are normalized at each position p along the frame dimension so that …; the normalization operation is performed by the SoftMax function;
the mapping network and the weight scaling network share their first two layers: a 1 × 1 convolution followed by a 3 × 3 convolution applied to the 1024-dimensional vector output by ResNet-101; these two consecutive convolutional layers are followed by two branch subnetworks; the first branch is a 1 × 1 convolution serving as the mapping network for outputting the mapped features; the second branch is likewise a 1 × 1 convolution followed by a global average pooling layer serving as the weight scaling network, which generates a 1024-dimensional feature vector, one element per channel of the ResNet-101 output feature, used to measure the importance of the features and to control the scaling of the aggregation weights between features.
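The weight construction of formulas (14)-(16) in claim 4 can be sketched as follows. This is a hypothetical numpy sketch: random arrays stand in for the mapped features and for the scaling vector λi-t that the weight scaling network would output, and the channel summation of formula (14) is folded into the cosine computation.

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 8, 5, 5                       # toy channel count and spatial size

f_cur = rng.standard_normal((C, H, W))  # stand-in for mapped current feature
f_prop = rng.standard_normal((C, H, W)) # stand-in for mapped propagated feature

# formula (14): cosine similarity at each position p, summed over channels,
# giving a two-dimensional weight matrix w_{i-t->i} of size H x W
num = (f_cur * f_prop).sum(axis=0)
den = np.linalg.norm(f_cur, axis=0) * np.linalg.norm(f_prop, axis=0) + 1e-8
w2d = num / den                         # (H, W)

# formula (15): channel-level scaling factor λ_{i-t} from the weight
# scaling network (random stand-in here)
lam = rng.standard_normal(C)            # (C,)

# formula (16): channel-level multiplication combines the channel-level
# vector with the planar matrix, yielding pixel-level weights
w_scaled = lam[:, None, None] * w2d[None, :, :]   # (C, H, W)
```

The `w_scaled` maps from all neighbor frames would then be SoftMax-normalized per position across frames, as the claim states.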
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910230235.2A CN109993096B (en) | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109993096A true CN109993096A (en) | 2019-07-09 |
CN109993096B CN109993096B (en) | 2022-12-20 |
Family
ID=67131468
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910230235.2A Active CN109993096B (en) | 2019-03-26 | 2019-03-26 | Optical flow multilayer frame feature propagation and aggregation method for video object detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109993096B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180268208A1 (en) * | 2017-03-20 | 2018-09-20 | Microsoft Technology Licensing, Llc | Feature flow for video recognition |
CN108242062A (en) * | 2017-12-27 | 2018-07-03 | 北京纵目安驰智能科技有限公司 | Method for tracking target, system, terminal and medium based on depth characteristic stream |
CN109376611A (en) * | 2018-09-27 | 2019-02-22 | 方玉明 | A kind of saliency detection method based on 3D convolutional neural networks |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110400305A (en) * | 2019-07-26 | 2019-11-01 | 哈尔滨理工大学 | A kind of object detection method based on deep learning |
CN110852199A (en) * | 2019-10-28 | 2020-02-28 | 中国石化销售股份有限公司华南分公司 | Foreground extraction method based on double-frame coding and decoding model |
JP2022551396A (en) * | 2019-11-20 | 2022-12-09 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Motion recognition method, apparatus, computer program and computer device |
JP7274048B2 (en) | 2019-11-20 | 2023-05-15 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Motion recognition method, apparatus, computer program and computer device |
US11928893B2 (en) | 2019-11-20 | 2024-03-12 | Tencent Technology (Shenzhen) Company Limited | Action recognition method and apparatus, computer storage medium, and computer device |
CN111144376A (en) * | 2019-12-31 | 2020-05-12 | 华南理工大学 | Video target detection feature extraction method |
CN111144376B (en) * | 2019-12-31 | 2023-12-05 | 华南理工大学 | Video target detection feature extraction method |
CN113673545A (en) * | 2020-05-13 | 2021-11-19 | 华为技术有限公司 | Optical flow estimation method, related device, equipment and computer readable storage medium |
CN112307872A (en) * | 2020-06-12 | 2021-02-02 | 北京京东尚科信息技术有限公司 | Method and device for detecting target object |
CN111860293B (en) * | 2020-07-16 | 2023-12-22 | 中南民族大学 | Remote sensing scene classification method, device, terminal equipment and storage medium |
CN111860293A (en) * | 2020-07-16 | 2020-10-30 | 中南民族大学 | Remote sensing scene classification method and device, terminal equipment and storage medium |
CN111950612A (en) * | 2020-07-30 | 2020-11-17 | 中国科学院大学 | FPN-based weak and small target detection method for fusion factor |
CN112307889A (en) * | 2020-09-22 | 2021-02-02 | 北京航空航天大学 | Face detection algorithm based on small auxiliary network |
CN112307889B (en) * | 2020-09-22 | 2022-07-26 | 北京航空航天大学 | Face detection algorithm based on small auxiliary network |
CN112394356A (en) * | 2020-09-30 | 2021-02-23 | 桂林电子科技大学 | Small-target unmanned aerial vehicle detection system and method based on U-Net |
CN112394356B (en) * | 2020-09-30 | 2024-04-02 | 桂林电子科技大学 | Small target unmanned aerial vehicle detection system and method based on U-Net |
CN111968064A (en) * | 2020-10-22 | 2020-11-20 | 成都睿沿科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN111968064B (en) * | 2020-10-22 | 2021-01-15 | 成都睿沿科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN112966581B (en) * | 2021-02-25 | 2022-05-27 | 厦门大学 | Video target detection method based on internal and external semantic aggregation |
CN112966581A (en) * | 2021-02-25 | 2021-06-15 | 厦门大学 | Video target detection method based on internal and external semantic aggregation |
CN113223044A (en) * | 2021-04-21 | 2021-08-06 | 西北工业大学 | Infrared video target detection method combining feature aggregation and attention mechanism |
CN113570608B (en) * | 2021-06-30 | 2023-07-21 | 北京百度网讯科技有限公司 | Target segmentation method and device and electronic equipment |
CN113570608A (en) * | 2021-06-30 | 2021-10-29 | 北京百度网讯科技有限公司 | Target segmentation method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109993096B (en) | 2022-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109993096A (en) | A kind of light stream multilayer frame feature propagation and polymerization towards video object detection | |
CN111784602B (en) | Method for generating countermeasure network for image restoration | |
CN109101975A (en) | Image, semantic dividing method based on full convolutional neural networks | |
CN108710863A (en) | Unmanned plane Scene Semantics dividing method based on deep learning and system | |
CN109903255A (en) | A kind of high spectrum image Super-Resolution method based on 3D convolutional neural networks | |
CN109671023A (en) | A kind of secondary method for reconstructing of face image super-resolution | |
CN110490919A (en) | A kind of depth estimation method of the monocular vision based on deep neural network | |
CN109978807A (en) | A kind of shadow removal method based on production confrontation network | |
CN103248906B (en) | Method and system for acquiring depth map of binocular stereo video sequence | |
CN105657402A (en) | Depth map recovery method | |
CN109727270A (en) | The movement mechanism and analysis of texture method and system of Cardiac Magnetic Resonance Images | |
CN111861906A (en) | Pavement crack image virtual augmentation model establishment and image virtual augmentation method | |
CN111914726B (en) | Pedestrian detection method based on multichannel self-adaptive attention mechanism | |
CN110334719A (en) | The method and system of object image are built in a kind of extraction remote sensing image | |
CN110992366A (en) | Image semantic segmentation method and device and storage medium | |
CN114612937A (en) | Single-mode enhancement-based infrared and visible light fusion pedestrian detection method | |
CN107564007A (en) | The scene cut modification method and system of amalgamation of global information | |
CN110246085A (en) | A kind of single-image super-resolution method | |
CN103226825B (en) | Based on the method for detecting change of remote sensing image of low-rank sparse model | |
CN109658361A (en) | A kind of moving scene super resolution ratio reconstruction method for taking motion estimation error into account | |
Choi | Utilizing unet for the future traffic map prediction task traffic4cast challenge 2020 | |
CN111915589A (en) | Stereo image quality evaluation method based on hole convolution | |
CN105931189A (en) | Video ultra-resolution method and apparatus based on improved ultra-resolution parameterized model | |
CN111179272A (en) | Rapid semantic segmentation method for road scene | |
CN114463237A (en) | Real-time video rain removing method based on global motion compensation and inter-frame time domain correlation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||