CN113592878A - Compact multi-scale video foreground segmentation method - Google Patents

Compact multi-scale video foreground segmentation method

Info

Publication number
CN113592878A
CN113592878A
Authority
CN
China
Prior art keywords
convolution
compact
scale
layer
rfc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110729146.XA
Other languages
Chinese (zh)
Inventor
潘志松
张锦
李阳
潘欣冉
周星宇
贺正芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202110729146.XA priority Critical patent/CN113592878A/en
Publication of CN113592878A publication Critical patent/CN113592878A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

A compact multi-scale video foreground segmentation method relates to the technical field of computer vision. From the perspective of high-level (i.e., large-scale) and multi-scale feature encoding, a multi-scale compact sampling module is proposed to improve the encoding of scene spatial-domain features by deep networks. The module consists of a series of parallel compact atrous (hole) convolutions with different receptive fields and captures multi-scale features in a compact manner to combat the kernel degradation problem. Specifically, each compact atrous convolution is carefully designed as a cascade structure that covers all input neurons in its receptive field exactly once, leaving neither holes nor overlap, so the proposed multi-scale compact sampling module can perceive more complete multi-scale information over different receptive fields without significantly increasing the number of model parameters. The module therefore avoids kernel degradation while maintaining high running efficiency.

Description

Compact multi-scale video foreground segmentation method
Technical Field
The invention relates to the technical field of computer vision, and in particular to a pixel-level classification task in that field: video foreground segmentation.
Background
Video foreground segmentation is a basic pixel-level classification task in computer vision. Given a scene S, a foreground segmentation algorithm learns a representation of S that separates the background from foreground moving objects in a video sequence. The extracted foreground can provide a good compromise between detection quality and computation time for complex vision applications. As a preprocessing step for higher-level tasks, video foreground segmentation therefore has wide real-world application value, including anomaly detection (e.g., abandoned-object detection, product defect detection, and fire detection), vehicle counting, tracking, and accident detection, ship and maritime traffic monitoring, visual observation of animal behavior, visual monitoring of natural environments (e.g., floating-object detection), human behavior analysis, and background replacement. Since the precision of this preprocessing step strongly influences the performance and efficiency of subsequent tasks, it is important to learn an effective scene representation that extracts an accurate foreground target.
Video foreground segmentation must extract foreground moving objects of varying sizes from the background: as a foreground object approaches the camera from far to near, its size in the scene grows from small to large. A robust method therefore needs to segment scene targets accurately at different scales. Encoding a multi-scale spatial-domain representation of the scene is a central concern of foreground segmentation network design, since it lets the model reason comprehensively over context at different scales. Within multi-scale spatial-domain encoding, the difficulty lies in encoding the large-scale spatial features.
Methods based on Fully Convolutional Networks (FCNs) enlarge the receptive field of neurons by using downsampling layers (convolution or pooling operations with stride greater than 1) to encode large-scale spatial features. Large-scale context supports semantic inference from the overall appearance of an object and avoids "blind men and the elephant"-style local inference. However, stacking more downsampling layers loses spatial information that the decoding process cannot recover. Decoding from an earlier, higher-resolution stage of the encoder is not a good strategy either, because it forgoes inference over higher-level semantics. In short, network design must balance preserving complete spatial information against encoding higher-level features.
In recent years, atrous convolution (also called dilated or hole convolution) has served as an effective strategy for resolving the conflict between high resolution and large-scale feature encoding. Because the kernel of an atrous convolution enlarges its field of view by inserting "holes" between its parameters through a dilation strategy, it can perceive large-scale contextual information without excessive downsampling. Atrous convolution nevertheless has two limitations. The first is the kernel degradation problem: as the dilation rate increases, the kernel's samples within the receptive field become increasingly sparse, degrading the kernel's performance. The second is the single-scale problem: in a feature map generated by one atrous convolution, the information of every neuron derives from the same receptive field, so the semantic encoding process is effectively limited to a single scale, whereas foreground objects in a scene typically occur at multiple scales.
To obtain a multi-scale scene representation, feature pyramid strategies (e.g., ASPP) extract multi-scale features with several groups of parallel atrous convolutions, but because they usually contain atrous convolutions with large dilation rates, they are also limited by the kernel degradation problem.
Disclosure of Invention
The invention aims to provide a compact multi-scale video foreground segmentation method. It designs a new convolution module and provides two strategies for constructing it: a zoom-in focusing strategy (decreasing dilation rates) and a zoom-out focusing strategy (increasing dilation rates). The module encodes multi-scale features compactly, which counteracts kernel degradation and improves segmentation precision.
A compact multi-scale video foreground segmentation method: let x[i] and y[i] denote the input signal and the output signal, respectively. The atrous (hole) convolution operation is defined as

y[i] = Σ_{idx=1}^{K} x[i + r·idx]·f[idx]   (3-1)

where f[idx] is a filter of length K and the dilation rate r represents the corresponding sampling stride. When r = 1, the atrous convolution degenerates to the standard convolution operation. When the k × k kernel of a 2D atrous convolution samples/convolves a region of size k_a × k_a of the input feature x, the receptive field of the atrous convolution is said to be k_a, where

k_a = k + (r − 1)·(k − 1)   (3-2)

A larger dilation rate therefore means a larger receptive field. To obtain a wider receptive field and richer context information, a plurality of atrous convolutions are applied, in parallel or in cascade, to a high-level feature map that has already undergone a series of convolution and downsampling operations.
suppose CACnThe expression is composed of n cascaded void convolution layers, and the convolution kernel size and expansion ratio of each layer are respectively { k1,k2,…,knAnd { r }1,r2,…,rn};CACnThe information for any neuron in the output signature map is derived from the input signature map within its corresponding receptive field, and there is neither information omission nor "oversampling";
the reception fields of the compact hole convolution layers are recorded as RFC, and under the condition of no information omission and 'overlapped sampling', the relation between the RFC and each layer convolution kernel satisfies the formulas 3-3 and 3-4.
RFC=k1k2…kn (3-3)
Figure RE-GDA0003279083270000031
According to the size of RFC, for CACnThe design is performed to determine the number of convolution layers n, the size k of each layer of convolution kernel, and its expansion rate r.
When RFC ≤ 5, a single convolution layer meets the receptive-field requirement: k_1 = RFC and r_1 = 1, and the compact atrous convolution degenerates to a standard convolution layer. As RFC grows, the compact atrous convolution must take a multi-layer cascaded form; to keep the output neuron at the center of the receptive field, kernel sizes are always chosen to be odd.
When 5 < RFC ≤ 25, two cascaded convolution layers satisfying k_1 × k_2 = RFC are used:
1) r_1 = k_2, r_2 = 1: since this arrangement uses decreasing dilation rates, the region perceived by the convolution contracts gradually, i.e., the "zoom-in focusing" strategy; the information of the input features within the RFC range is first "squeezed" to a k_2 × k_2 region and then further focused to the central neuron;
2) r_1 = 1, r_2 = k_1: since this arrangement uses increasing dilation rates, the region perceived by the neuron expands gradually, i.e., the "zoom-out focusing" strategy; k_1 × k_1 local information is collected first, and then the information of the k_1 × k_1 regions at different positions across the whole RFC is concentrated onto the central neuron.
When RFC > 25, CAC_n (n ≥ 3) is constructed recursively. CAC_n can be regarded as two parts: the nth (or 1st) atrous convolution layer, and the front n−1 (or rear n−1) layers forming a CAC_{n−1}. Here CAC_{n−1} is treated as an ordinary atrous convolution whose kernel size is k_1·k_2·…·k_{n−1} (resp. k_2·k_3·…·k_n) and whose dilation rate is 1. The nth (or 1st) atrous convolution layer is then cascaded with this CAC_{n−1} according to the zoom-out (or zoom-in) focusing strategy. As in the two-level atrous cascade, the receptive field of CAC_n is RFC = k_1·k_2·…·k_n. By this recursion, the n-level cascade CAC_n as a whole achieves compact sampling of the features within its receptive field. CAC_n is written CAC(k_1, k_2, …, k_n; c_1, c_2, …, c_n), r = (r_1, r_2, …, r_n), where k_i, c_i, and r_i denote the kernel size, number of output channels, and dilation rate of the ith layer, respectively.
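For illustration, the construction rules above can be expressed as the following sketch (an illustrative assumption of this description, not code from the patent); the function design_cac and its factorization order are hypothetical, and kernel sizes are restricted to odd values no larger than 5 as stated in the text:

```python
from math import prod

def design_cac(rfc, zoom='in'):
    """Derive (kernels, dilations) of a compact atrous convolution whose
    receptive field is exactly `rfc`, following formulas 3-3 and 3-4."""
    # Factor rfc into odd kernel sizes no larger than 5 (formula 3-3).
    kernels, rest = [], rfc
    for k in (5, 3):
        while rest % k == 0 and rest > 1:
            kernels.append(k)
            rest //= k
    if rest != 1:
        raise ValueError(f'RFC={rfc} has no odd-factor (<=5) decomposition')
    kernels.sort()  # small kernels preferentially at the bottom
    n = len(kernels)
    # Formula 3-4: zoom-in uses decreasing rates, zoom-out increasing.
    if zoom == 'in':
        dilations = [prod(kernels[i + 1:]) for i in range(n)]
    else:
        dilations = [prod(kernels[:i]) for i in range(n)]
    return kernels, dilations

print(design_cac(15, 'in'))   # ([3, 5], [5, 1])  -> r1 = k2, r2 = 1
print(design_cac(15, 'out'))  # ([3, 5], [1, 3])  -> r1 = 1,  r2 = k1
print(design_cac(45, 'out'))  # ([3, 3, 5], [1, 3, 9])
```

Consistent with the text, design_cac(13) raises an error, since 13 admits no decomposition into odd factors no larger than 5.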
Preferably, the compact atrous convolution is extended to a multi-scale space by a multi-scale compact sampling module consisting of 5 groups of parallel compact atrous convolutions with different receptive fields. The module comes in two variants, CompactASPP_i and CompactASPP_o: CompactASPP_i is based on the zoom-in focusing strategy, with the dilation rates of each compact atrous convolution arranged in decreasing order; CompactASPP_o is based on the zoom-out focusing strategy, with the dilation rates arranged in increasing order.
Preferably, since the input feature map of a compact atrous convolution has many channels while its output has few, and in order to control module size and keep computation efficient, small kernels are placed preferentially at the bottom layers of the compact atrous convolution, and every layer's number of output channels matches that of the last layer.
From the perspective of high-level (i.e., large-scale) and multi-scale feature encoding, the invention provides a more effective module, the multi-scale compact sampling module CompactASPP, to improve a deep network's encoding of scene spatial-domain features. The module consists of a series of parallel Compact Atrous Convolutions (CACs) with different receptive fields. CompactASPP captures multi-scale features in a compact manner to cope with kernel degradation. Specifically, each CAC is designed as a cascade structure that covers all input neurons in its receptive field completely, leaving neither holes nor overlap, so the proposed CompactASPP can perceive more complete multi-scale information over different receptive fields without significantly increasing model parameters. CompactASPP thus avoids kernel degradation while maintaining running efficiency.
A Fast X-Net network is designed based on CompactASPP as an improvement of X-Net. The multiple-input, multiple-output architecture of X-Net exploits temporal information effectively, but extracting multi-scale spatial features with an image pyramid strategy makes it computationally expensive. To pursue faster segmentation, the CompactASPP module is embedded in the X-Net framework and the original image pyramid strategy is removed, which improves processing speed by 63.6%. Fast X-Net also achieves higher segmentation accuracy.
The invention introduces the concept of compact sampling to design a new convolution module and provides two strategies for constructing it: the zoom-in and zoom-out focusing strategies;
the invention designs a new multi-scale module, CompactASPP, based on the new convolution module, and the module can compactly encode multi-scale features to solve the problem of nuclear degradation and improve the segmentation precision;
the invention provides a Fast X-Net method based on CompactASPP, which reaches a leading level on three data sets of CDnet 2014, SBI2015 and UCSD.
Drawings
Fig. 1a is a sampling schematic of standard convolution (dilation rate r = 1).
Fig. 1b is a sampling schematic of atrous convolution with dilation rate r = 4.
Fig. 2 is a schematic diagram of atrous spatial pyramid pooling (ASPP).
Fig. 3 is a schematic illustration of the "checkerboard effect" and "overlapping sampling" phenomena of cascaded atrous convolution.
Fig. 4 is a compact sampling schematic of cascaded atrous convolution.
Fig. 5 is a schematic diagram of constructing CAC_n from CAC_{n−1} and an nth convolution layer.
Fig. 6 is a schematic diagram of the construction of CompactASPP.
Fig. 7 is the Fast X-Net framework diagram.
Fig. 8 is a visualization of the 20-frame experiment on the CDnet 2014 dataset.
Fig. 9 is a visualization of multi-scale modules on a CDnet 2014 test frame.
Fig. 10 is a visualization of Fast X-Net in the Toscana scene.
Detailed Description
The compact atrous convolution is built from atrous convolution, and on this basis a compact multi-scale feature fusion module, CompactASPP, is proposed.
2.1 Atrous convolution
Atrous Convolution (AC), also known as dilated convolution, originated in the à trous wavelet decomposition algorithm. DeepLab v1 first deployed atrous convolution in the FCN framework, enlarging the receptive field and capturing longer-range context dependencies while reducing downsampling and keeping feature maps at high resolution. The one-dimensional case illustrates the operation. Let x[i] and y[i] denote the input and output signals, respectively; the atrous convolution operation can be defined as

y[i] = Σ_{idx=1}^{K} x[i + r·idx]·f[idx]   (3-1)

where f[idx] is a filter of length K and the dilation rate r represents the corresponding sampling stride. When r = 1, the atrous convolution degenerates to the standard convolution operation. When the k × k kernel of a 2D atrous convolution samples/convolves a region of size k_a × k_a of the input feature x, the receptive field of the atrous convolution is said to be k_a:

k_a = k + (r − 1)·(k − 1)   (3-2)

From formula 3-2 it is easy to see that a larger dilation rate implies a larger receptive field. To obtain a wider receptive field and richer context information, multiple atrous convolutions are typically applied, in parallel or in cascade, to a high-level feature map that has already undergone a series of convolution and downsampling operations.
In the parallel mode, multiple atrous convolution layers sample the same input feature map at different dilation rates to obtain receptive fields of different scales. This pattern is commonly called Atrous Spatial Pyramid Pooling (ASPP). However, because it usually contains atrous kernels with excessively large dilation rates (e.g., r = 6, 12, 18 in Fig. 2), it faces the kernel degradation problem. Specifically, the information of a neuron p in layer l derives from the k_a × k_a region of layer l−1 centered on p (the region contains k_a × k_a neurons), yet the number of neurons actually used is only k × k. For example, when k = 3 and r = 6, only 9 of the 169 neurons in the receptive-field region participate in the computation. Related studies show that as the dilation rate r increases, the sampling of an atrous kernel (e.g., 3 × 3) within its receptive field becomes sparser and sparser, and its feature-encoding capability weakens until it regresses to a 1 × 1 kernel; that is, the kernel weights at all positions other than the center approach 0.
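A minimal sketch (an assumption of this rewrite, not code from the patent) of formula 3-2 and the sparsity behind kernel degradation — an atrous kernel of size k with dilation rate r covers k_a × k_a positions but uses only k × k of them:

```python
def receptive_field(k: int, r: int) -> int:
    """Formula 3-2: effective receptive field of a k x k atrous kernel."""
    return k + (r - 1) * (k - 1)

for r in (1, 2, 6, 12, 18):
    ka = receptive_field(3, r)
    print(f'r={r:2d}: k_a={ka:2d}, sampling density = 9/{ka * ka}')
# r=6 gives k_a=13 and density 9/169, matching the example in the text.
```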
As shown in Fig. 3, in the cascade mode several atrous convolution layers with smaller dilation rates produce a large receptive field by stacking, which alleviates kernel degradation. However, when all layers use the same dilation rate r, the topmost output neurons sample in a "checkerboard effect" pattern and lose most of the information. Fig. 3a shows the "checkerboard effect" of cascaded atrous convolution and Fig. 3b its "overlapping sampling" phenomenon. Fig. 3a depicts the checkerboard effect produced by cascading two atrous convolution layers with k = 3 and r = 2: the information of the layer-l output neuron at the star position derives from the input features at the 9 dark-circle positions; recursing forward, the information of the 9 dark-circle neurons of the layer l−1 convolution derives from the input features at the dark-grid positions, where a deeper color indicates more samplings. Viewing the 2-level cascade as a whole, the information of the output neuron at the star position comes from the input at the dark-grid positions. As an improvement, Hybrid Atrous Convolution (HAC) adopts gradually increasing dilation rates in the cascade; the upper-layer output neuron can then cover a solid square region (see Fig. 3b), eliminating the "holes" of the checkerboard effect. However, the overlapping sampling present in HAC limits further enlargement of the receptive field and also brings redundant computation. Can overlapping sampling be eliminated while the receptive field is enlarged further? To explore this question, the invention studies how to configure the dilation rates of cascaded atrous convolutions reasonably, and proposes a convolution with a compact sampling property, named Compact Atrous Convolution (CAC). In Fig. 3, a: r_{l−1} = r_l = 2; b: r_{l−1} = 1, r_l = 2; the star marks an output neuron of the layer-l convolution, whose information derives from the red-circled neurons, i.e., the output neurons of the layer l−1 convolution; the information of the dark-circle neurons derives from the layer l−1 input neurons at the dark-grid positions, and a deeper color indicates more samples.
2.2 Compact atrous convolution
Because spatial neighborhoods are correlated, foreground/background semantic inference benefits from rich context dependencies. In theory, standard convolution could compactly encode context at different scales directly. For a large receptive field, however, extracting long-range dependence with one large kernel is not a judicious choice, as it is likely to cause severe overfitting. This section therefore constructs CACs from small kernels (no larger than 5 × 5) to encode context dependencies at different scales.
Suppose CAC_n denotes a compact atrous convolution composed of n cascaded atrous convolution layers with kernel sizes {k_1, k_2, …, k_n} and dilation rates {r_1, r_2, …, r_n}. Ideally, the information of any neuron in the CAC_n output feature map should derive from the input feature map within its corresponding receptive field, with neither information omission ("holes") nor "overlapping sampling". For convenience, the receptive field of a CAC is denoted RFC (Receptive Field of CAC). It is readily seen that, without "holes" or "overlapping sampling", the relation between the RFC and the per-layer kernels satisfies formulas 3-3 and 3-4:

RFC = k_1·k_2·…·k_n   (3-3)

r_i = k_{i+1}·k_{i+2}·…·k_n (zoom-in, decreasing rates)  or  r_i = k_1·k_2·…·k_{i−1} (zoom-out, increasing rates)   (3-4)

with the convention that an empty product equals 1. Given the size of the RFC, CAC_n is then designed by determining the number of convolution layers n, each layer's kernel size k, and its dilation rate r.
(1) When RFC ≤ 5, a single convolution layer meets the receptive-field requirement: k_1 = RFC and r_1 = 1, and CAC_1 degenerates to a standard convolution layer. As RFC increases, a CAC inevitably takes a multi-layer cascaded form. To keep the output neuron at the center of the receptive field, kernel sizes are always chosen to be odd;
(2) When 5 < RFC ≤ 25, two cascaded convolution layers satisfying k_1 × k_2 = RFC are used. Specifically, the invention provides two strategies for constructing the compact atrous convolution: 1) r_1 = k_2, r_2 = 1. Because this arrangement uses decreasing dilation rates, the region perceived by the convolution contracts gradually, so it is named the "zoom-in" focusing strategy: the information of the input features within the RFC is first "squeezed" to one k_2 × k_2 region and then further focused to the central neuron. In Fig. 4a, the input features in the 9 × 9 receptive field are first compressed to the dark-circle positions by the layer l−1 convolution and then focused to the star position by the layer-l convolution. 2) r_1 = 1, r_2 = k_1. Because this arrangement uses increasing dilation rates, the region perceived by the neuron expands gradually, hence the name "zoom-out" focusing strategy: k_1 × k_1 local information is gathered first, and then the information of the k_1 × k_1 regions at different positions across the whole RFC is concentrated onto the central neuron.
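The compactness of such a cascade can be checked numerically. The following sketch (an assumed verification, not from the patent) enumerates, in 1-D, the input offsets sampled by a 2-layer cascade and confirms that a compact configuration hits every position in the RFC exactly once, while equal dilation rates produce holes and repeats:

```python
from collections import Counter

def cascade_samples(kernels, dilations):
    """Input offsets (relative to the output neuron) sampled by a cascade
    of centered 1-D atrous convolutions, with multiplicity."""
    offsets = [0]
    for k, r in zip(kernels, dilations):
        taps = [(i - (k - 1) // 2) * r for i in range(k)]
        offsets = [o + t for o in offsets for t in taps]
    return Counter(offsets)

# Zoom-out CAC with k=(3,3), r=(1,3): RFC = 9, every offset hit once.
print(cascade_samples((3, 3), (1, 3)))  # offsets -4..4, all with count 1
# Equal rates r=(2,2): odd offsets are never sampled, others repeat.
print(cascade_samples((3, 3), (2, 2)))  # the "checkerboard effect"
```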
(3) When RFC > 25, CAC_n (n ≥ 3) is constructed recursively. CAC_n can be regarded as two parts: the nth (or 1st) atrous convolution layer, and the front n−1 (or rear n−1) layers forming a CAC_{n−1}. Here CAC_{n−1} is treated as an ordinary atrous convolution whose kernel size is k_1·k_2·…·k_{n−1} (resp. k_2·k_3·…·k_n) and whose dilation rate is 1. The nth (or 1st) atrous convolution layer is then cascaded with CAC_{n−1} according to the zoom-out (or zoom-in) focusing strategy (see Fig. 5). As in the two-level atrous cascade, the receptive field of CAC_n is RFC = k_1·k_2·…·k_n. By this recursion, the n-level cascade CAC_n as a whole achieves compact sampling of the features within its receptive field. For brevity, CAC_n is written CAC(k_1, k_2, …, k_n; c_1, c_2, …, c_n), r = (r_1, r_2, …, r_n), where k_i, c_i, and r_i denote the kernel size, number of output channels, and dilation rate of the ith layer, respectively. It should be emphasized that a Batch Normalization (BN) layer is additionally inserted between convolution layers to reduce covariate shift and ease training.
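A hedged Keras sketch of a CAC block — cascaded dilated Conv2D layers with BN between them, as described above. This is an assumed illustration, not the authors' code; the ReLU placement and input size are illustrative choices:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cac_block(x, kernels, dilations, channels):
    """Compact atrous convolution: e.g. cac_block(x, (3, 5), (1, 3), 64)
    realizes CAC(3, 5; 64, 64) with r = (1, 3), RFC = 15 (zoom-out)."""
    for k, r in zip(kernels, dilations):
        x = layers.Conv2D(channels, k, dilation_rate=r,
                          padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
    return x

inp = tf.keras.Input(shape=(240, 320, 512))
out = cac_block(inp, kernels=(3, 5), dilations=(1, 3), channels=64)
model = tf.keras.Model(inp, out)
model.summary()
```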
Because the kernel sizes k face certain constraints (formula 3-4 together with the small odd-size requirement), the attainable RFC of a CAC is correspondingly restricted. For example, a CAC with RFC_0 = 13 cannot be designed. In this case an approximate receptive field can be substituted, e.g., RFC_1 = 3 × 4 ≈ RFC_0, i.e., k_1 = 3, k_2 = 4.
2.3 CompactASPP
To overcome the kernel degradation commonly present in multi-scale modules such as ASPP and FPM, the invention extends the CAC to a multi-scale space and proposes a compact sampling module, CompactASPP (Compact Atrous Spatial Pyramid Pooling). CompactASPP consists of 5 parallel groups of CACs with different receptive fields. According to the information aggregation mode, it is further divided into CompactASPP_i and CompactASPP_o: CompactASPP_i is based on the zoom-in focusing strategy, where the dilation rates of all CACs are arranged in decreasing order (see r_i in Fig. 6); CompactASPP_o is based on the zoom-out focusing strategy, where they are arranged in increasing order (see r_o in Fig. 6). For a fair comparison, the receptive fields of the CACs in CompactASPP are chosen to match the multi-scale receptive fields of the FPM one-to-one. As shown in Fig. 6, the 5 groups of CACs produce features at 5 different scales, which are aggregated along the channel dimension and fed into BN and 2D Dropout layers to increase generalization capability.
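An assumed Keras sketch of CompactASPP_o: 5 parallel CACs with different receptive fields, concatenated and passed through BN and 2D Dropout. The specific receptive fields (1, 3, 9, 15, 25), channel width, and dropout rate are illustrative placeholders; the patent matches the receptive fields to those of the FPM:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cac(x, kernels, dilations, channels=64):
    for k, r in zip(kernels, dilations):
        x = layers.Conv2D(channels, k, dilation_rate=r, padding='same',
                          activation='relu')(x)
        x = layers.BatchNormalization()(x)
    return x

def compact_aspp_o(x):
    branches = [
        cac(x, (1,), (1,)),        # RFC 1
        cac(x, (3,), (1,)),        # RFC 3
        cac(x, (3, 3), (1, 3)),    # RFC 9,  increasing rates (zoom-out)
        cac(x, (3, 5), (1, 3)),    # RFC 15
        cac(x, (5, 5), (1, 5)),    # RFC 25
    ]
    y = layers.Concatenate(axis=-1)(branches)
    y = layers.BatchNormalization()(y)
    return layers.SpatialDropout2D(0.25)(y)

inp = tf.keras.Input(shape=(30, 40, 512))
model = tf.keras.Model(inp, compact_aspp_o(inp))
```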
Parameter control principles. Considering that the input feature map of a CAC usually has many channels (e.g., 512) while its output has few (e.g., 64), the CACs are designed according to the following 2 principles to control module size and keep computation efficient: (1) small kernels are placed preferentially at the bottom layers of the CAC; (2) the number of output channels of every convolution layer in the CAC matches that of the last layer. For example, with 512/64 input/output channels and RFC = 15, the CAC is designed as a cascade of 2 atrous convolution layers. Following the above principles, each layer outputs 64 channels, the bottom kernel is 3 × 3 and the top kernel is 5 × 5, written CAC(3, 5; 64, 64). Its parameter count is about 388k (512 × 3 × 3 × 64 + 64 × 5 × 5 × 64), far smaller than the roughly 7M (512 × 15 × 15 × 64) parameters of a standard convolution with the same receptive field.
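A quick check of the parameter comparison above (bias terms ignored, as in the text):

```python
cac_params = 512 * 3 * 3 * 64 + 64 * 5 * 5 * 64
std_params = 512 * 15 * 15 * 64
print(cac_params)               # 397312  (~388k, as stated)
print(std_params)               # 7372800 (~7M)
print(std_params / cac_params)  # ~18.6x larger
```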
3 Fast X-Net
As shown in Fig. 7, Fast X-Net is likewise an X-shaped architecture consisting of an encoding sub-network, a decoding sub-network, and a fusion sub-network that integrates temporal features. Its basic network units include: 3 × 3 convolution layers (conv), 2 × 2 max pooling layers (max pooling), 1 × 1, 3 × 3, and 5 × 5 transposed convolution layers (Tconv), random dropout layers (dropout), ReLU and sigmoid activation functions, and the CompactASPP multi-scale feature encoding module.
Fast X-Net removes the image pyramid strategy used by the X-Net encoding sub-network and embeds CompactASPP at the top of the encoding sub-network. The encoding sub-network then needs to run only once to extract multi-scale spatial-domain features from a single image, whereas the image pyramid strategy requires running the encoder multiple times for the same task. As shown in Fig. 7, the encoding sub-network of Fast X-Net (including the CompactASPP module) consists of 2 twin branches with identical structure and shared weights, which extract features of the same pattern from two similar frames. The multi-scale feature representations extracted from two consecutive frames are aggregated along the channel dimension and fed into the fusion sub-network.
The fusion sub-network is a single-stream structure. It fuses the multi-scale spatial-domain features extracted from the 2 frames to realize temporal-spatial feature encoding, and the resulting spatio-temporal multi-scale features enter the decoding sub-network for decoding.
The decoding sub-network consists of two structurally identical but mutually independent branches, each producing one foreground mask at a time (right side of Fig. 7). Note that Fast X-Net also removes the cross-layer connections between the encoder and decoder of X-Net to further improve computational efficiency.
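A coarse, assumed Keras sketch of the Fast X-Net layout just described: twin shared-weight encoders (topped by CompactASPP), a single-stream fusion sub-network, and two independent decoder branches. Layer widths, depths, and the CompactASPP internals are illustrative placeholders, not the patented configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(shape):
    inp = tf.keras.Input(shape=shape)
    x = inp
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)
    # CompactASPP would sit here, on top of the encoder (see above).
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    return tf.keras.Model(inp, x, name='twin_encoder')

def build_decoder(x, name):
    for filters in (256, 128, 64):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding='same',
                                   activation='relu')(x)
    return layers.Conv2D(1, 1, activation='sigmoid', name=name)(x)

shape = (240, 320, 3)
frame_a, frame_b = tf.keras.Input(shape=shape), tf.keras.Input(shape=shape)
encoder = build_encoder(shape)          # one instance -> shared weights
feats = layers.Concatenate()([encoder(frame_a), encoder(frame_b)])
fused = layers.Conv2D(128, 3, padding='same', activation='relu')(feats)
masks = [build_decoder(fused, 'mask_a'), build_decoder(fused, 'mask_b')]
fast_xnet = tf.keras.Model([frame_a, frame_b], masks)
```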
Analysis of experiments
To verify the performance of the multi-scale feature encoding module CompactASPP, experiments were performed on multiple datasets. The experimental settings are introduced first; an ablation study is then carried out on the CDnet 2014 dataset to verify the effectiveness and advancement of the algorithm; finally, supplementary experiments are conducted on the SBI2015 dataset.
Experimental setup
Model training and optimization. End-to-end training is carried out on the Fast X-Net network shown in Fig. 7. To exploit high-level semantic knowledge and improve training efficiency, the encoding sub-network is initialized with VGG-16 pre-trained on ImageNet. The experiments use the Keras framework with a TensorFlow backend as the deep learning platform and optimize the model based on SFL. Parameters are updated with the RMSProp optimizer, with epsilon and the initial learning rate set to 1e-8 and 1e-4, respectively, and batch size set to 1. The maximum number of training epochs per scene is 60 and the early-stop threshold is 10. It should be emphasized that no gradient back-propagation is performed for non-region-of-interest (NON-ROI) and uncertain boundary regions.
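An assumed sketch of the training configuration quoted above (RMSProp, lr = 1e-4, epsilon = 1e-8, batch size 1, 60 epochs, early-stop patience 10). The loss is a placeholder, since the patent optimizes with SFL, for which no standard Keras implementation is assumed here:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4, epsilon=1e-8)
early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)
# model.compile(optimizer=optimizer, loss='binary_crossentropy')
# model.fit(train_pairs, epochs=60, batch_size=1, callbacks=[early_stop])
```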
Training sample selection. For the CDnet 2014 dataset, model performance is evaluated under both the 50-frame (m = 50) and 20-frame (m = 20) sample settings. For 50-frame samples, the data provided by FgSegNet are adopted: 80% of the samples are randomly selected to construct the training set (40 frames), 20% are used for model validation (10 frames), and the remaining samples of the whole video serve as test samples. Manually selecting training (and validation) data at random from the entire video sequence is common practice in academia, but taking the whole sequence as the sampling domain is somewhat unreasonable: many foreground instances in the test samples may have "appeared" in the training samples, which can lead to overestimating model performance. In actual deployment, by contrast, foreground instances in the video are typically "unseen" instances, and model performance naturally suffers. To avoid such overestimation, the 20-frame experiment adopts a harder setting: a sub-sequence near the start or end of the whole video sequence is chosen as the sampling domain whenever possible, and sampling is performed at equal intervals to reduce the influence of human factors. For example, the complete video sequence of the highway scene is [470, 1700] (the remaining frames are not considered because the dataset provides no ground truth for them), and the 20-frame experiment selects samples at an interval of 12 starting from frame 1424; the specific sample sequence is [1424, 1436, 1448, …, 1652].
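The 20 equally spaced highway-scene sample indices quoted above can be reproduced as follows (a trivial check, not code from the patent):

```python
samples = [1424 + 12 * i for i in range(20)]
print(samples[0], samples[-1], len(samples))  # 1424 1652 20
```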
Training set construction. Unlike single-frame-input models, Fast X-Net is a "paired-input" network and requires paired frames to construct the training set. Given m frames, the training set is constructed with the following strategy (a counting sketch follows this list):
a. Sort the m frames in ascending order and renumber them 1, 2, …, m.
b. To generate "frame pairs", match each frame with its neighbors within a window of length 2 × interval; all "frame pairs" satisfying the condition constitute the training set.
For the case of few labeled samples, the experiments set interval to 6 to enlarge the training set. In this setting, the first three frames and the last three frames can match 4, 5, and 6 "frame pairs" respectively, and each intermediate frame can match 7. When m = 20, the training set contains 128 "frame pairs" (4 + 5 + 6 + 14 × 7 + 6 + 5 + 4).
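The sketch below is one reading of this pairing rule (an assumption of this rewrite): each frame is paired with every frame within ±interval/2 of it, self-pairs included. Under that reading, m = 20 reproduces exactly the per-frame counts and the total of 128 stated above:

```python
def build_frame_pairs(m: int, half_window: int):
    """Pair every frame with all frames within +/- half_window of it,
    self-pairs included -- a hypothetical reading of the windowing rule."""
    pairs = []
    for i in range(1, m + 1):
        for j in range(max(1, i - half_window), min(m, i + half_window) + 1):
            pairs.append((i, j))
    return pairs

pairs = build_frame_pairs(m=20, half_window=3)  # half_window = interval / 2
print(len(pairs))  # 128 = 4 + 5 + 6 + 14*7 + 6 + 5 + 4
```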
Results and analysis of the experiments
CDnet 2014 dataset experiments. As shown in Table 1, the average recall (Re) and average precision (Pr) of Fast X-Net in the 50-frame experiment both exceed 97%, and the overall performance (F-measure) reaches 0.976. The average F-measure of the BL category is the highest of all categories, reaching 0.995, while that of the LF category is the lowest but still exceeds 0.9. The main reason is that LF contains a very challenging scene with very few, very small foreground objects and a dynamically changing background highly similar to the foreground features, all of which cause serious interference. It should be emphasized that the model performance indices in Table 1 are computed on test samples only; the 50 frames used for training and validation take no part in the performance evaluation.
Table 1. Fast X-Net test results based on 50-frame samples on the CDnet 2014 dataset.
[Table reproduced as an image in the original publication.]
Comparison with advanced algorithms. Fast X-Net is compared with 7 other high-performance methods. As shown in Table 2, X-Net, FgSegNet_M, and Cascade CNN employ an image pyramid strategy to extract multi-scale information; FgSegNet_S and FgSegNet_v2 implement multi-scale feature encoding with an ASPP-like feature pyramid module; 3D SegNet extracts spatio-temporal features with 3D spatio-temporal filtering; IUTIS-5 is a traditional unsupervised ensemble method that combines the outputs of several high-performance traditional background modeling methods through a genetic algorithm. It should be emphasized that 3D SegNet uses 70% of the samples as its training set, while the other supervised methods use the same 50-frame samples for training.
Table 2. Comparison of F-measure performance between Fast X-Net and other high-performance methods.
[Table reproduced as an image in the original publication.]
Among these, Cascade CNN, FgSegNet_M, and X-Net (ours) use the image pyramid strategy, while FgSegNet_S, FgSegNet_v2, and Fast X-Net (ours) use the feature pyramid strategy.
In the 50-frame experiments, the deep-network-based supervised methods clearly outperform the traditional unsupervised method based on low-level features (IUTIS-5), especially in the extremely challenging categories such as PTZ, NV, and LF. Moreover, the proposed Fast X-Net outperforms the other existing methods by a clear margin. It should be emphasized that FgSegNet_S is the baseline comparison method of the invention: like Fast X-Net, it models multi-scale features with a feature pyramid strategy, and its multi-scale receptive fields are essentially the same. In the 20-frame experiments, the performance advantage of Fast X-Net widens further: as shown in Table 3, the F-measure of Fast X-Net increased by 2.3% and 1.5% compared with X-Net, FgSegNet_S, and FgSegNet_v2. In terms of speed, the foreground segmentation speed of CompactASPP-based Fast X-Net is clearly superior to that of the image pyramid methods, improving on X-Net and FgSegNet_M by 63.6% and 50%, respectively.
Table 3. Comparison of the overall performance of Fast X-Net and 4 high-performance methods (20 frames); speed tests are based on an NVIDIA 1080 Ti GPU at a video resolution of 320 × 240.
[Table reproduced as an image in the original publication.]
Visualization results. To demonstrate the performance of Fast X-Net and its CompactASPP module more intuitively, some qualitative visualization results are provided. As shown in Fig. 8, the proposed method is robust to foreground scale variation and shows high recall for large, medium, and small foreground objects. In the nightVideos scenes, Fast X-Net overcomes the challenge of sudden illumination changes better than FgSegNet_S and therefore produces fewer false positive samples, because the kernel degradation problem is avoided. In Fig. 8, a: input frame, b: ground truth, c: Fast X-Net, d: FgSegNet_v2, e: FgSegNet_S, f: X-Net.
CompactASPP ablation experiments. Two groups of comparison experiments were designed to further verify the performance of the CompactASPP module. In the first, the entire multi-scale module is removed from Fast X-Net; the modified network, called Fast X-Net_baseline, tests the performance of the network without a multi-scale model. In the second, the CompactASPP module is replaced by an ASPP module with receptive fields of the same size; the modified network is called Fast X-Net_aspp. As shown in Table 4, the ASPP module yields an average F-measure of 0.974 in the 50-frame experiment, a 0.5% improvement over the multiscale-free Fast X-Net_baseline, and CompactASPP brings a further 0.2% improvement on that basis. The corresponding improvements are more pronounced in the 20-frame experiment, where the CompactASPP module gives the network a 0.4% gain over the ASPP module. The visualization results in Fig. 9 indicate that the compact sampling pattern helps correct otherwise misclassified positive samples; in Fig. 9, a: input frame, b: ground truth, c: Fast X-Net_baseline, d: Fast X-Net_aspp, e: Fast X-Net_i. In addition, CompactASPP_i and CompactASPP_o show similar performance, which means multi-scale features can be extracted effectively with either of the two proposed compact sampling strategies. Unless stated otherwise, the experimental results for Fast X-Net are mainly based on CompactASPP_i.
Table 4. Multi-scale module comparison experiments.
[Table reproduced as an image in the original publication.]
SBI2015 dataset experiments. To further verify the advancement of the Fast X-Net algorithm, supplementary experiments were performed on the SBI2015 and UCSD datasets. SBI2015 contains 14 videos and provides ground truth for the entire video. For ease of comparison, the same training setup and training samples as FgSegNet are used: 16% of all samples are used for model training, 4% for model validation, and the remaining 80% constitute the test set. As shown in Table 5, Fast X-Net beats the other 4 high-performance algorithms and achieves the best performance. It is worth emphasizing that FgSegNet_v2 had already obtained a very high F-measure (0.984), which the proposed method improves by a further 0.5%. Among all scenes, Toscana's performance is the lowest (0.962), mainly because that video contains only 6 frames, of which Fast X-Net uses only 2 for model training; with so few training samples, the model inevitably overfits. Nevertheless, 0.962 is still acceptable performance, which also shows the robustness of Fast X-Net in small-sample learning.
Table 5. Fast X-Net test results on the SBI2015 dataset and comparison with high-performance methods.
[Table reproduced as an image in the original publication.]
Fig. 10 shows all training frames (column a) and a typical test frame (column b). In Fig. 10, a: all training frames, b: typical test frame, c: ground truth, d: foreground mask. A large number of false positive samples appear in the red circle of column d because the corresponding region of all training frames in column a is foreground; the model never "saw" the background information of that region and therefore misjudges it.

Claims (10)

1. A compact multi-scale video foreground segmentation method is characterized by comprising the following steps:
let x[i] and y[i] denote the input signal and the output signal, respectively; the atrous (hole) convolution operation is defined as

y[i] = Σ_{idx=1}^{K} x[i + r·idx]·f[idx]   (3-1)

wherein f[idx] is a filter of length K and the dilation rate r represents the corresponding sampling stride; when r = 1, the atrous convolution degenerates to the standard convolution operation; when the k × k kernel of a 2D atrous convolution samples/convolves a region of size k_a × k_a of the input feature x, the receptive field of the atrous convolution is said to be k_a, with

k_a = k + (r − 1)·(k − 1)   (3-2)

a larger dilation rate means a larger receptive field; to obtain a wider receptive field and richer context information, a plurality of atrous convolutions are applied, in parallel or in cascade, to a high-level feature map that has undergone a series of convolution and downsampling operations;
suppose CAC_n denotes a compact atrous convolution composed of n cascaded atrous convolution layers whose kernel sizes and dilation rates are {k_1, k_2, …, k_n} and {r_1, r_2, …, r_n}, respectively; in CAC_n, the information of any neuron in the output feature map derives from the input feature map within its corresponding receptive field, with neither information omission nor "overlapping sampling";

the receptive field of the compact atrous convolution is denoted RFC; under the condition of no information omission and no "overlapping sampling", the relation between the RFC and the per-layer kernels satisfies formulas 3-3 and 3-4:

RFC = k_1·k_2·…·k_n   (3-3)

r_i = k_{i+1}·k_{i+2}·…·k_n (zoom-in, decreasing rates)  or  r_i = k_1·k_2·…·k_{i−1} (zoom-out, increasing rates)   (3-4)

according to the size of the RFC, CAC_n is designed by determining the number of convolution layers n, each layer's kernel size k, and its dilation rate r.
2. The compact multi-scale video foreground segmentation method of claim 1, characterized in that: when RFC ≤ 5, a single convolution layer meets the receptive-field requirement; in this case k_1 = RFC and r_1 = 1, and the compact atrous convolution degenerates to a standard convolution layer; as RFC increases, the compact atrous convolution inevitably takes a multi-layer cascaded form; to ensure that the output neuron lies at the center of the receptive field, kernel sizes are always chosen to be odd.
3. The compact multi-scale video foreground segmentation method of claim 1, characterized in that: when 5 < RFC ≤ 25, two cascaded convolution layers satisfying k_1 × k_2 = RFC are used;
1) r_1 = k_2, r_2 = 1: since this arrangement uses decreasing dilation rates, the region perceived by the convolution contracts gradually, i.e., the "zoom-in focusing" strategy; the information of the input features within the RFC range is first "squeezed" to a k_2 × k_2 region and then further focused to the central neuron;
2) r_1 = 1, r_2 = k_1: since this arrangement uses increasing dilation rates, the region perceived by the neuron expands gradually, i.e., the "zoom-out focusing" strategy; k_1 × k_1 local information is collected first, and then the information of the k_1 × k_1 regions at different positions across the whole RFC is concentrated onto the central neuron.
4. The compact multi-scale video foreground segmentation method of claim 3, characterized in that: when RFC > 25, CAC_n is constructed recursively, wherein n ≥ 3; CAC_n can be regarded as two parts: the nth (or 1st) atrous convolution layer, and the front n−1 (or rear n−1) layers forming a CAC_{n−1}; wherein CAC_{n−1} is treated as an ordinary atrous convolution whose kernel size is k_1·k_2·…·k_{n−1} (resp. k_2·k_3·…·k_n) and whose dilation rate is 1; in this case, the nth (or 1st) atrous convolution layer is cascaded with CAC_{n−1} based on the zoom-out (or zoom-in) focusing strategy; as in a two-level atrous convolution cascade, the receptive field of CAC_n is RFC = k_1·k_2·…·k_n; based on this recursion, the CAC_n formed by the n-level atrous convolution cascade achieves compact sampling of features in the receptive field as a whole; CAC_n is denoted CAC(k_1, k_2, …, k_n; c_1, c_2, …, c_n), r = (r_1, r_2, …, r_n), wherein k_i, c_i, and r_i respectively denote the kernel size, number of output channels, and dilation rate of the ith layer.
5. The compact multi-scale video foreground segmentation method of claim 4, characterized in that: the compact atrous convolution is extended to a multi-scale space by a multi-scale compact sampling module consisting of 5 groups of parallel compact atrous convolutions with different receptive fields; the compact atrous convolutions are divided into CompactASPP_i and CompactASPP_o; CompactASPP_i is based on the zoom-in focusing strategy, with the dilation rates of the compact atrous convolutions arranged in decreasing order; CompactASPP_o is based on the zoom-out focusing strategy, with the dilation rates arranged in increasing order.
6. The compact multi-scale video foreground segmentation method of claim 5, characterized in that: the input feature map of the compact atrous convolution has many channels while the output has few; to control module size and ensure computational efficiency, small kernels are placed preferentially at the bottom layers of the compact atrous convolution, and the number of output channels of every convolution layer matches that of the last layer.
7. The compact multi-scale video foreground segmentation method of claim 1, characterized in that: Fast X-Net is an X-shaped framework formed by an encoding sub-network, a decoding sub-network, and a fusion sub-network that integrates temporal features; its basic network units comprise: 3 × 3 convolution layers, 2 × 2 max pooling layers, 1 × 1, 3 × 3, and 5 × 5 transposed convolution layers (Tconv), random dropout layers, ReLU and sigmoid activation functions, and the multi-scale feature encoding module.
8. The compact multi-scale video foreground segmentation method of claim 7, characterized in that: Fast X-Net removes the image pyramid strategy used by the X-Net encoding sub-network and embeds the multi-scale compact sampling module at the top of the encoding sub-network; multi-scale spatial-domain features can be extracted efficiently from a single image by running the encoding sub-network only once; the encoding sub-network of Fast X-Net consists of 2 twin branches with identical structure and shared weights, which extract features of the same pattern from two similar frames; the multi-scale features extracted from two consecutive frames are aggregated along the channel dimension and fed into the fusion sub-network.
9. The compact multi-scale video foreground segmentation method of claim 7, characterized in that: the fusion sub-network is a single-stream structure that fuses the multi-scale spatial-domain features extracted from the 2 frames to realize temporal-spatial feature encoding; the generated spatio-temporal multi-scale features enter the decoding sub-network for decoding.
10. The compact multi-scale video foreground segmentation method of claim 7, characterized in that: the decoding sub-network consists of two structurally identical, mutually independent branches, each of which produces one foreground mask at a time.
CN202110729146.XA 2021-06-29 2021-06-29 Compact multi-scale video foreground segmentation method Pending CN113592878A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729146.XA CN113592878A (en) 2021-06-29 2021-06-29 Compact multi-scale video foreground segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729146.XA CN113592878A (en) 2021-06-29 2021-06-29 Compact multi-scale video foreground segmentation method

Publications (1)

Publication Number Publication Date
CN113592878A true CN113592878A (en) 2021-11-02

Family

ID=78245109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729146.XA Pending CN113592878A (en) 2021-06-29 2021-06-29 Compact multi-scale video foreground segmentation method

Country Status (1)

Country Link
CN (1) CN113592878A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310894A (en) * 2023-02-22 2023-06-23 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope
CN116958535A (en) * 2023-04-14 2023-10-27 三峡大学 Polyp segmentation system and method based on multi-scale residual error reasoning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN112446890A (en) * 2020-10-14 2021-03-05 浙江工业大学 Melanoma segmentation method based on void convolution and multi-scale fusion

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446890A (en) * 2020-10-14 2021-03-05 浙江工业大学 Melanoma segmentation method based on void convolution and multi-scale fusion
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIN ZHANG et al.: "A fast X-shaped foreground segmentation network with CompactASPP", Engineering Applications of Artificial Intelligence *
JIN ZHANG et al.: "X-Net: A Binocular Summation Network for Foreground Segmentation", IEEE *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310894A (en) * 2023-02-22 2023-06-23 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope
CN116310894B (en) * 2023-02-22 2024-04-16 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope
CN116958535A (en) * 2023-04-14 2023-10-27 三峡大学 Polyp segmentation system and method based on multi-scale residual error reasoning
CN116958535B (en) * 2023-04-14 2024-04-16 三峡大学 Polyp segmentation system and method based on multi-scale residual error reasoning

Similar Documents

Publication Publication Date Title
CN110120011B (en) Video super-resolution method based on convolutional neural network and mixed resolution
CN108596330B (en) Parallel characteristic full-convolution neural network device and construction method thereof
CN111582316B (en) RGB-D significance target detection method
CN109360156A (en) Single image rain removing method based on the image block for generating confrontation network
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN110580704A (en) ET cell image automatic segmentation method and system based on convolutional neural network
CN111488932B (en) Self-supervision video time-space characterization learning method based on frame rate perception
CN113592878A (en) Compact multi-scale video foreground segmentation method
Li et al. End-to-end learning of deep convolutional neural network for 3D human action recognition
Zou et al. Crowd counting via hierarchical scale recalibration network
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN113378775B (en) Video shadow detection and elimination method based on deep learning
CN110738663A (en) Double-domain adaptive module pyramid network and unsupervised domain adaptive image segmentation method
CN113065645A (en) Twin attention network, image processing method and device
CN110706239A (en) Scene segmentation method fusing full convolution neural network and improved ASPP module
CN112767466A (en) Light field depth estimation method based on multi-mode information
CN113538243B (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
CN114638836A (en) Urban street view segmentation method based on highly effective drive and multi-level feature fusion
CN114663309A (en) Image defogging method and system based on multi-scale information selection attention mechanism
Jo et al. Multi-scale selective residual learning for non-homogeneous dehazing
CN114519383A (en) Image target detection method and system
CN114596233A (en) Attention-guiding and multi-scale feature fusion-based low-illumination image enhancement method
CN113362239A (en) Deep learning image restoration method based on feature interaction
CN113313030B (en) Human behavior identification method based on motion trend characteristics
Yun et al. Coarse-to-fine video denoising with dual-stage spatial-channel transformer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20211102