CN113592878A - Compact multi-scale video foreground segmentation method - Google Patents

Compact multi-scale video foreground segmentation method

Info

Publication number
CN113592878A
CN113592878A
Authority
CN
China
Prior art keywords
convolution
compact
scale
layer
rfc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110729146.XA
Other languages
Chinese (zh)
Inventor
潘志松
张锦
李阳
潘欣冉
周星宇
贺正芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202110729146.XA priority Critical patent/CN113592878A/en
Publication of CN113592878A publication Critical patent/CN113592878A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

A compact multi-scale video foreground segmentation method relates to the technical field of computer vision. From the perspective of high-level (i.e., large-scale) and multi-scale feature encoding, a multi-scale compact sampling module is proposed to improve the encoding of scene spatial-domain features by deep networks. The module consists of a series of parallel compact atrous (hole) convolutions with different receptive fields and captures multi-scale features in a compact manner to combat the kernel degradation problem. Specifically, each compact atrous convolution is carefully designed as a cascade structure that covers all input neurons in its receptive field exactly once, leaving neither holes nor overlap, so the proposed multi-scale compact sampling module can perceive more complete multi-scale information over different receptive fields without significantly increasing the number of model parameters. The module therefore avoids kernel degradation while maintaining high running efficiency.

Description

Compact multi-scale video foreground segmentation method
Technical Field
The invention relates to the technical field of computer vision, and in particular to a pixel-level classification task in that field: video foreground segmentation.
Background
Video foreground segmentation is a basic pixel-level classification task in computer vision. Given a scene S, a foreground segmentation algorithm learns a representation of S that separates the background from foreground moving objects in a video sequence. The extracted foreground can provide a good compromise between detection quality and computation time for complex vision applications. As a preprocessing step for higher-level tasks, video foreground segmentation therefore has wide real-world application value, including anomaly detection (e.g., abandoned-object detection, product defect detection, and fire detection), vehicle counting, tracking, and accident detection, ship and maritime traffic monitoring, visual observation of animal behavior, visual monitoring of natural environments (e.g., floating-object detection), human behavior analysis, and background replacement. Since the precision of this preprocessing step strongly influences the performance and efficiency of subsequent tasks, it is important to learn an effective scene representation that extracts an accurate foreground target.
Video foreground segmentation must extract foreground moving objects of varying sizes from the background: as a foreground object approaches the camera from far to near, its size in the scene grows from small to large. A robust method therefore needs to segment scene targets accurately at different scales. Encoding a multi-scale spatial-domain representation of the scene is a central concern of foreground segmentation network design, since it lets the model reason comprehensively over context at different scales. Within multi-scale spatial-domain encoding, the difficulty lies in encoding the large-scale spatial features.
Methods based on Fully Convolutional Networks (FCNs) enlarge the receptive field of neurons by using downsampling layers (convolution or pooling operations with stride greater than 1) to encode large-scale spatial features. Large-scale context supports semantic inference from the overall appearance of an object and avoids "blind men and the elephant"-style local inference. However, stacking more downsampling layers loses spatial information that the decoding process cannot recover. Decoding from an earlier, higher-resolution stage of the encoder is not a good strategy either, because it forgoes inference over higher-level semantics. In short, network design must balance preserving complete spatial information against encoding higher-level features.
In recent years, atrous convolution (also called dilated or hole convolution) has served as an effective strategy for resolving the conflict between high resolution and large-scale feature encoding. Because the kernel of an atrous convolution enlarges its field of view by inserting "holes" between its parameters through a dilation strategy, it can perceive large-scale contextual information without excessive downsampling. Atrous convolution nevertheless has two limitations. The first is the kernel degradation problem: as the dilation rate increases, the kernel's samples within the receptive field become increasingly sparse, degrading the kernel's performance. The second is the single-scale problem: in a feature map generated by one atrous convolution, the information of every neuron derives from the same receptive field, so the semantic encoding process is effectively limited to a single scale, whereas foreground objects in a scene typically occur at multiple scales.
To obtain a multi-scale scene representation, feature pyramid strategies (e.g., ASPP) extract multi-scale features with several groups of parallel atrous convolutions, but because they usually contain atrous convolutions with large dilation rates, they are also limited by the kernel degradation problem.
Disclosure of Invention
The invention aims to provide a compact multi-scale video foreground segmentation method. It designs a new convolution module and provides two strategies for constructing it: a zoom-in focusing strategy (decreasing dilation rates) and a zoom-out focusing strategy (increasing dilation rates). The module encodes multi-scale features compactly, which counteracts kernel degradation and improves segmentation precision.
A compact multi-scale video foreground segmentation method: let x[i] and y[i] denote the input signal and the output signal, respectively. The atrous (hole) convolution operation is defined as

y[i] = Σ_{idx=1}^{K} x[i + r·idx]·f[idx]   (3-1)

where f[idx] is a filter of length K and the dilation rate r represents the corresponding sampling stride. When r = 1, the atrous convolution degenerates to the standard convolution operation. When the k × k kernel of a 2D atrous convolution samples/convolves a region of size k_a × k_a of the input feature x, the receptive field of the atrous convolution is said to be k_a, where

k_a = k + (r − 1)·(k − 1)   (3-2)

A larger dilation rate therefore means a larger receptive field. To obtain a wider receptive field and richer context information, a plurality of atrous convolutions are applied, in parallel or in cascade, to a high-level feature map that has already undergone a series of convolution and downsampling operations.
suppose CACnThe expression is composed of n cascaded void convolution layers, and the convolution kernel size and expansion ratio of each layer are respectively { k1,k2,…,knAnd { r }1,r2,…,rn};CACnThe information for any neuron in the output signature map is derived from the input signature map within its corresponding receptive field, and there is neither information omission nor "oversampling";
the reception fields of the compact hole convolution layers are recorded as RFC, and under the condition of no information omission and 'overlapped sampling', the relation between the RFC and each layer convolution kernel satisfies the formulas 3-3 and 3-4.
RFC=k1k2…kn (3-3)
Figure RE-GDA0003279083270000031
According to the size of RFC, for CACnThe design is performed to determine the number of convolution layers n, the size k of each layer of convolution kernel, and its expansion rate r.
When RFC ≤ 5, a single convolution layer meets the receptive-field requirement: k_1 = RFC and r_1 = 1, and the compact atrous convolution degenerates to a standard convolution layer. As RFC grows, the compact atrous convolution must take a multi-layer cascaded form; to keep the output neuron at the center of the receptive field, kernel sizes are always chosen to be odd.
When 5 < RFC ≤ 25, two cascaded convolution layers satisfying k_1 × k_2 = RFC are used:
1) r_1 = k_2, r_2 = 1: since this arrangement uses decreasing dilation rates, the region perceived by the convolution contracts gradually, i.e., the "zoom-in focusing" strategy; the information of the input features within the RFC range is first "squeezed" to a k_2 × k_2 region and then further focused to the central neuron;
2) r_1 = 1, r_2 = k_1: since this arrangement uses increasing dilation rates, the region perceived by the neuron expands gradually, i.e., the "zoom-out focusing" strategy; k_1 × k_1 local information is collected first, and then the information of the k_1 × k_1 regions at different positions across the whole RFC is concentrated onto the central neuron.
When RFC > 25, CAC_n (n ≥ 3) is constructed recursively. CAC_n can be regarded as two parts: the nth (or 1st) atrous convolution layer, and the front n−1 (or rear n−1) layers forming a CAC_{n−1}. Here CAC_{n−1} is treated as an ordinary atrous convolution whose kernel size is k_1·k_2·…·k_{n−1} (resp. k_2·k_3·…·k_n) and whose dilation rate is 1. The nth (or 1st) atrous convolution layer is then cascaded with this CAC_{n−1} according to the zoom-out (or zoom-in) focusing strategy. As in the two-level atrous cascade, the receptive field of CAC_n is RFC = k_1·k_2·…·k_n. By this recursion, the n-level cascade CAC_n as a whole achieves compact sampling of the features within its receptive field. CAC_n is written CAC(k_1, k_2, …, k_n; c_1, c_2, …, c_n), r = (r_1, r_2, …, r_n), where k_i, c_i, and r_i denote the kernel size, number of output channels, and dilation rate of the ith layer, respectively.
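For illustration, the construction rules above can be expressed as the following sketch (an illustrative assumption of this description, not code from the patent); the function design_cac and its factorization order are hypothetical, and kernel sizes are restricted to odd values no larger than 5 as stated in the text:

```python
from math import prod

def design_cac(rfc, zoom='in'):
    """Derive (kernels, dilations) of a compact atrous convolution whose
    receptive field is exactly `rfc`, following formulas 3-3 and 3-4."""
    # Factor rfc into odd kernel sizes no larger than 5 (formula 3-3).
    kernels, rest = [], rfc
    for k in (5, 3):
        while rest % k == 0 and rest > 1:
            kernels.append(k)
            rest //= k
    if rest != 1:
        raise ValueError(f'RFC={rfc} has no odd-factor (<=5) decomposition')
    kernels.sort()  # small kernels preferentially at the bottom
    n = len(kernels)
    # Formula 3-4: zoom-in uses decreasing rates, zoom-out increasing.
    if zoom == 'in':
        dilations = [prod(kernels[i + 1:]) for i in range(n)]
    else:
        dilations = [prod(kernels[:i]) for i in range(n)]
    return kernels, dilations

print(design_cac(15, 'in'))   # ([3, 5], [5, 1])  -> r1 = k2, r2 = 1
print(design_cac(15, 'out'))  # ([3, 5], [1, 3])  -> r1 = 1,  r2 = k1
print(design_cac(45, 'out'))  # ([3, 3, 5], [1, 3, 9])
```

Consistent with the text, design_cac(13) raises an error, since 13 admits no decomposition into odd factors no larger than 5.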
Preferably, the compact atrous convolution is extended to a multi-scale space by a multi-scale compact sampling module consisting of 5 groups of parallel compact atrous convolutions with different receptive fields. The module comes in two variants, CompactASPP_i and CompactASPP_o: CompactASPP_i is based on the zoom-in focusing strategy, with the dilation rates of each compact atrous convolution arranged in decreasing order; CompactASPP_o is based on the zoom-out focusing strategy, with the dilation rates arranged in increasing order.
Preferably, since the input feature map of a compact atrous convolution has many channels while its output has few, and in order to control module size and keep computation efficient, small kernels are placed preferentially at the bottom layers of the compact atrous convolution, and every layer's number of output channels matches that of the last layer.
From the perspective of high-level (i.e., large-scale) and multi-scale feature encoding, the invention provides a more effective module, the multi-scale compact sampling module CompactASPP, to improve a deep network's encoding of scene spatial-domain features. The module consists of a series of parallel Compact Atrous Convolutions (CACs) with different receptive fields. CompactASPP captures multi-scale features in a compact manner to cope with kernel degradation. Specifically, each CAC is designed as a cascade structure that covers all input neurons in its receptive field completely, leaving neither holes nor overlap, so the proposed CompactASPP can perceive more complete multi-scale information over different receptive fields without significantly increasing model parameters. CompactASPP thus avoids kernel degradation while maintaining running efficiency.
A Fast X-Net network is designed based on CompactASPP as an improvement of X-Net. The multiple-input, multiple-output architecture of X-Net exploits temporal information effectively, but extracting multi-scale spatial features with an image pyramid strategy makes it computationally expensive. To pursue faster segmentation, the CompactASPP module is embedded in the X-Net framework and the original image pyramid strategy is removed, which improves processing speed by 63.6%. Fast X-Net also achieves higher segmentation accuracy.
The invention introduces the concept of compact sampling to design a new convolution module and provides two strategies for constructing it: the zoom-in and zoom-out focusing strategies;
the invention designs a new multi-scale module, CompactASPP, based on the new convolution module, and the module can compactly encode multi-scale features to solve the problem of nuclear degradation and improve the segmentation precision;
the invention provides a Fast X-Net method based on CompactASPP, which reaches a leading level on three data sets of CDnet 2014, SBI2015 and UCSD.
Drawings
Fig. 1a is a sampling schematic of standard convolution (dilation rate r = 1).
Fig. 1b is a sampling schematic of atrous convolution with dilation rate r = 4.
Fig. 2 is a schematic diagram of atrous spatial pyramid pooling (ASPP).
Fig. 3 is a schematic illustration of the "checkerboard effect" and "overlapping sampling" phenomena of cascaded atrous convolution.
Fig. 4 is a compact sampling schematic of cascaded atrous convolution.
Fig. 5 is a schematic diagram of constructing CAC_n from CAC_{n−1} and an nth convolution layer.
Fig. 6 is a schematic diagram of the construction of CompactASPP.
Fig. 7 is the Fast X-Net framework diagram.
Fig. 8 is a visualization of the 20-frame experiment on the CDnet 2014 dataset.
Fig. 9 is a visualization of multi-scale modules on a CDnet 2014 test frame.
Fig. 10 is a visualization of Fast X-Net in the Toscana scene.
Detailed Description
The compact atrous convolution is built from atrous convolution, and on this basis a compact multi-scale feature fusion module, CompactASPP, is proposed.
2.1 Atrous convolution
Atrous Convolution (AC), also known as dilated convolution, originated in the à trous wavelet decomposition algorithm. DeepLab v1 first deployed atrous convolution in the FCN framework, enlarging the receptive field and capturing longer-range context dependencies while reducing downsampling and keeping feature maps at high resolution. The one-dimensional case illustrates the operation. Let x[i] and y[i] denote the input and output signals, respectively; the atrous convolution operation can be defined as

y[i] = Σ_{idx=1}^{K} x[i + r·idx]·f[idx]   (3-1)

where f[idx] is a filter of length K and the dilation rate r represents the corresponding sampling stride. When r = 1, the atrous convolution degenerates to the standard convolution operation. When the k × k kernel of a 2D atrous convolution samples/convolves a region of size k_a × k_a of the input feature x, the receptive field of the atrous convolution is said to be k_a:

k_a = k + (r − 1)·(k − 1)   (3-2)

From formula 3-2 it is easy to see that a larger dilation rate implies a larger receptive field. To obtain a wider receptive field and richer context information, multiple atrous convolutions are typically applied, in parallel or in cascade, to a high-level feature map that has already undergone a series of convolution and downsampling operations.
In the parallel mode, multiple atrous convolution layers sample the same input feature map at different dilation rates to obtain receptive fields of different scales. This pattern is commonly called Atrous Spatial Pyramid Pooling (ASPP). However, because it usually contains atrous kernels with excessively large dilation rates (e.g., r = 6, 12, 18 in Fig. 2), it faces the kernel degradation problem. Specifically, the information of a neuron p in layer l derives from the k_a × k_a region of layer l−1 centered on p (the region contains k_a × k_a neurons), yet the number of neurons actually used is only k × k. For example, when k = 3 and r = 6, only 9 of the 169 neurons in the receptive-field region participate in the computation. Related studies show that as the dilation rate r increases, the sampling of an atrous kernel (e.g., 3 × 3) within its receptive field becomes sparser and sparser, and its feature-encoding capability weakens until it regresses to a 1 × 1 kernel; that is, the kernel weights at all positions other than the center approach 0.
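A minimal sketch (an assumption of this rewrite, not code from the patent) of formula 3-2 and the sparsity behind kernel degradation — an atrous kernel of size k with dilation rate r covers k_a × k_a positions but uses only k × k of them:

```python
def receptive_field(k: int, r: int) -> int:
    """Formula 3-2: effective receptive field of a k x k atrous kernel."""
    return k + (r - 1) * (k - 1)

for r in (1, 2, 6, 12, 18):
    ka = receptive_field(3, r)
    print(f'r={r:2d}: k_a={ka:2d}, sampling density = 9/{ka * ka}')
# r=6 gives k_a=13 and density 9/169, matching the example in the text.
```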
As shown in Fig. 3, in the cascade mode several atrous convolution layers with smaller dilation rates produce a large receptive field by stacking, which alleviates kernel degradation. However, when all layers use the same dilation rate r, the topmost output neurons sample in a "checkerboard effect" pattern and lose most of the information. Fig. 3a shows the "checkerboard effect" of cascaded atrous convolution and Fig. 3b its "overlapping sampling" phenomenon. Fig. 3a depicts the checkerboard effect produced by cascading two atrous convolution layers with k = 3 and r = 2: the information of the layer-l output neuron at the star position derives from the input features at the 9 dark-circle positions; recursing forward, the information of the 9 dark-circle neurons of the layer l−1 convolution derives from the input features at the dark-grid positions, where a deeper color indicates more samplings. Viewing the 2-level cascade as a whole, the information of the output neuron at the star position comes from the input at the dark-grid positions. As an improvement, Hybrid Atrous Convolution (HAC) adopts gradually increasing dilation rates in the cascade; the upper-layer output neuron can then cover a solid square region (see Fig. 3b), eliminating the "holes" of the checkerboard effect. However, the overlapping sampling present in HAC limits further enlargement of the receptive field and also brings redundant computation. Can overlapping sampling be eliminated while the receptive field is enlarged further? To explore this question, the invention studies how to configure the dilation rates of cascaded atrous convolutions reasonably, and proposes a convolution with a compact sampling property, named Compact Atrous Convolution (CAC). In Fig. 3, a: r_{l−1} = r_l = 2; b: r_{l−1} = 1, r_l = 2; the star marks an output neuron of the layer-l convolution, whose information derives from the red-circled neurons, i.e., the output neurons of the layer l−1 convolution; the information of the dark-circle neurons derives from the layer l−1 input neurons at the dark-grid positions, and a deeper color indicates more samples.
2.2 Compact atrous convolution
Because spatial neighborhoods are correlated, foreground/background semantic inference benefits from rich context dependencies. In theory, standard convolution could compactly encode context at different scales directly. For a large receptive field, however, extracting long-range dependence with one large kernel is not a judicious choice, as it is likely to cause severe overfitting. This section therefore constructs CACs from small kernels (no larger than 5 × 5) to encode context dependencies at different scales.
Suppose CAC_n denotes a compact atrous convolution composed of n cascaded atrous convolution layers with kernel sizes {k_1, k_2, …, k_n} and dilation rates {r_1, r_2, …, r_n}. Ideally, the information of any neuron in the CAC_n output feature map should derive from the input feature map within its corresponding receptive field, with neither information omission ("holes") nor "overlapping sampling". For convenience, the receptive field of a CAC is denoted RFC (Receptive Field of CAC). It is readily seen that, without "holes" or "overlapping sampling", the relation between the RFC and the per-layer kernels satisfies formulas 3-3 and 3-4:

RFC = k_1·k_2·…·k_n   (3-3)

r_i = k_{i+1}·k_{i+2}·…·k_n (zoom-in, decreasing rates)  or  r_i = k_1·k_2·…·k_{i−1} (zoom-out, increasing rates)   (3-4)

with the convention that an empty product equals 1. Given the size of the RFC, CAC_n is then designed by determining the number of convolution layers n, each layer's kernel size k, and its dilation rate r.
(1) When RFC ≤ 5, a single convolution layer meets the receptive-field requirement: k_1 = RFC and r_1 = 1, and CAC_1 degenerates to a standard convolution layer. As RFC increases, a CAC inevitably takes a multi-layer cascaded form. To keep the output neuron at the center of the receptive field, kernel sizes are always chosen to be odd;
(2) When 5 < RFC ≤ 25, two cascaded convolution layers satisfying k_1 × k_2 = RFC are used. Specifically, the invention provides two strategies for constructing the compact atrous convolution: 1) r_1 = k_2, r_2 = 1. Because this arrangement uses decreasing dilation rates, the region perceived by the convolution contracts gradually, so it is named the "zoom-in" focusing strategy: the information of the input features within the RFC is first "squeezed" to one k_2 × k_2 region and then further focused to the central neuron. In Fig. 4a, the input features in the 9 × 9 receptive field are first compressed to the dark-circle positions by the layer l−1 convolution and then focused to the star position by the layer-l convolution. 2) r_1 = 1, r_2 = k_1. Because this arrangement uses increasing dilation rates, the region perceived by the neuron expands gradually, hence the name "zoom-out" focusing strategy: k_1 × k_1 local information is gathered first, and then the information of the k_1 × k_1 regions at different positions across the whole RFC is concentrated onto the central neuron.
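The compactness of such a cascade can be checked numerically. The following sketch (an assumed verification, not from the patent) enumerates, in 1-D, the input offsets sampled by a 2-layer cascade and confirms that a compact configuration hits every position in the RFC exactly once, while equal dilation rates produce holes and repeats:

```python
from collections import Counter

def cascade_samples(kernels, dilations):
    """Input offsets (relative to the output neuron) sampled by a cascade
    of centered 1-D atrous convolutions, with multiplicity."""
    offsets = [0]
    for k, r in zip(kernels, dilations):
        taps = [(i - (k - 1) // 2) * r for i in range(k)]
        offsets = [o + t for o in offsets for t in taps]
    return Counter(offsets)

# Zoom-out CAC with k=(3,3), r=(1,3): RFC = 9, every offset hit once.
print(cascade_samples((3, 3), (1, 3)))  # offsets -4..4, all with count 1
# Equal rates r=(2,2): odd offsets are never sampled, others repeat.
print(cascade_samples((3, 3), (2, 2)))  # the "checkerboard effect"
```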
(3) When RFC > 25, CAC_n (n ≥ 3) is constructed recursively. CAC_n can be regarded as two parts: the nth (or 1st) atrous convolution layer, and the front n−1 (or rear n−1) layers forming a CAC_{n−1}. Here CAC_{n−1} is treated as an ordinary atrous convolution whose kernel size is k_1·k_2·…·k_{n−1} (resp. k_2·k_3·…·k_n) and whose dilation rate is 1. The nth (or 1st) atrous convolution layer is then cascaded with CAC_{n−1} according to the zoom-out (or zoom-in) focusing strategy (see Fig. 5). As in the two-level atrous cascade, the receptive field of CAC_n is RFC = k_1·k_2·…·k_n. By this recursion, the n-level cascade CAC_n as a whole achieves compact sampling of the features within its receptive field. For brevity, CAC_n is written CAC(k_1, k_2, …, k_n; c_1, c_2, …, c_n), r = (r_1, r_2, …, r_n), where k_i, c_i, and r_i denote the kernel size, number of output channels, and dilation rate of the ith layer, respectively. It should be emphasized that a Batch Normalization (BN) layer is additionally inserted between convolution layers to reduce covariate shift and ease training.
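A hedged Keras sketch of a CAC block — cascaded dilated Conv2D layers with BN between them, as described above. This is an assumed illustration, not the authors' code; the ReLU placement and input size are illustrative choices:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cac_block(x, kernels, dilations, channels):
    """Compact atrous convolution: e.g. cac_block(x, (3, 5), (1, 3), 64)
    realizes CAC(3, 5; 64, 64) with r = (1, 3), RFC = 15 (zoom-out)."""
    for k, r in zip(kernels, dilations):
        x = layers.Conv2D(channels, k, dilation_rate=r,
                          padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
    return x

inp = tf.keras.Input(shape=(240, 320, 512))
out = cac_block(inp, kernels=(3, 5), dilations=(1, 3), channels=64)
model = tf.keras.Model(inp, out)
model.summary()
```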
Because the kernel sizes k face certain constraints (formula 3-4 together with the small odd-size requirement), the attainable RFC of a CAC is correspondingly restricted. For example, a CAC with RFC_0 = 13 cannot be designed. In this case an approximate receptive field can be substituted, e.g., RFC_1 = 3 × 4 ≈ RFC_0, i.e., k_1 = 3, k_2 = 4.
2.3 CompactASPP
To overcome the kernel degradation commonly present in multi-scale modules such as ASPP and FPM, the invention extends the CAC to a multi-scale space and proposes a compact sampling module, CompactASPP (Compact Atrous Spatial Pyramid Pooling). CompactASPP consists of 5 parallel groups of CACs with different receptive fields. According to the information aggregation mode, it is further divided into CompactASPP_i and CompactASPP_o: CompactASPP_i is based on the zoom-in focusing strategy, where the dilation rates of all CACs are arranged in decreasing order (see r_i in Fig. 6); CompactASPP_o is based on the zoom-out focusing strategy, where they are arranged in increasing order (see r_o in Fig. 6). For a fair comparison, the receptive fields of the CACs in CompactASPP are chosen to match the multi-scale receptive fields of the FPM one-to-one. As shown in Fig. 6, the 5 groups of CACs produce features at 5 different scales, which are aggregated along the channel dimension and fed into BN and 2D Dropout layers to increase generalization capability.
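An assumed Keras sketch of CompactASPP_o: 5 parallel CACs with different receptive fields, concatenated and passed through BN and 2D Dropout. The specific receptive fields (1, 3, 9, 15, 25), channel width, and dropout rate are illustrative placeholders; the patent matches the receptive fields to those of the FPM:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cac(x, kernels, dilations, channels=64):
    for k, r in zip(kernels, dilations):
        x = layers.Conv2D(channels, k, dilation_rate=r, padding='same',
                          activation='relu')(x)
        x = layers.BatchNormalization()(x)
    return x

def compact_aspp_o(x):
    branches = [
        cac(x, (1,), (1,)),        # RFC 1
        cac(x, (3,), (1,)),        # RFC 3
        cac(x, (3, 3), (1, 3)),    # RFC 9,  increasing rates (zoom-out)
        cac(x, (3, 5), (1, 3)),    # RFC 15
        cac(x, (5, 5), (1, 5)),    # RFC 25
    ]
    y = layers.Concatenate(axis=-1)(branches)
    y = layers.BatchNormalization()(y)
    return layers.SpatialDropout2D(0.25)(y)

inp = tf.keras.Input(shape=(30, 40, 512))
model = tf.keras.Model(inp, compact_aspp_o(inp))
```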
Parameter control principles. Considering that the input feature map of a CAC usually has many channels (e.g., 512) while its output has few (e.g., 64), the CACs are designed according to the following 2 principles to control module size and keep computation efficient: (1) small kernels are placed preferentially at the bottom layers of the CAC; (2) the number of output channels of every convolution layer in the CAC matches that of the last layer. For example, with 512/64 input/output channels and RFC = 15, the CAC is designed as a cascade of 2 atrous convolution layers. Following the above principles, each layer outputs 64 channels, the bottom kernel is 3 × 3 and the top kernel is 5 × 5, written CAC(3, 5; 64, 64). Its parameter count is about 388k (512 × 3 × 3 × 64 + 64 × 5 × 5 × 64), far smaller than the roughly 7M (512 × 15 × 15 × 64) parameters of a standard convolution with the same receptive field.
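A quick check of the parameter comparison above (bias terms ignored, as in the text):

```python
cac_params = 512 * 3 * 3 * 64 + 64 * 5 * 5 * 64
std_params = 512 * 15 * 15 * 64
print(cac_params)               # 397312  (~388k, as stated)
print(std_params)               # 7372800 (~7M)
print(std_params / cac_params)  # ~18.6x larger
```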
3 Fast X-Net
As shown in Fig. 7, Fast X-Net is likewise an X-shaped architecture consisting of an encoding sub-network, a decoding sub-network, and a fusion sub-network that integrates temporal features. Its basic network units include: 3 × 3 convolution layers (conv), 2 × 2 max pooling layers (max pooling), 1 × 1, 3 × 3, and 5 × 5 transposed convolution layers (Tconv), random dropout layers (dropout), ReLU and sigmoid activation functions, and the CompactASPP multi-scale feature encoding module.
Fast X-Net removes the image pyramid strategy used by the X-Net encoding sub-network and embeds CompactASPP at the top of the encoding sub-network. The encoding sub-network then needs to run only once to extract multi-scale spatial-domain features from a single image, whereas the image pyramid strategy requires running the encoder multiple times for the same task. As shown in Fig. 7, the encoding sub-network of Fast X-Net (including the CompactASPP module) consists of 2 twin branches with identical structure and shared weights, which extract features of the same pattern from two similar frames. The multi-scale feature representations extracted from two consecutive frames are aggregated along the channel dimension and fed into the fusion sub-network.
The fusion sub-network is a single-stream structure. It fuses the multi-scale spatial-domain features extracted from the 2 frames to realize temporal-spatial feature encoding, and the resulting spatio-temporal multi-scale features enter the decoding sub-network for decoding.
The decoding sub-network consists of two structurally identical but mutually independent branches, each producing one foreground mask at a time (right side of Fig. 7). Note that Fast X-Net also removes the cross-layer connections between the encoder and decoder of X-Net to further improve computational efficiency.
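A coarse, assumed Keras sketch of the Fast X-Net layout just described: twin shared-weight encoders (topped by CompactASPP), a single-stream fusion sub-network, and two independent decoder branches. Layer widths, depths, and the CompactASPP internals are illustrative placeholders, not the patented configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(shape):
    inp = tf.keras.Input(shape=shape)
    x = inp
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)
    # CompactASPP would sit here, on top of the encoder (see above).
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    return tf.keras.Model(inp, x, name='twin_encoder')

def build_decoder(x, name):
    for filters in (256, 128, 64):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding='same',
                                   activation='relu')(x)
    return layers.Conv2D(1, 1, activation='sigmoid', name=name)(x)

shape = (240, 320, 3)
frame_a, frame_b = tf.keras.Input(shape=shape), tf.keras.Input(shape=shape)
encoder = build_encoder(shape)          # one instance -> shared weights
feats = layers.Concatenate()([encoder(frame_a), encoder(frame_b)])
fused = layers.Conv2D(128, 3, padding='same', activation='relu')(feats)
masks = [build_decoder(fused, 'mask_a'), build_decoder(fused, 'mask_b')]
fast_xnet = tf.keras.Model([frame_a, frame_b], masks)
```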
Analysis of experiments
To verify the performance of the multi-scale feature encoding module CompactASPP, experiments were performed on multiple datasets. The experimental settings are introduced first; an ablation study is then carried out on the CDnet 2014 dataset to verify the effectiveness and advancement of the algorithm; finally, supplementary experiments are conducted on the SBI2015 dataset.
Experimental setup
Model training and optimization. End-to-end training is carried out on the Fast X-Net network shown in Fig. 7. To exploit high-level semantic knowledge and improve training efficiency, the encoding sub-network is initialized with VGG-16 pre-trained on ImageNet. The experiments use the Keras framework with a TensorFlow backend as the deep learning platform and optimize the model based on SFL. Parameters are updated with the RMSProp optimizer, with epsilon and the initial learning rate set to 1e-8 and 1e-4, respectively, and batch size set to 1. The maximum number of training epochs per scene is 60 and the early-stop threshold is 10. It should be emphasized that no gradient back-propagation is performed for non-region-of-interest (NON-ROI) and uncertain boundary regions.
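An assumed sketch of the training configuration quoted above (RMSProp, lr = 1e-4, epsilon = 1e-8, batch size 1, 60 epochs, early-stop patience 10). The loss is a placeholder, since the patent optimizes with SFL, for which no standard Keras implementation is assumed here:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4, epsilon=1e-8)
early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)
# model.compile(optimizer=optimizer, loss='binary_crossentropy')
# model.fit(train_pairs, epochs=60, batch_size=1, callbacks=[early_stop])
```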
Training sample selection. For the CDnet 2014 dataset, model performance is evaluated under both the 50-frame (m = 50) and 20-frame (m = 20) sample settings. For 50-frame samples, the data provided by FgSegNet are adopted: 80% of the samples are randomly selected to construct the training set (40 frames), 20% are used for model validation (10 frames), and the remaining samples of the whole video serve as test samples. Manually selecting training (and validation) data at random from the entire video sequence is common practice in academia, but taking the whole sequence as the sampling domain is somewhat unreasonable: many foreground instances in the test samples may have "appeared" in the training samples, which can lead to overestimating model performance. In actual deployment, by contrast, foreground instances in the video are typically "unseen" instances, and model performance naturally suffers. To avoid such overestimation, the 20-frame experiment adopts a harder setting: a sub-sequence near the start or end of the whole video sequence is chosen as the sampling domain whenever possible, and sampling is performed at equal intervals to reduce the influence of human factors. For example, the complete video sequence of the highway scene is [470, 1700] (the remaining frames are not considered because the dataset provides no ground truth for them), and the 20-frame experiment selects samples at an interval of 12 starting from frame 1424; the specific sample sequence is [1424, 1436, 1448, …, 1652].
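The 20 equally spaced highway-scene sample indices quoted above can be reproduced as follows (a trivial check, not code from the patent):

```python
samples = [1424 + 12 * i for i in range(20)]
print(samples[0], samples[-1], len(samples))  # 1424 1652 20
```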
Training set construction. Unlike single-frame-input models, Fast X-Net is a "paired-input" network and requires paired frames to construct the training set. Given m frames, the training set is constructed with the following strategy (a counting sketch follows this list):
a. Sort the m frames in ascending order and renumber them 1, 2, …, m.
b. To generate "frame pairs", match each frame with its neighbors within a window of length 2 × interval; all "frame pairs" satisfying the condition constitute the training set.
For the case of few labeled samples, the experiments set interval to 6 to enlarge the training set. In this setting, the first three frames and the last three frames can match 4, 5, and 6 "frame pairs" respectively, and each intermediate frame can match 7. When m = 20, the training set contains 128 "frame pairs" (4 + 5 + 6 + 14 × 7 + 6 + 5 + 4).
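The sketch below is one reading of this pairing rule (an assumption of this rewrite): each frame is paired with every frame within ±interval/2 of it, self-pairs included. Under that reading, m = 20 reproduces exactly the per-frame counts and the total of 128 stated above:

```python
def build_frame_pairs(m: int, half_window: int):
    """Pair every frame with all frames within +/- half_window of it,
    self-pairs included -- a hypothetical reading of the windowing rule."""
    pairs = []
    for i in range(1, m + 1):
        for j in range(max(1, i - half_window), min(m, i + half_window) + 1):
            pairs.append((i, j))
    return pairs

pairs = build_frame_pairs(m=20, half_window=3)  # half_window = interval / 2
print(len(pairs))  # 128 = 4 + 5 + 6 + 14*7 + 6 + 5 + 4
```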
Results and analysis of the experiments
CDnet 2014 dataset experiments. As shown in Table 1, the average recall (Re) and average precision (Pr) of Fast X-Net in the 50-frame experiment both exceed 97%, and the overall performance (F-measure) reaches 0.976. The average F-measure of the BL category is the highest of all categories, reaching 0.995, while that of the LF category is the lowest but still exceeds 0.9. The main reason is that LF contains a very challenging scene with very few, very small foreground objects and a dynamically changing background highly similar to the foreground features, all of which cause serious interference. It should be emphasized that the model performance indices in Table 1 are computed on test samples only; the 50 frames used for training and validation take no part in the performance evaluation.
Table 1. Fast X-Net test results based on 50-frame samples on the CDnet 2014 dataset.
[Table reproduced as an image in the original publication.]
Comparison with advanced algorithms. Fast X-Net is compared with 7 other high-performance methods. As shown in Table 2, X-Net, FgSegNet_M, and Cascade CNN employ an image pyramid strategy to extract multi-scale information; FgSegNet_S and FgSegNet_v2 implement multi-scale feature encoding with an ASPP-like feature pyramid module; 3D SegNet extracts spatio-temporal features with 3D spatio-temporal filtering; IUTIS-5 is a traditional unsupervised ensemble method that combines the outputs of several high-performance traditional background modeling methods through a genetic algorithm. It should be emphasized that 3D SegNet uses 70% of the samples as its training set, while the other supervised methods use the same 50-frame samples for training.
Table 2. Comparison of F-measure performance between Fast X-Net and other high-performance methods.
[Table reproduced as an image in the original publication.]
Among these, Cascade CNN, FgSegNet_M, and X-Net (ours) use the image pyramid strategy, while FgSegNet_S, FgSegNet_v2, and Fast X-Net (ours) use the feature pyramid strategy.
In the 50-frame experiments, the deep-network-based supervised methods clearly outperform the traditional unsupervised method based on low-level features (IUTIS-5), especially in the extremely challenging categories such as PTZ, NV, and LF. Moreover, the proposed Fast X-Net outperforms the other existing methods by a clear margin. It should be emphasized that FgSegNet_S is the baseline comparison method of the invention: like Fast X-Net, it models multi-scale features with a feature pyramid strategy, and its multi-scale receptive fields are essentially the same. In the 20-frame experiments, the performance advantage of Fast X-Net widens further: as shown in Table 3, the F-measure of Fast X-Net increased by 2.3% and 1.5% compared with X-Net, FgSegNet_S, and FgSegNet_v2. In terms of speed, the foreground segmentation speed of CompactASPP-based Fast X-Net is clearly superior to that of the image pyramid methods, improving on X-Net and FgSegNet_M by 63.6% and 50%, respectively.
Table 3. Comparison of the overall performance of Fast X-Net and 4 high-performance methods (20 frames); speed tests are based on an NVIDIA 1080 Ti GPU at a video resolution of 320 × 240.
[Table reproduced as an image in the original publication.]
Visualization results. To demonstrate the performance of Fast X-Net and its CompactASPP module more intuitively, some qualitative visualization results are provided. As shown in Fig. 8, the proposed method is robust to foreground scale variation and shows high recall for large, medium, and small foreground objects. In the nightVideos scenes, Fast X-Net overcomes the challenge of sudden illumination changes better than FgSegNet_S and therefore produces fewer false positive samples, because the kernel degradation problem is avoided. In Fig. 8, a: input frame, b: ground truth, c: Fast X-Net, d: FgSegNet_v2, e: FgSegNet_S, f: X-Net.
CompactASPP ablation experiments. Two groups of comparison experiments were designed to further verify the performance of the CompactASPP module. In the first, the entire multi-scale module is removed from Fast X-Net; the modified network, called Fast X-Net_baseline, tests the performance of the network without a multi-scale model. In the second, the CompactASPP module is replaced by an ASPP module with receptive fields of the same size; the modified network is called Fast X-Net_aspp. As shown in Table 4, the ASPP module yields an average F-measure of 0.974 in the 50-frame experiment, a 0.5% improvement over the multiscale-free Fast X-Net_baseline, and CompactASPP brings a further 0.2% improvement on that basis. The corresponding improvements are more pronounced in the 20-frame experiment, where the CompactASPP module gives the network a 0.4% gain over the ASPP module. The visualization results in Fig. 9 indicate that the compact sampling pattern helps correct otherwise misclassified positive samples; in Fig. 9, a: input frame, b: ground truth, c: Fast X-Net_baseline, d: Fast X-Net_aspp, e: Fast X-Net_i. In addition, CompactASPP_i and CompactASPP_o show similar performance, which means multi-scale features can be extracted effectively with either of the two proposed compact sampling strategies. Unless stated otherwise, the experimental results for Fast X-Net are mainly based on CompactASPP_i.
Table 4. Multi-scale module comparison experiments.
[Table reproduced as an image in the original publication.]
SBI2015 dataset experiments. To further verify the advancement of the Fast X-Net algorithm, supplementary experiments were performed on the SBI2015 and UCSD datasets. SBI2015 contains 14 videos and provides ground truth for the entire video. For ease of comparison, the same training setup and training samples as FgSegNet are used: 16% of all samples are used for model training, 4% for model validation, and the remaining 80% constitute the test set. As shown in Table 5, Fast X-Net beats the other 4 high-performance algorithms and achieves the best performance. It is worth emphasizing that FgSegNet_v2 had already obtained a very high F-measure (0.984), which the proposed method improves by a further 0.5%. Among all scenes, Toscana's performance is the lowest (0.962), mainly because that video contains only 6 frames, of which Fast X-Net uses only 2 for model training; with so few training samples, the model inevitably overfits. Nevertheless, 0.962 is still acceptable performance, which also shows the robustness of Fast X-Net in small-sample learning.
Table 5. Fast X-Net test results on the SBI2015 dataset and comparison with high-performance methods.
[Table reproduced as an image in the original publication.]
Fig. 10 shows all training frames (column a) and a typical test frame (column b). In Fig. 10, a: all training frames, b: typical test frame, c: ground truth, d: foreground mask. A large number of false positive samples appear in the red circle of column d because the corresponding region of all training frames in column a is foreground; the model never "saw" the background information of that region and therefore misjudges it.

Claims (10)

1. A compact multi-scale video foreground segmentation method is characterized by comprising the following steps:
let x[i] and y[i] denote the input signal and the output signal, respectively; the atrous (hole) convolution operation is defined as

y[i] = Σ_{idx=1}^{K} x[i + r·idx]·f[idx]   (3-1)

wherein f[idx] is a filter of length K and the dilation rate r represents the corresponding sampling stride; when r = 1, the atrous convolution degenerates to the standard convolution operation; when the k × k kernel of a 2D atrous convolution samples/convolves a region of size k_a × k_a of the input feature x, the receptive field of the atrous convolution is said to be k_a, with

k_a = k + (r − 1)·(k − 1)   (3-2)

a larger dilation rate means a larger receptive field; to obtain a wider receptive field and richer context information, a plurality of atrous convolutions are applied, in parallel or in cascade, to a high-level feature map that has undergone a series of convolution and downsampling operations;
suppose CAC_n denotes a compact atrous convolution composed of n cascaded atrous convolution layers whose kernel sizes and dilation rates are {k_1, k_2, …, k_n} and {r_1, r_2, …, r_n}, respectively; in CAC_n, the information of any neuron in the output feature map derives from the input feature map within its corresponding receptive field, with neither information omission nor "overlapping sampling";

the receptive field of the compact atrous convolution is denoted RFC; under the condition of no information omission and no "overlapping sampling", the relation between the RFC and the per-layer kernels satisfies formulas 3-3 and 3-4:

RFC = k_1·k_2·…·k_n   (3-3)

r_i = k_{i+1}·k_{i+2}·…·k_n (zoom-in, decreasing rates)  or  r_i = k_1·k_2·…·k_{i−1} (zoom-out, increasing rates)   (3-4)

according to the size of the RFC, CAC_n is designed by determining the number of convolution layers n, each layer's kernel size k, and its dilation rate r.
2. The compact multi-scale video foreground segmentation method of claim 1, characterized in that: when RFC ≤ 5, a single convolution layer meets the receptive-field requirement; in this case k_1 = RFC and r_1 = 1, and the compact atrous convolution degenerates to a standard convolution layer; as RFC increases, the compact atrous convolution inevitably takes a multi-layer cascaded form; to ensure that the output neuron lies at the center of the receptive field, kernel sizes are always chosen to be odd.
3. The compact multi-scale video foreground segmentation method of claim 1, characterized in that: when 5 < RFC ≤ 25, two cascaded convolution layers satisfying k_1 × k_2 = RFC are used;
1) r_1 = k_2, r_2 = 1: since this arrangement uses decreasing dilation rates, the region perceived by the convolution contracts gradually, i.e., the "zoom-in focusing" strategy; the information of the input features within the RFC range is first "squeezed" to a k_2 × k_2 region and then further focused to the central neuron;
2) r_1 = 1, r_2 = k_1: since this arrangement uses increasing dilation rates, the region perceived by the neuron expands gradually, i.e., the "zoom-out focusing" strategy; k_1 × k_1 local information is collected first, and then the information of the k_1 × k_1 regions at different positions across the whole RFC is concentrated onto the central neuron.
4. The compact multi-scale video foreground segmentation method of claim 3, characterized in that: when RFC > 25, CAC_n is constructed recursively, wherein n ≥ 3; CAC_n can be regarded as two parts: the nth (or 1st) atrous convolution layer, and the front n−1 (or rear n−1) layers forming a CAC_{n−1}; wherein CAC_{n−1} is treated as an ordinary atrous convolution whose kernel size is k_1·k_2·…·k_{n−1} (resp. k_2·k_3·…·k_n) and whose dilation rate is 1; in this case, the nth (or 1st) atrous convolution layer is cascaded with CAC_{n−1} based on the zoom-out (or zoom-in) focusing strategy; as in a two-level atrous convolution cascade, the receptive field of CAC_n is RFC = k_1·k_2·…·k_n; based on this recursion, the CAC_n formed by the n-level atrous convolution cascade achieves compact sampling of features in the receptive field as a whole; CAC_n is denoted CAC(k_1, k_2, …, k_n; c_1, c_2, …, c_n), r = (r_1, r_2, …, r_n), wherein k_i, c_i, and r_i respectively denote the kernel size, number of output channels, and dilation rate of the ith layer.
5. The compact multi-scale video foreground segmentation method of claim 4, characterized in that: the compact atrous convolution is extended to a multi-scale space by a multi-scale compact sampling module consisting of 5 groups of parallel compact atrous convolutions with different receptive fields; the compact atrous convolutions are divided into CompactASPP_i and CompactASPP_o; CompactASPP_i is based on the zoom-in focusing strategy, with the dilation rates of the compact atrous convolutions arranged in decreasing order; CompactASPP_o is based on the zoom-out focusing strategy, with the dilation rates arranged in increasing order.
6. The compact multi-scale video foreground segmentation method of claim 5, characterized in that: the input feature map of the compact atrous convolution has many channels while the output has few; to control module size and ensure computational efficiency, small kernels are placed preferentially at the bottom layers of the compact atrous convolution, and the number of output channels of every convolution layer matches that of the last layer.
7. The compact multi-scale video foreground segmentation method of claim 1, characterized in that: Fast X-Net is an X-shaped framework formed by an encoding sub-network, a decoding sub-network, and a fusion sub-network that integrates temporal features; its basic network units comprise: 3 × 3 convolution layers, 2 × 2 max pooling layers, 1 × 1, 3 × 3, and 5 × 5 transposed convolution layers (Tconv), random dropout layers, ReLU and sigmoid activation functions, and the multi-scale feature encoding module.
8. The compact multi-scale video foreground segmentation method of claim 7, characterized in that: Fast X-Net removes the image pyramid strategy used by the X-Net encoding sub-network and embeds the multi-scale compact sampling module at the top of the encoding sub-network; multi-scale spatial-domain features can be extracted efficiently from a single image by running the encoding sub-network only once; the encoding sub-network of Fast X-Net consists of 2 twin branches with identical structure and shared weights, which extract features of the same pattern from two similar frames; the multi-scale features extracted from two consecutive frames are aggregated along the channel dimension and fed into the fusion sub-network.
9. The compact multi-scale video foreground segmentation method of claim 7, characterized in that: the fusion sub-network is a single-stream structure that fuses the multi-scale spatial-domain features extracted from the 2 frames to realize temporal-spatial feature encoding; the generated spatio-temporal multi-scale features enter the decoding sub-network for decoding.
10. The compact multi-scale video foreground segmentation method of claim 7, characterized in that: the decoding sub-network consists of two structurally identical, mutually independent branches, each of which produces one foreground mask at a time.
CN202110729146.XA 2021-06-29 2021-06-29 Compact multi-scale video foreground segmentation method Pending CN113592878A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729146.XA CN113592878A (en) 2021-06-29 2021-06-29 Compact multi-scale video foreground segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729146.XA CN113592878A (en) 2021-06-29 2021-06-29 Compact multi-scale video foreground segmentation method

Publications (1)

Publication Number Publication Date
CN113592878A true CN113592878A (en) 2021-11-02

Family

ID=78245109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729146.XA Pending CN113592878A (en) 2021-06-29 2021-06-29 Compact multi-scale video foreground segmentation method

Country Status (1)

Country Link
CN (1) CN113592878A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310894A (en) * 2023-02-22 2023-06-23 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope
CN116958535A (en) * 2023-04-14 2023-10-27 三峡大学 Polyp segmentation system and method based on multi-scale residual error reasoning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN112446890A (en) * 2020-10-14 2021-03-05 浙江工业大学 Melanoma segmentation method based on void convolution and multi-scale fusion

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446890A (en) * 2020-10-14 2021-03-05 浙江工业大学 Melanoma segmentation method based on void convolution and multi-scale fusion
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIN ZHANG et al.: "A fast X-shaped foreground segmentation network with CompactASPP", Engineering Applications of Artificial Intelligence *
JIN ZHANG et al.: "X-Net: A Binocular Summation Network for Foreground Segmentation", IEEE *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310894A (en) * 2023-02-22 2023-06-23 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope
CN116310894B (en) * 2023-02-22 2024-04-16 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope
CN116958535A (en) * 2023-04-14 2023-10-27 三峡大学 Polyp segmentation system and method based on multi-scale residual error reasoning
CN116958535B (en) * 2023-04-14 2024-04-16 三峡大学 Polyp segmentation system and method based on multi-scale residual error reasoning

Similar Documents

Publication Publication Date Title
CN110120011B (en) Video super-resolution method based on convolutional neural network and mixed resolution
CN108596330B (en) Parallel characteristic full-convolution neural network device and construction method thereof
CN111582316B (en) RGB-D significance target detection method
CN109360156A (en) Single image rain removing method based on the image block for generating confrontation network
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN110580704A (en) ET cell image automatic segmentation method and system based on convolutional neural network
CN111488932B (en) Self-supervision video time-space characterization learning method based on frame rate perception
CN113592878A (en) Compact multi-scale video foreground segmentation method
Li et al. End-to-end learning of deep convolutional neural network for 3D human action recognition
Zou et al. Crowd counting via hierarchical scale recalibration network
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN113378775B (en) Video shadow detection and elimination method based on deep learning
CN110738663A (en) Double-domain adaptive module pyramid network and unsupervised domain adaptive image segmentation method
CN113065645A (en) Twin attention network, image processing method and device
CN110706239A (en) Scene segmentation method fusing full convolution neural network and improved ASPP module
CN112767466A (en) Light field depth estimation method based on multi-mode information
CN113538243B (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
CN114638836A (en) Urban street view segmentation method based on highly effective drive and multi-level feature fusion
CN114663309A (en) Image defogging method and system based on multi-scale information selection attention mechanism
Jo et al. Multi-scale selective residual learning for non-homogeneous dehazing
CN114519383A (en) Image target detection method and system
CN114596233A (en) Attention-guiding and multi-scale feature fusion-based low-illumination image enhancement method
CN113362239A (en) Deep learning image restoration method based on feature interaction
CN113313030B (en) Human behavior identification method based on motion trend characteristics
Yun et al. Coarse-to-fine video denoising with dual-stage spatial-channel transformer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20211102