CN117197183A - Moving object detection method based on multi-scale expansion convolution encoding-decoding

Moving object detection method based on multi-scale expansion convolution encoding-decoding

Info

Publication number
CN117197183A
Authority
CN
China
Prior art keywords
nth
processing
feature
layer
module
Prior art date
Legal status
Pending
Application number
CN202311179071.8A
Other languages
Chinese (zh)
Inventor
杨依忠
夏婷婷
张景润
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202311179071.8A priority Critical patent/CN117197183A/en
Publication of CN117197183A publication Critical patent/CN117197183A/en


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a moving object detection method based on multi-scale expansion convolution encoding-decoding, which comprises the following steps: 1. constructing a multi-information video sequence set; 2. constructing an encoding-decoding model with multi-scale expansion convolution; 3. training the network model; 4. testing the network model. The invention addresses the problems that detail features in a traditional encoder-decoder cannot be propagated to deeper layers and that only simple scenes can be detected, so that foreground objects in real complex scenes, and in particular small objects affected by background information, can be extracted quickly and accurately, improving foreground detection capability.

Description

Moving object detection method based on multi-scale expansion convolution encoding-decoding
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a moving object detection method based on multi-scale expansion convolution encoding-decoding.
Background
Moving object detection is a hot topic in computer vision and plays an important role in fields such as human motion analysis, anomaly detection, human-computer interaction, and robot navigation. Detecting locally changing foreground objects in a video scene remains challenging because of illumination changes, shadows, dynamic backgrounds, and similar effects. One of the most widely used approaches to extracting foreground objects is background subtraction.
Traditional algorithms are unsupervised and rely on modeling a background model to distinguish moving objects. They are easily disturbed by scene factors, can only handle scenes with simple backgrounds, show strong scene dependence, and do not yet meet the standard required for detecting unseen videos unrelated to the training videos.
Recently, deep-learning-based algorithms have shown excellent scene learning ability, with detection accuracy far exceeding that of conventional moving object detection methods. Among them, the encoder-decoder structure is used most often, but it is prone to information loss during sub-sampling and to degraded detection of small objects because of its limited receptive field.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a moving object detection method based on multi-scale expansion convolution encoding-decoding, so as to solve the problems that detail features in a traditional encoder-decoder cannot be propagated to deeper layers and that only simple scenes can be detected, thereby extracting foreground objects in real complex scenes quickly and accurately, in particular small objects affected by background information, and further improving foreground detection capability.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The invention discloses a moving object detection method based on multi-scale expansion convolution encoding-decoding, which is characterized by comprising the following steps:
step 1, constructing a multi-information video sequence set;
Step 1.1: select T video sequences from video data with pixel-level labels and normalize every image frame, obtaining a normalized video sequence set S = {S_1, S_2, ..., S_t, ..., S_T}, where S_t denotes the t-th video sequence, f_i^t is the i-th image frame of the t-th video sequence S_t, and M is the number of image frames in each video sequence; the ground-truth pixel-level label of the t-th video sequence is denoted Y_t ∈ {0, 1};
Step 1.2: process the first X frames and the first Y frames of S_t, respectively, to obtain the t-th "empty" video sequence S'_t and the t-th recent video sequence S''_t, where 1 < X < Y < M;
Step 1.3: process S_t, S'_t, and S''_t with the pre-trained HRnet network to obtain the t-th original semantic information video sequence F_t, the t-th "empty" semantic information video sequence F'_t, and the t-th recent semantic information video sequence F''_t;
Step 1.4: construct the t-th multi-information video sequence C_t = {S_t, S'_t, S''_t, F_t, F'_t, F''_t};
Step 2: construct a multi-scale expansion convolution encoding-decoding network comprising a multi-scale expansion convolution encoding module, a feature enhancement and detail processing module, and a feature decoding module;
Step 2.1: construct the multi-scale expansion convolution encoding module, which is formed by N encoding blocks connected in series; each encoding block consists of one expansion convolution block and one convolution block. The expansion convolution block contains a parallel convolution layers with different dilation rates; the outputs of the a convolution layers are summed, processed by a Dropout layer, and then fed into the subsequent convolution block. The convolution block is a convolution layer, a BN layer, and a ReLU layer connected in series;
The t-th multi-information video sequence C_t is fed into the multi-scale expansion convolution encoding module to obtain the t-th multi-scale encoded feature sequence O_t = {O_t^1, O_t^2, ..., O_t^n, ..., O_t^N} output by the N encoding blocks, where O_t^n denotes the multi-scale encoded feature output by the n-th encoding block;
Step 2.2: construct the feature enhancement and detail processing module, which is formed by N feature processing blocks connected in parallel; each feature processing block consists of a parallel attention module and a detail processing module connected in parallel; the parallel attention module consists of a position attention unit and a channel attention unit connected in parallel;
Step 2.2.1: the position attention unit of the parallel attention module in the n-th feature processing block processes O_t^n to obtain the n-th position fusion feature E_t^n;
Step 2.2.2: the n-th channel attention unit processes O_t^n to obtain the n-th channel fusion feature E'_t^n;
Step 2.2.3: E_t^n and E'_t^n are linearly added and fed into a convolution layer to obtain the n-th position-channel fusion feature H_t^n output by the n-th parallel attention module;
Step 2.2.4: the detail processing module processes O_t^n to obtain the n-th detail processing feature SC_t^n;
Step 2.2.5: the n-th position-channel fusion feature H_t^n and the n-th detail processing feature SC_t^n are added to obtain the n-th information-enhanced feature HSC_t^n output by the n-th feature processing block, thereby obtaining the t-th information-enhanced feature sequence HSC_t = {HSC_t^1, HSC_t^2, ..., HSC_t^n, ..., HSC_t^N} output by the feature enhancement and detail processing module;
Step 2.3, the feature decoding module consists of N deconvolution modules and a Sigmoid layer, wherein each deconvolution module consists of a deconvolution layer, a BN layer, a convolution layer, a BN layer and a ReLU layer which are connected in series;
The N-th information-enhanced feature HSC_t^N is sent to the N-th deconvolution module; the result is linearly added to HSC_t^{N-1} and sent to the (N-1)-th deconvolution module; that result is linearly added to HSC_t^{N-2}, and so on; the final summed result is sent to the 1st deconvolution module, and its output is processed by the Sigmoid layer, thereby obtaining the pixel-wise foreground probability Ŷ_t predicted for the t-th multi-information video sequence C_t;
Step 3, training a network:
Step 3.1: establish with Eq. (1) the t-th loss L_t between the predicted foreground probability Ŷ_t and the pixel-level label Y_t;
In Eq. (1), e is a smoothing parameter and (m, n) denotes a spatial pixel position;
Step 3.2: based on the T multi-information video sequences {C_t | t = 1, 2, ..., T}, back-propagate the T losses L_t through the multi-scale expansion convolution encoding-decoding network with the Adam optimizer and continuously update the network parameters until the loss function converges, obtaining the trained multi-scale expansion convolution encoding-decoding model;
step 4, processing the intermittent object moving image to be predicted by using the trained multi-scale expansion convolution coding-decoding model to obtain the pixel-by-pixel foreground probability corresponding to the intermittent object moving image;
Set a threshold P and compare the pixel-wise foreground probability of each pixel with P: pixels greater than P are set as foreground pixels and the remaining pixels as background pixels, yielding the moving object segmentation result in the intermittent-object-motion image to be predicted.
The moving object detection method based on multi-scale expansion convolution encoding-decoding of the present invention is also characterized in that the step 2.2.1 includes:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th position attention unit processes O_t^n with L adaptive pooling layers of pooling scales l_1, l_2, ..., l_r, ..., l_L and with a convolution layer, obtaining L pooled features R_t^1, R_t^2, ..., R_t^r, ..., R_t^L and a dimension-reduced feature F_0, where l_r denotes the pooling scale of the r-th adaptive pooling layer and R_t^r denotes the pooled feature of O_t^n obtained through the r-th adaptive pooling layer. R_t^1, R_t^2, ..., R_t^r, ..., R_t^L are then concatenated to obtain the n-th aggregation center W_t^n, where C denotes the number of channels of the n-th aggregation center and J denotes the sum of the pooling scales of the L adaptive pooling layers, J = l_1 + l_2 + ... + l_r + ... + l_L.
The n-th aggregation center W_t^n is fed into the fully connected layer of the n-th position attention unit, the result is multiplied with the dimension-reduced feature F_0, and the product is processed by a Softmax layer to obtain the n-th position attention map S_t^n; S_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th position fusion feature E_t^n output by the n-th parallel attention module.
The step 2.2.2 comprises:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th channel attention unit uses one convolution layer to reduce the dimension of O_t^n, multiplies the result with the transpose of O_t^n, and processes the product with a Softmax layer to obtain the n-th channel attention map U_t^n; U_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th channel fusion feature E'_t^n output by the n-th channel attention unit.
The step 2.2.4 includes:
The detail processing module is formed by an edge detail branch and a context information branch connected in parallel. In the detail processing module corresponding to the n-th feature processing block, the n-th edge detail branch applies the Sobel operator to the n-th multi-scale encoded feature O_t^n in the horizontal and vertical directions; the two results are linearly added and fed into a convolution layer, obtaining the n-th edge feature SB_t^n.
The context information branch in the detail processing module corresponding to the n-th feature processing block feeds the n-th multi-scale encoded feature O_t^n through a convolution layer and a Softmax layer in turn; the result is then processed by a convolution layer and a LeakyReLU activation layer, obtaining the n-th context feature CB_t^n.
SB_t^n and CB_t^n are concatenated along the channel dimension and fed into a convolution layer to obtain the n-th detail processing feature SC_t^n output by the n-th detail processing module.
The electronic device of the present invention includes a memory and a processor, wherein the memory is configured to store a program for supporting the processor to execute the moving object detection method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the moving object detection method.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention replaces conventional convolution with expansion (dilated) convolution in the encoder convolution layers, alleviating information loss during sub-sampling and enlarging the receptive field of the feature maps; by fusing information from different scales, the model learns more comprehensive features and improves detection performance.
2. The invention designs a set of multi-branch hybrid expansion convolution blocks with several different dilation rates, which avoids the gridding effect of expansion convolution, preserves the continuity and correlation of information, and mitigates the degraded detection of small objects caused by the limited receptive field of deep learning networks.
3. The novel background subtraction model provided by the invention can be easily deployed in unseen scenes, overcoming the limitations that traditional algorithms handle only simple scenes and that existing deep learning models do not generalize to unseen scenes; it copes with the diversity and variability of real application scenarios and brings a notable improvement to background subtraction.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a diagram of a network architecture based on multi-scale expansion convolutional encoding-decoding in accordance with the present invention;
FIG. 3a is a partial frame image of a partial video sequence in a test set used in accordance with the present invention;
FIG. 3b shows the ground-truth pixel-level labels corresponding to partial frame images of video sequences in the test set used by the present invention;
FIG. 3c is a diagram of semantic information corresponding to partial frame images in a partial video sequence in a test set using a pre-trained HRnet network according to the present invention;
fig. 3d shows a binarized foreground segmentation map obtained using a multi-scale dilation convolutional encoding-decoding network in accordance with the present invention.
Detailed Description
In this embodiment, a moving object detection method based on multi-scale expansion convolution encoding-decoding first constructs a multi-information video sequence set as input with a pre-trained HRnet network, then fuses multi-scale context information with the multi-scale expansion convolution encoding module and the feature enhancement and detail processing module to capture foreground information, and finally obtains the foreground segmentation map of moving objects through the feature decoding module. As shown in fig. 1, the specific steps are as follows:
step 1, constructing a multi-information video sequence set;
Step 1.1: select T video sequences from video data with pixel-level labels and normalize every image frame, obtaining a normalized video sequence set S = {S_1, S_2, ..., S_t, ..., S_T}, where S_t denotes the t-th video sequence, f_i^t is the i-th image frame of the t-th video sequence S_t, and M is the number of image frames in each video sequence; the ground-truth pixel-level label of the t-th video sequence is denoted Y_t ∈ {0, 1};
In this embodiment, the number of video sequences T is set to 49 and the number of image frames M in each video sequence to 1000, but the values are not limited to these. The training and test sets are made from the CDnet-2014 dataset, which contains 10 categories and 49 video scenes with multiple challenges, including dynamic background, shadows, bad weather, low frame rate, intermittent object motion, turbulence, etc.; one video is extracted from each category to create a test set containing only unseen videos, while the remaining videos are used for training.
Step 1.2: process the first X frames and the first Y frames of S_t, respectively, to obtain the t-th "empty" video sequence S'_t and the t-th recent video sequence S''_t, where 1 < X < Y < M. In this specific embodiment, X = 50 and Y = 100, but the values are not limited to these.
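The patent text does not spell out how the first X and first Y frames are turned into the "empty" and recent reference sequences; a common choice in background subtraction pipelines is a temporal statistic such as the median. The sketch below is a minimal illustration under that assumption; the function name and the use of the median are hypothetical, not taken from the patent.

```python
import numpy as np

def reference_frame(frames: np.ndarray, k: int) -> np.ndarray:
    """Temporal median of the first k frames of an (M, H, W, 3) sequence.

    Assumption: the patent's 'empty'/recent sequences are derived from a
    temporal statistic over the first X / first Y frames; the median is one
    plausible choice, not necessarily the inventors' exact procedure.
    """
    return np.median(frames[:k], axis=0)

# Hypothetical usage with X = 50 and Y = 100 as in this embodiment:
# frames    = load_sequence(...)              # shape (1000, H, W, 3), normalized
# empty_bg  = reference_frame(frames, 50)     # reference for S'_t
# recent_bg = reference_frame(frames, 100)    # reference for S''_t
```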
Step 1.3: process S_t, S'_t, and S''_t with the pre-trained HRnet network to obtain the t-th original semantic information video sequence F_t, the t-th "empty" semantic information video sequence F'_t, and the t-th recent semantic information video sequence F''_t;
In this embodiment, the HRnet network is pre-trained on the ADE20K dataset, whose classes are c ∈ {c_0, c_1, ..., c_149}; 12 classes in the dataset, such as person, car, and truck, are used as foreground and the rest as background. Since the HRnet network provides the pixel-wise probability p_c of class c, let I[m, n] be the input frame at spatial position (m, n); the semantic information at [m, n] is obtained by summing p_c[m, n] over the foreground classes c ∈ Q, yielding the respective semantic information video sequences, where Q denotes the set of the 12 foreground classes.
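A minimal sketch of collapsing the HRnet class probabilities into a single semantic foreground map by summing over the foreground classes is shown below. The specific ADE20K class indices placed in Q are illustrative assumptions; the patent only names person, car, and truck among the 12 foreground classes.

```python
import numpy as np

# Hypothetical indices of the 12 ADE20K foreground classes (person, car,
# truck, ...); the exact set Q used in the patent is not listed in full.
Q = {12, 20, 76, 80, 83, 90, 102, 103, 116, 126, 127, 136}

def semantic_foreground_map(class_probs: np.ndarray) -> np.ndarray:
    """Collapse per-class probabilities (150, H, W) from HRnet into one
    foreground map by summing the probabilities of the classes in Q."""
    fg_idx = sorted(Q)
    return class_probs[fg_idx].sum(axis=0)   # (H, W), values in [0, 1]
```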
Step 1.4: construct the t-th multi-information video sequence C_t = {S_t, S'_t, S''_t, F_t, F'_t, F''_t};
Step 2: construct a multi-scale expansion convolution encoding-decoding network comprising a multi-scale expansion convolution encoding module, a feature enhancement and detail processing module, and a feature decoding module;
Step 2.1: as shown in the left part of fig. 2, construct the multi-scale expansion convolution encoding module, which is formed by N encoding blocks connected in series; each encoding block consists of one expansion convolution block and one convolution block. The expansion convolution block contains a parallel convolution layers with different dilation rates; the outputs of the a convolution layers are summed, processed by a Dropout layer, and then fed into the subsequent convolution block. The convolution block is a convolution layer, a BN layer, and a ReLU layer connected in series;
The t-th multi-information video sequence C_t is fed into the multi-scale expansion convolution encoding module to obtain the t-th multi-scale encoded feature sequence O_t = {O_t^1, O_t^2, ..., O_t^n, ..., O_t^N} output by the N encoding blocks, where O_t^n denotes the multi-scale encoded feature output by the n-th encoding block;
In this embodiment, N = 3. The kernel sizes of the four parallel convolution layers in the expansion convolution block are all 3×3, their dilation rates are 1, 3, 5, and 7, and the corresponding effective receptive fields are 3×3, 7×7, 11×11, and 15×15, respectively, which alleviates information loss during sub-sampling. Expansion (dilated) convolution inserts zeros between the parameters of a standard convolution kernel to enlarge the kernel and increase the receptive field without increasing the number of network parameters. The convolution block uses a convolution layer with a 1×1 kernel and stride s = 2.
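A minimal PyTorch sketch of one encoding block as described above: a = 4 parallel 3×3 convolutions with dilation rates 1, 3, 5, and 7 whose outputs are summed, passed through Dropout, and then through a 1×1, stride-2 convolution + BN + ReLU block. The channel counts and the dropout rate are illustrative assumptions, not values stated in the patent.

```python
import torch
import torch.nn as nn

class EncodingBlock(nn.Module):
    """One encoder block: parallel dilated convs -> sum -> Dropout -> conv block."""
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 3, 5, 7), p_drop=0.1):
        super().__init__()
        # Parallel 3x3 convolutions; padding = dilation keeps the spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations
        )
        self.dropout = nn.Dropout2d(p_drop)
        # Convolution block: 1x1 conv with stride 2, BN, ReLU (downsamples by 2).
        self.conv_block = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 1, stride=2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = sum(branch(x) for branch in self.branches)  # element-wise sum of branches
        return self.conv_block(self.dropout(y))
```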
Step 2.2: as shown in the middle part of fig. 2, the constructed feature enhancement and detail processing module is formed by N feature processing blocks connected in parallel; each feature processing block consists of a parallel attention module and a detail processing module connected in parallel; the parallel attention module consists of a position attention unit and a channel attention unit connected in parallel, and the detail processing module consists of an edge detail branch and a context information branch connected in parallel;
step 2.2.1, processing of the position attention unit:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th position attention unit processes O_t^n with L adaptive pooling layers of pooling scales l_1, l_2, ..., l_r, ..., l_L and with a convolution layer, obtaining L pooled features R_t^1, R_t^2, ..., R_t^r, ..., R_t^L and a dimension-reduced feature F_0, where l_r denotes the pooling scale of the r-th adaptive pooling layer and R_t^r denotes the pooled feature of O_t^n obtained through the r-th adaptive pooling layer. R_t^1, R_t^2, ..., R_t^r, ..., R_t^L are then concatenated to obtain the n-th aggregation center W_t^n, where C denotes the number of channels of the n-th aggregation center and J denotes the sum of the pooling scales of the L adaptive pooling layers, J = l_1 + l_2 + ... + l_r + ... + l_L.
The n-th aggregation center W_t^n is fed into the fully connected layer of the n-th position attention unit, the result is multiplied with the dimension-reduced feature F_0, and the product is processed by a Softmax layer to obtain the n-th position attention map S_t^n; S_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th position fusion feature E_t^n output by the n-th parallel attention module.
In this embodiment, L is set to 4, and l_1, l_2, l_3, l_4 are set to 1, 2, 3, and 6, respectively; the convolution layer kernel size is 1×1 with stride 1. The multi-scale pooling operation captures multi-scale aggregation centers with different contexts, and the spatial perception of each spatial pixel is enhanced by weighting its relation to each center of the multi-scale aggregation center.
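Below is a minimal PyTorch reading of this position attention unit: pooling scales 1, 2, 3, and 6 build the aggregation centre, a fully connected layer projects it, the centres are matched against the dimension-reduced feature F0 via a Softmax, and the response re-weights O. The reduced width `mid_ch` and the exact way the attention map S is mapped back onto O are assumptions where the patent text leaves room for interpretation; this is a sketch, not a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAttention(nn.Module):
    """Sketch of the position attention unit described in step 2.2.1."""
    def __init__(self, channels: int, mid_ch: int = 64, scales=(1, 2, 3, 6)):
        super().__init__()
        self.scales = scales
        self.reduce = nn.Conv2d(channels, mid_ch, 1)   # dimension-reduced feature F0
        self.fc = nn.Linear(channels, mid_ch)          # fully connected layer on W

    def forward(self, o):                              # o: (B, C, H, W)
        b, c, h, w = o.shape
        pooled = [F.adaptive_avg_pool2d(o, s).flatten(2) for s in self.scales]
        centers = torch.cat(pooled, dim=2)             # aggregation centre W: (B, C, J), J = 1+4+9+36
        f0 = self.reduce(o).flatten(2)                 # (B, mid_ch, H*W)
        sim = self.fc(centers.transpose(1, 2)) @ f0    # centre-to-pixel similarity, (B, J, H*W)
        attn = torch.softmax(sim, dim=1)               # normalise over the J centres
        s = (centers @ attn).view(b, c, h, w)          # position attention S, same shape as O
        return o * s                                   # position fusion feature E
```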
Step 2.2.2, processing of the channel attention unit:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th channel attention unit uses one convolution layer to reduce the dimension of O_t^n, multiplies the result with the transpose of O_t^n, and processes the product with a Softmax layer to obtain the n-th channel attention map U_t^n; U_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th channel fusion feature E'_t^n output by the n-th channel attention unit.
In this embodiment, the convolution layer kernel size is 1×1 with stride 1. Channel correlation is modeled with the spatial information of all corresponding positions, establishing a relationship between each input channel and the channel aggregation center and thereby enhancing the network's understanding of spatial information.
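A minimal sketch of this channel attention unit is given below. The reduced width `mid_ch` and the final 1×1 convolution that maps the output back to C channels (so it can later be added to the position fusion feature) are implementation assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention unit in step 2.2.2."""
    def __init__(self, channels: int, mid_ch: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid_ch, 1)     # 1x1 conv, stride 1
        self.restore = nn.Conv2d(mid_ch, channels, 1)    # assumption: back to C channels

    def forward(self, o):                                # o: (B, C, H, W)
        b, c, h, w = o.shape
        q = self.reduce(o).flatten(2)                    # reduced feature, (B, mid_ch, H*W)
        k = o.flatten(2)                                 # (B, C, H*W)
        u = torch.softmax(q @ k.transpose(1, 2), dim=-1) # channel attention map U, (B, mid_ch, C)
        out = (u @ k).view(b, -1, h, w)                  # U applied to O, (B, mid_ch, H, W)
        return self.restore(out)                         # channel fusion feature E'
```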
Step 2.2.3: E_t^n and E'_t^n are linearly added and fed into a convolution layer to obtain the n-th position-channel fusion feature H_t^n output by the n-th parallel attention module.
In this embodiment, the convolution layer kernel size is 3×3 with stride 1. The attention mechanism strengthens the connection between channel and spatial feature information and lowers the weight of background information, suppressing the background and focusing on the foreground, which improves the feature extraction capability.
Step 2.2.4, processing by a detail processing module:
In the detail processing module corresponding to the n-th feature processing block, the n-th edge detail branch applies the Sobel operator to the n-th multi-scale encoded feature O_t^n in the horizontal and vertical directions; the two results are linearly added and fed into a convolution layer, obtaining the n-th edge feature SB_t^n.
The context information branch in the detail processing module corresponding to the n-th feature processing block feeds the n-th multi-scale encoded feature O_t^n through a convolution layer and a Softmax layer in turn; the result is then processed by a convolution layer and a LeakyReLU activation layer, obtaining the n-th context feature CB_t^n. In a specific implementation, the convolution layers used by the edge detail branch and the context information branch all have 1×1 kernels with stride 1.
Step 2.2.5: SB_t^n and CB_t^n are concatenated along the channel dimension and fed into a convolution layer to obtain the n-th detail processing feature SC_t^n output by the n-th detail processing module; the edge detail branch enhances texture, while the context information branch captures long-range dependencies to globally enhance the information. In a specific implementation, the convolution layer kernel size is 1×1 with stride 1.
Step 2.2.6 nth position-channel fusion feature H t n With the nth detail processing feature SC t n Adding to obtain an nth information enhancement feature HSC output by an nth feature processing block t n The method comprises the steps of carrying out a first treatment on the surface of the Thereby obtaining a t-th segment information enhancement feature sequence HSC output by the feature enhancement and detail processing module t ={HSC t 1 ,HSC t 2 ,...,HSC t n ,...,HSC t N -a }; capturing remote spatial dependencies using location-channel fusion features and detail processing features while preserving accurate location information to enhance feature representations of interest;
Step 2.3: as shown in the right part of fig. 2, the feature decoding module consists of N deconvolution modules and a Sigmoid layer; each deconvolution module consists of a deconvolution layer, a BN layer, a convolution layer, a BN layer, and a ReLU layer connected in series.
The N-th information-enhanced feature HSC_t^N is sent to the N-th deconvolution module; the result is linearly added to HSC_t^{N-1} and sent to the (N-1)-th deconvolution module; that result is linearly added to HSC_t^{N-2}, and so on; the final summed result is sent to the 1st deconvolution module, and its output is processed by the Sigmoid layer to obtain the pixel-wise foreground probability Ŷ_t predicted for the t-th multi-information video sequence C_t. In a specific implementation, N = 3; the deconvolution layer kernel size is 3×3 with stride 2, and the convolution layer kernel size is 3×3 with stride 1. The feature decoding module restores the input image resolution step by step.
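A minimal sketch of one deconvolution module and of the skip-added decoding path described above is given below. The channel widths of the modules and the 1-channel output head are assumptions needed to make the sketch self-contained; they are not stated in the patent.

```python
import torch
import torch.nn as nn

class DeconvModule(nn.Module):
    """One decoder stage from step 2.3: deconvolution (stride 2) + BN,
    then convolution (stride 1) + BN + ReLU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

def decode(hsc, deconv_modules, head):
    """Skip-added decoding: the deepest enhanced feature is upsampled and
    linearly added to the next shallower one, repeated down to stage 1,
    then a 1-channel head + Sigmoid gives the foreground probability."""
    x = hsc[-1]
    for n in range(len(hsc) - 1, 0, -1):
        x = deconv_modules[n](x) + hsc[n - 1]   # upsample and add HSC^{n-1}
    x = deconv_modules[0](x)
    return torch.sigmoid(head(x))               # pixel-wise foreground probability

# Hypothetical wiring for N = 3 (channel widths are assumptions):
# decs = nn.ModuleList([DeconvModule(64, 32), DeconvModule(128, 64), DeconvModule(256, 128)])
# head = nn.Conv2d(32, 1, 1)
```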
Step 3, training a network:
Step 3.1: to address the imbalance between the numbers of background and foreground pixels, the Jaccard index is adopted for the t-th loss function; Eq. (1) establishes the loss L_t between the predicted foreground probability Ŷ_t and the pixel-level label Y_t.
In Eq. (1), e is a smoothing parameter and (m, n) denotes a spatial pixel position;
Step 3.2: based on the T multi-information video sequences {C_t | t = 1, 2, ..., T}, the T losses L_t are back-propagated through the multi-scale expansion convolution encoding-decoding network with the Adam optimizer, and the network parameters are updated continuously until the loss function converges, yielding the trained multi-scale expansion convolution encoding-decoding model. In a specific implementation, the batch size is 16; the learning rate of the Adam optimizer is set to 0.0001 and its betas to 0.9 and 0.999. After 150 training epochs the loss function converges, and the optimal multi-scale expansion convolution encoding-decoding model is saved.
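The exact form of Eq. (1) is not reproduced in this text. Below is a minimal sketch of a smoothed Jaccard (IoU) loss consistent with the description (smoothing parameter e, pixel-wise sums over positions (m, n)), together with the stated Adam settings; the loss formula here is the standard smoothed Jaccard loss, assumed rather than quoted from the patent.

```python
import torch

def jaccard_loss(pred, target, e=1.0):
    """Smoothed Jaccard (IoU) loss between the predicted foreground probability
    and the pixel-level label; `e` is the smoothing parameter. Assumed form."""
    inter = (pred * target).sum(dim=(-2, -1))
    union = (pred + target - pred * target).sum(dim=(-2, -1))
    return (1.0 - (inter + e) / (union + e)).mean()

# Training settings stated in this embodiment (model/loader are placeholders):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# for epoch in range(150):
#     for frames, labels in loader:             # batch size 16
#         loss = jaccard_loss(model(frames), labels)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
```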
Step 4, processing the intermittent object moving image to be predicted by using the trained multi-scale expansion convolution coding-decoding model to obtain the pixel-by-pixel foreground probability corresponding to the intermittent object moving image;
Set a threshold P and compare the pixel-wise foreground probability of each pixel with P: pixels greater than P are set as foreground pixels and the remaining pixels as background pixels, yielding the moving object segmentation result in the intermittent-object-motion image to be predicted.
In this example, the threshold P is set to 0.5; pixels with values greater than 0.5 are set as foreground pixels and the remaining pixels as background pixels. The resulting binarized foreground segmentation maps for part of the test set are shown in fig. 3d. The overall quantitative results on the test set are shown in Table 1.
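The thresholding in step 4 reduces to a one-line operation; a small sketch (function name is illustrative):

```python
import numpy as np

def binarize(prob_map: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Pixels whose foreground probability exceeds P become foreground (1),
    the rest background (0), as described in step 4."""
    return (prob_map > p).astype(np.uint8)
```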
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
TABLE 1. Overall average Precision, Recall, and F-Measure over the different categories of the dataset

Epoch (optimal) | Precision | Recall | F-Measure
150             | 0.924     | 0.919  | 0.920
As shown in fig. 3c, using the pre-trained HRnet network to obtain the semantic information video sequences of the dataset as input strengthens the network's attention to moving objects and reduces false detections caused by shadows and the like. Comparing the binarized foreground segmentation maps in fig. 3d with the pixel-level labels in fig. 3b shows that the proposed multi-scale expansion convolution encoding-decoding network is robust and adapts well to various real-world challenges, especially small-object motion detection.

Claims (6)

1. A moving object detection method based on multi-scale expansion convolution encoding-decoding, characterized by comprising the following steps:
step 1, constructing a multi-information video sequence set;
Step 1.1: select T video sequences from video data with pixel-level labels and normalize every image frame, obtaining a normalized video sequence set S = {S_1, S_2, ..., S_t, ..., S_T}, where S_t denotes the t-th video sequence, f_i^t is the i-th image frame of the t-th video sequence S_t, and M is the number of image frames in each video sequence; the ground-truth pixel-level label of the t-th video sequence is denoted Y_t ∈ {0, 1};
Step 1.2: process the first X frames and the first Y frames of S_t, respectively, to obtain the t-th "empty" video sequence S'_t and the t-th recent video sequence S''_t, where 1 < X < Y < M;
Step 1.3: process S_t, S'_t, and S''_t with the pre-trained HRnet network to obtain the t-th original semantic information video sequence F_t, the t-th "empty" semantic information video sequence F'_t, and the t-th recent semantic information video sequence F''_t;
Step 1.4: construct the t-th multi-information video sequence C_t = {S_t, S'_t, S''_t, F_t, F'_t, F''_t};
Step 2, constructing a multi-scale expansion convolution encoding-decoding network, which comprises the following steps: the device comprises a multi-scale expansion convolutional coding module, a characteristic enhancement and detail processing module and a characteristic decoding module;
Step 2.1: construct the multi-scale expansion convolution encoding module, which is formed by N encoding blocks connected in series; each encoding block consists of one expansion convolution block and one convolution block. The expansion convolution block contains a parallel convolution layers with different dilation rates; the outputs of the a convolution layers are summed, processed by a Dropout layer, and then fed into the subsequent convolution block. The convolution block is a convolution layer, a BN layer, and a ReLU layer connected in series;
The t-th multi-information video sequence C_t is fed into the multi-scale expansion convolution encoding module to obtain the t-th multi-scale encoded feature sequence O_t = {O_t^1, O_t^2, ..., O_t^n, ..., O_t^N} output by the N encoding blocks, where O_t^n denotes the multi-scale encoded feature output by the n-th encoding block;
Step 2.2: construct the feature enhancement and detail processing module, which is formed by N feature processing blocks connected in parallel; each feature processing block consists of a parallel attention module and a detail processing module connected in parallel; the parallel attention module consists of a position attention unit and a channel attention unit connected in parallel;
Step 2.2.1: the position attention unit of the parallel attention module in the n-th feature processing block processes O_t^n to obtain the n-th position fusion feature E_t^n;
Step 2.2.2: the n-th channel attention unit processes O_t^n to obtain the n-th channel fusion feature E'_t^n;
Step 2.2.3: E_t^n and E'_t^n are linearly added and fed into a convolution layer to obtain the n-th position-channel fusion feature H_t^n output by the n-th parallel attention module;
Step 2.2.4: the detail processing module processes O_t^n to obtain the n-th detail processing feature SC_t^n;
Step 2.2.5: the n-th position-channel fusion feature H_t^n and the n-th detail processing feature SC_t^n are added to obtain the n-th information-enhanced feature HSC_t^n output by the n-th feature processing block, thereby obtaining the t-th information-enhanced feature sequence HSC_t = {HSC_t^1, HSC_t^2, ..., HSC_t^n, ..., HSC_t^N} output by the feature enhancement and detail processing module;
Step 2.3, the feature decoding module consists of N deconvolution modules and a Sigmoid layer, wherein each deconvolution module consists of a deconvolution layer, a BN layer, a convolution layer, a BN layer and a ReLU layer which are connected in series;
The N-th information-enhanced feature HSC_t^N is sent to the N-th deconvolution module; the result is linearly added to HSC_t^{N-1} and sent to the (N-1)-th deconvolution module; that result is linearly added to HSC_t^{N-2}, and so on; the final summed result is sent to the 1st deconvolution module, and its output is processed by the Sigmoid layer, thereby obtaining the pixel-wise foreground probability Ŷ_t predicted for the t-th multi-information video sequence C_t;
Step 3, training a network:
Step 3.1: establish with Eq. (1) the t-th loss L_t between the predicted foreground probability Ŷ_t and the pixel-level label Y_t;
In Eq. (1), e is a smoothing parameter and (m, n) denotes a spatial pixel position;
Step 3.2: based on the T multi-information video sequences {C_t | t = 1, 2, ..., T}, back-propagate the T losses L_t through the multi-scale expansion convolution encoding-decoding network with the Adam optimizer and continuously update the network parameters until the loss function converges, obtaining the trained multi-scale expansion convolution encoding-decoding model;
step 4, processing the intermittent object moving image to be predicted by using the trained multi-scale expansion convolution coding-decoding model to obtain the pixel-by-pixel foreground probability corresponding to the intermittent object moving image;
Set a threshold P and compare the pixel-wise foreground probability of each pixel with P: pixels greater than P are set as foreground pixels and the remaining pixels as background pixels, yielding the moving object segmentation result in the intermittent-object-motion image to be predicted.
2. The moving object detection method based on multi-scale expansion convolutional encoding-decoding according to claim 1, wherein the step 2.2.1 comprises:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th position attention unit processes O_t^n with L adaptive pooling layers of pooling scales l_1, l_2, ..., l_r, ..., l_L and with a convolution layer, obtaining L pooled features R_t^1, R_t^2, ..., R_t^r, ..., R_t^L and a dimension-reduced feature F_0, where l_r denotes the pooling scale of the r-th adaptive pooling layer and R_t^r denotes the pooled feature of O_t^n obtained through the r-th adaptive pooling layer. R_t^1, R_t^2, ..., R_t^r, ..., R_t^L are then concatenated to obtain the n-th aggregation center W_t^n, where C denotes the number of channels of the n-th aggregation center and J denotes the sum of the pooling scales of the L adaptive pooling layers, J = l_1 + l_2 + ... + l_r + ... + l_L.
The n-th aggregation center W_t^n is fed into the fully connected layer of the n-th position attention unit, the result is multiplied with the dimension-reduced feature F_0, and the product is processed by a Softmax layer to obtain the n-th position attention map S_t^n; S_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th position fusion feature E_t^n output by the n-th parallel attention module.
3. The moving object detection method based on multi-scale expansion convolutional encoding-decoding according to claim 2, wherein the step 2.2.2 comprises:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th channel attention unit uses one convolution layer to reduce the dimension of O_t^n, multiplies the result with the transpose of O_t^n, and processes the product with a Softmax layer to obtain the n-th channel attention map U_t^n; U_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th channel fusion feature E'_t^n output by the n-th channel attention unit.
4. The moving object detection method based on multi-scale expansion convolutional encoding-decoding according to claim 3, wherein the step 2.2.4 comprises:
The detail processing module is formed by an edge detail branch and a context information branch connected in parallel. In the detail processing module corresponding to the n-th feature processing block, the n-th edge detail branch applies the Sobel operator to the n-th multi-scale encoded feature O_t^n in the horizontal and vertical directions; the two results are linearly added and fed into a convolution layer, obtaining the n-th edge feature SB_t^n.
The context information branch in the detail processing module corresponding to the n-th feature processing block feeds the n-th multi-scale encoded feature O_t^n through a convolution layer and a Softmax layer in turn; the result is then processed by a convolution layer and a LeakyReLU activation layer, obtaining the n-th context feature CB_t^n.
SB_t^n and CB_t^n are concatenated along the channel dimension and fed into a convolution layer to obtain the n-th detail processing feature SC_t^n output by the n-th detail processing module.
5. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor to execute the moving object detection method according to any one of claims 1 to 4, the processor being configured to execute the program stored in the memory.
6. A computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, performs the steps of the moving object detection method according to any of claims 1-4.
CN202311179071.8A 2023-09-13 2023-09-13 Moving object detection method based on multi-scale expansion convolution encoding-decoding Pending CN117197183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311179071.8A CN117197183A (en) 2023-09-13 2023-09-13 Moving object detection method based on multi-scale expansion convolution encoding-decoding


Publications (1)

Publication Number Publication Date
CN117197183A true CN117197183A (en) 2023-12-08

Family

ID=88988359


Country Status (1)

Country Link
CN (1) CN117197183A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination