CN117197183A - Moving object detection method based on multi-scale expansion convolution encoding-decoding

Moving object detection method based on multi-scale expansion convolution encoding-decoding

Info

Publication number
CN117197183A
Authority
CN
China
Prior art keywords
nth
processing
feature
layer
module
Prior art date
Legal status
Pending
Application number
CN202311179071.8A
Other languages
Chinese (zh)
Inventor
杨依忠
夏婷婷
张景润
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202311179071.8A priority Critical patent/CN117197183A/en
Publication of CN117197183A publication Critical patent/CN117197183A/en


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a moving object detection method based on multi-scale expansion convolution encoding-decoding, which comprises the following steps: 1. constructing a multi-information video sequence set; 2. constructing an encoding-decoding model with multi-scale expansion convolution; 3. training the network model; 4. testing the network model. The invention addresses the problems that detail features in a traditional encoder-decoder cannot be propagated to deeper layers and that only simple scenes can be detected, so that foreground objects in real complex scenes, and in particular small objects affected by background information, can be extracted quickly and accurately, improving foreground detection capability.

Description

Moving object detection method based on multi-scale expansion convolution encoding-decoding
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a moving object detection method based on multi-scale expansion convolution encoding-decoding.
Background
Moving object detection is a hot topic in computer vision and plays an important role in fields such as human motion analysis, anomaly detection, human-computer interaction, and robot navigation. Detecting locally changing foreground objects in a video scene remains challenging because of illumination changes, shadows, dynamic backgrounds, and similar effects. One of the most widely used approaches to extracting foreground objects is background subtraction.
Traditional algorithms are unsupervised and rely on modeling a background model to distinguish moving objects. They are easily disturbed by scene factors, can only handle scenes with simple backgrounds, show strong scene dependence, and do not yet meet the standard required for detecting unseen videos unrelated to the training videos.
Recently, deep-learning-based algorithms have shown excellent scene learning ability, with detection accuracy far exceeding that of conventional moving object detection methods. Among them, the encoder-decoder structure is used most often, but it is prone to information loss during sub-sampling and to degraded detection of small objects because of its limited receptive field.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a moving object detection method based on multi-scale expansion convolution encoding-decoding, so as to solve the problems that detail features in a traditional encoder-decoder cannot be propagated to deeper layers and that only simple scenes can be detected, thereby extracting foreground objects in real complex scenes quickly and accurately, in particular small objects affected by background information, and further improving foreground detection capability.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The invention discloses a moving object detection method based on multi-scale expansion convolution encoding-decoding, which is characterized by comprising the following steps:
step 1, constructing a multi-information video sequence set;
Step 1.1: select T video sequences from video data with pixel-level labels and normalize every image frame, obtaining a normalized video sequence set S = {S_1, S_2, ..., S_t, ..., S_T}, where S_t denotes the t-th video sequence, f_i^t is the i-th image frame of the t-th video sequence S_t, and M is the number of image frames in each video sequence; the ground-truth pixel-level label of the t-th video sequence is denoted Y_t ∈ {0, 1};
Step 1.2: process the first X frames and the first Y frames of S_t, respectively, to obtain the t-th "empty" video sequence S'_t and the t-th recent video sequence S''_t, where 1 < X < Y < M;
Step 1.3: process S_t, S'_t, and S''_t with the pre-trained HRnet network to obtain the t-th original semantic information video sequence F_t, the t-th "empty" semantic information video sequence F'_t, and the t-th recent semantic information video sequence F''_t;
Step 1.4: construct the t-th multi-information video sequence C_t = {S_t, S'_t, S''_t, F_t, F'_t, F''_t};
Step 2: construct a multi-scale expansion convolution encoding-decoding network comprising a multi-scale expansion convolution encoding module, a feature enhancement and detail processing module, and a feature decoding module;
Step 2.1: construct the multi-scale expansion convolution encoding module, which is formed by N encoding blocks connected in series; each encoding block consists of one expansion convolution block and one convolution block. The expansion convolution block contains a parallel convolution layers with different dilation rates; the outputs of the a convolution layers are summed, processed by a Dropout layer, and then fed into the subsequent convolution block. The convolution block is a convolution layer, a BN layer, and a ReLU layer connected in series;
The t-th multi-information video sequence C_t is fed into the multi-scale expansion convolution encoding module to obtain the t-th multi-scale encoded feature sequence O_t = {O_t^1, O_t^2, ..., O_t^n, ..., O_t^N} output by the N encoding blocks, where O_t^n denotes the multi-scale encoded feature output by the n-th encoding block;
Step 2.2: construct the feature enhancement and detail processing module, which is formed by N feature processing blocks connected in parallel; each feature processing block consists of a parallel attention module and a detail processing module connected in parallel; the parallel attention module consists of a position attention unit and a channel attention unit connected in parallel;
Step 2.2.1: the position attention unit of the parallel attention module in the n-th feature processing block processes O_t^n to obtain the n-th position fusion feature E_t^n;
Step 2.2.2: the n-th channel attention unit processes O_t^n to obtain the n-th channel fusion feature E'_t^n;
Step 2.2.3: E_t^n and E'_t^n are linearly added and fed into a convolution layer to obtain the n-th position-channel fusion feature H_t^n output by the n-th parallel attention module;
Step 2.2.4: the detail processing module processes O_t^n to obtain the n-th detail processing feature SC_t^n;
Step 2.2.5: the n-th position-channel fusion feature H_t^n and the n-th detail processing feature SC_t^n are added to obtain the n-th information-enhanced feature HSC_t^n output by the n-th feature processing block, thereby obtaining the t-th information-enhanced feature sequence HSC_t = {HSC_t^1, HSC_t^2, ..., HSC_t^n, ..., HSC_t^N} output by the feature enhancement and detail processing module;
Step 2.3, the feature decoding module consists of N deconvolution modules and a Sigmoid layer, wherein each deconvolution module consists of a deconvolution layer, a BN layer, a convolution layer, a BN layer and a ReLU layer which are connected in series;
The N-th information-enhanced feature HSC_t^N is sent to the N-th deconvolution module; the result is linearly added to HSC_t^{N-1} and sent to the (N-1)-th deconvolution module; that result is linearly added to HSC_t^{N-2}, and so on; the final summed result is sent to the 1st deconvolution module, and its output is processed by the Sigmoid layer, thereby obtaining the pixel-wise foreground probability Ŷ_t predicted for the t-th multi-information video sequence C_t;
Step 3, training a network:
Step 3.1: establish with Eq. (1) the t-th loss L_t between the predicted foreground probability Ŷ_t and the pixel-level label Y_t;
In Eq. (1), e is a smoothing parameter and (m, n) denotes a spatial pixel position;
Step 3.2: based on the T multi-information video sequences {C_t | t = 1, 2, ..., T}, back-propagate the T losses L_t through the multi-scale expansion convolution encoding-decoding network with the Adam optimizer and continuously update the network parameters until the loss function converges, obtaining the trained multi-scale expansion convolution encoding-decoding model;
step 4, processing the intermittent object moving image to be predicted by using the trained multi-scale expansion convolution coding-decoding model to obtain the pixel-by-pixel foreground probability corresponding to the intermittent object moving image;
Set a threshold P and compare the pixel-wise foreground probability of each pixel with P: pixels greater than P are set as foreground pixels and the remaining pixels as background pixels, yielding the moving object segmentation result in the intermittent-object-motion image to be predicted.
The moving object detection method based on multi-scale expansion convolution encoding-decoding of the present invention is also characterized in that the step 2.2.1 includes:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th position attention unit processes O_t^n with L adaptive pooling layers of pooling scales l_1, l_2, ..., l_r, ..., l_L and with a convolution layer, obtaining L pooled features R_t^1, R_t^2, ..., R_t^r, ..., R_t^L and a dimension-reduced feature F_0, where l_r denotes the pooling scale of the r-th adaptive pooling layer and R_t^r denotes the pooled feature of O_t^n obtained through the r-th adaptive pooling layer. R_t^1, R_t^2, ..., R_t^r, ..., R_t^L are then concatenated to obtain the n-th aggregation center W_t^n, where C denotes the number of channels of the n-th aggregation center and J denotes the sum of the pooling scales of the L adaptive pooling layers, J = l_1 + l_2 + ... + l_r + ... + l_L.
The n-th aggregation center W_t^n is fed into the fully connected layer of the n-th position attention unit, the result is multiplied with the dimension-reduced feature F_0, and the product is processed by a Softmax layer to obtain the n-th position attention map S_t^n; S_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th position fusion feature E_t^n output by the n-th parallel attention module.
The step 2.2.2 comprises:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th channel attention unit uses one convolution layer to reduce the dimension of O_t^n, multiplies the result with the transpose of O_t^n, and processes the product with a Softmax layer to obtain the n-th channel attention map U_t^n; U_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th channel fusion feature E'_t^n output by the n-th channel attention unit.
The step 2.2.4 includes:
The detail processing module is formed by an edge detail branch and a context information branch connected in parallel. In the detail processing module corresponding to the n-th feature processing block, the n-th edge detail branch applies the Sobel operator to the n-th multi-scale encoded feature O_t^n in the horizontal and vertical directions; the two results are linearly added and fed into a convolution layer, obtaining the n-th edge feature SB_t^n.
The context information branch in the detail processing module corresponding to the n-th feature processing block feeds the n-th multi-scale encoded feature O_t^n through a convolution layer and a Softmax layer in turn; the result is then processed by a convolution layer and a LeakyReLU activation layer, obtaining the n-th context feature CB_t^n.
SB_t^n and CB_t^n are concatenated along the channel dimension and fed into a convolution layer to obtain the n-th detail processing feature SC_t^n output by the n-th detail processing module.
The electronic device of the present invention includes a memory and a processor, wherein the memory is configured to store a program for supporting the processor to execute the moving object detection method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the moving object detection method.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention replaces conventional convolution with expansion (dilated) convolution in the encoder convolution layers, alleviating information loss during sub-sampling and enlarging the receptive field of the feature maps; by fusing information from different scales, the model learns more comprehensive features and improves detection performance.
2. The invention designs a set of multi-branch hybrid expansion convolution blocks with several different dilation rates, which avoids the gridding effect of expansion convolution, preserves the continuity and correlation of information, and mitigates the degraded detection of small objects caused by the limited receptive field of deep learning networks.
3. The novel background subtraction model provided by the invention can be easily deployed in unseen scenes, overcoming the limitations that traditional algorithms handle only simple scenes and that existing deep learning models do not generalize to unseen scenes; it copes with the diversity and variability of real application scenarios and brings a notable improvement to background subtraction.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a diagram of a network architecture based on multi-scale expansion convolutional encoding-decoding in accordance with the present invention;
FIG. 3a is a partial frame image of a partial video sequence in a test set used in accordance with the present invention;
FIG. 3b shows the ground-truth pixel-level labels corresponding to partial frame images of video sequences in the test set used by the present invention;
FIG. 3c is a diagram of semantic information corresponding to partial frame images in a partial video sequence in a test set using a pre-trained HRnet network according to the present invention;
fig. 3d shows a binarized foreground segmentation map obtained using a multi-scale dilation convolutional encoding-decoding network in accordance with the present invention.
Detailed Description
In this embodiment, a moving object detection method based on multi-scale expansion convolution encoding-decoding first constructs a multi-information video sequence set as input with a pre-trained HRnet network, then fuses multi-scale context information with the multi-scale expansion convolution encoding module and the feature enhancement and detail processing module to capture foreground information, and finally obtains the foreground segmentation map of moving objects through the feature decoding module. As shown in fig. 1, the specific steps are as follows:
step 1, constructing a multi-information video sequence set;
Step 1.1: select T video sequences from video data with pixel-level labels and normalize every image frame, obtaining a normalized video sequence set S = {S_1, S_2, ..., S_t, ..., S_T}, where S_t denotes the t-th video sequence, f_i^t is the i-th image frame of the t-th video sequence S_t, and M is the number of image frames in each video sequence; the ground-truth pixel-level label of the t-th video sequence is denoted Y_t ∈ {0, 1};
In this embodiment, the number of video sequences T is set to 49 and the number of image frames M in each video sequence to 1000, but the values are not limited to these. The training and test sets are made from the CDnet-2014 dataset, which contains 10 categories and 49 video scenes with multiple challenges, including dynamic background, shadows, bad weather, low frame rate, intermittent object motion, turbulence, etc.; one video is extracted from each category to create a test set containing only unseen videos, while the remaining videos are used for training.
Step 1.2: process the first X frames and the first Y frames of S_t, respectively, to obtain the t-th "empty" video sequence S'_t and the t-th recent video sequence S''_t, where 1 < X < Y < M. In this specific embodiment, X = 50 and Y = 100, but the values are not limited to these.
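The patent text does not spell out how the first X and first Y frames are turned into the "empty" and recent reference sequences; a common choice in background subtraction pipelines is a temporal statistic such as the median. The sketch below is a minimal illustration under that assumption; the function name and the use of the median are hypothetical, not taken from the patent.

```python
import numpy as np

def reference_frame(frames: np.ndarray, k: int) -> np.ndarray:
    """Temporal median of the first k frames of an (M, H, W, 3) sequence.

    Assumption: the patent's 'empty'/recent sequences are derived from a
    temporal statistic over the first X / first Y frames; the median is one
    plausible choice, not necessarily the inventors' exact procedure.
    """
    return np.median(frames[:k], axis=0)

# Hypothetical usage with X = 50 and Y = 100 as in this embodiment:
# frames    = load_sequence(...)              # shape (1000, H, W, 3), normalized
# empty_bg  = reference_frame(frames, 50)     # reference for S'_t
# recent_bg = reference_frame(frames, 100)    # reference for S''_t
```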
Step 1.3: process S_t, S'_t, and S''_t with the pre-trained HRnet network to obtain the t-th original semantic information video sequence F_t, the t-th "empty" semantic information video sequence F'_t, and the t-th recent semantic information video sequence F''_t;
In this embodiment, the HRnet network is pre-trained on the ADE20K dataset, whose classes are c ∈ {c_0, c_1, ..., c_149}; 12 classes in the dataset, such as person, car, and truck, are used as foreground and the rest as background. Since the HRnet network provides the pixel-wise probability p_c of class c, let I[m, n] be the input frame at spatial position (m, n); the semantic information at [m, n] is obtained by summing p_c[m, n] over the foreground classes c ∈ Q, yielding the respective semantic information video sequences, where Q denotes the set of the 12 foreground classes.
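A minimal sketch of collapsing the HRnet class probabilities into a single semantic foreground map by summing over the foreground classes is shown below. The specific ADE20K class indices placed in Q are illustrative assumptions; the patent only names person, car, and truck among the 12 foreground classes.

```python
import numpy as np

# Hypothetical indices of the 12 ADE20K foreground classes (person, car,
# truck, ...); the exact set Q used in the patent is not listed in full.
Q = {12, 20, 76, 80, 83, 90, 102, 103, 116, 126, 127, 136}

def semantic_foreground_map(class_probs: np.ndarray) -> np.ndarray:
    """Collapse per-class probabilities (150, H, W) from HRnet into one
    foreground map by summing the probabilities of the classes in Q."""
    fg_idx = sorted(Q)
    return class_probs[fg_idx].sum(axis=0)   # (H, W), values in [0, 1]
```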
Step 1.4: construct the t-th multi-information video sequence C_t = {S_t, S'_t, S''_t, F_t, F'_t, F''_t};
Step 2: construct a multi-scale expansion convolution encoding-decoding network comprising a multi-scale expansion convolution encoding module, a feature enhancement and detail processing module, and a feature decoding module;
Step 2.1: as shown in the left part of fig. 2, construct the multi-scale expansion convolution encoding module, which is formed by N encoding blocks connected in series; each encoding block consists of one expansion convolution block and one convolution block. The expansion convolution block contains a parallel convolution layers with different dilation rates; the outputs of the a convolution layers are summed, processed by a Dropout layer, and then fed into the subsequent convolution block. The convolution block is a convolution layer, a BN layer, and a ReLU layer connected in series;
The t-th multi-information video sequence C_t is fed into the multi-scale expansion convolution encoding module to obtain the t-th multi-scale encoded feature sequence O_t = {O_t^1, O_t^2, ..., O_t^n, ..., O_t^N} output by the N encoding blocks, where O_t^n denotes the multi-scale encoded feature output by the n-th encoding block;
In this embodiment, N = 3. The kernel sizes of the four parallel convolution layers in the expansion convolution block are all 3×3, their dilation rates are 1, 3, 5, and 7, and the corresponding effective receptive fields are 3×3, 7×7, 11×11, and 15×15, respectively, which alleviates information loss during sub-sampling. Expansion (dilated) convolution inserts zeros between the parameters of a standard convolution kernel to enlarge the kernel and increase the receptive field without increasing the number of network parameters. The convolution block uses a convolution layer with a 1×1 kernel and stride s = 2.
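A minimal PyTorch sketch of one encoding block as described above: a = 4 parallel 3×3 convolutions with dilation rates 1, 3, 5, and 7 whose outputs are summed, passed through Dropout, and then through a 1×1, stride-2 convolution + BN + ReLU block. The channel counts and the dropout rate are illustrative assumptions, not values stated in the patent.

```python
import torch
import torch.nn as nn

class EncodingBlock(nn.Module):
    """One encoder block: parallel dilated convs -> sum -> Dropout -> conv block."""
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 3, 5, 7), p_drop=0.1):
        super().__init__()
        # Parallel 3x3 convolutions; padding = dilation keeps the spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations
        )
        self.dropout = nn.Dropout2d(p_drop)
        # Convolution block: 1x1 conv with stride 2, BN, ReLU (downsamples by 2).
        self.conv_block = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 1, stride=2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = sum(branch(x) for branch in self.branches)  # element-wise sum of branches
        return self.conv_block(self.dropout(y))
```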
Step 2.2: as shown in the middle part of fig. 2, the constructed feature enhancement and detail processing module is formed by N feature processing blocks connected in parallel; each feature processing block consists of a parallel attention module and a detail processing module connected in parallel; the parallel attention module consists of a position attention unit and a channel attention unit connected in parallel, and the detail processing module consists of an edge detail branch and a context information branch connected in parallel;
step 2.2.1, processing of the position attention unit:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th position attention unit processes O_t^n with L adaptive pooling layers of pooling scales l_1, l_2, ..., l_r, ..., l_L and with a convolution layer, obtaining L pooled features R_t^1, R_t^2, ..., R_t^r, ..., R_t^L and a dimension-reduced feature F_0, where l_r denotes the pooling scale of the r-th adaptive pooling layer and R_t^r denotes the pooled feature of O_t^n obtained through the r-th adaptive pooling layer. R_t^1, R_t^2, ..., R_t^r, ..., R_t^L are then concatenated to obtain the n-th aggregation center W_t^n, where C denotes the number of channels of the n-th aggregation center and J denotes the sum of the pooling scales of the L adaptive pooling layers, J = l_1 + l_2 + ... + l_r + ... + l_L.
The n-th aggregation center W_t^n is fed into the fully connected layer of the n-th position attention unit, the result is multiplied with the dimension-reduced feature F_0, and the product is processed by a Softmax layer to obtain the n-th position attention map S_t^n; S_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th position fusion feature E_t^n output by the n-th parallel attention module.
In this embodiment, L is set to 4, and l_1, l_2, l_3, l_4 are set to 1, 2, 3, and 6, respectively; the convolution layer kernel size is 1×1 with stride 1. The multi-scale pooling operation captures multi-scale aggregation centers with different contexts, and the spatial perception of each spatial pixel is enhanced by weighting its relation to each center of the multi-scale aggregation center.
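Below is a minimal PyTorch reading of this position attention unit: pooling scales 1, 2, 3, and 6 build the aggregation centre, a fully connected layer projects it, the centres are matched against the dimension-reduced feature F0 via a Softmax, and the response re-weights O. The reduced width `mid_ch` and the exact way the attention map S is mapped back onto O are assumptions where the patent text leaves room for interpretation; this is a sketch, not a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAttention(nn.Module):
    """Sketch of the position attention unit described in step 2.2.1."""
    def __init__(self, channels: int, mid_ch: int = 64, scales=(1, 2, 3, 6)):
        super().__init__()
        self.scales = scales
        self.reduce = nn.Conv2d(channels, mid_ch, 1)   # dimension-reduced feature F0
        self.fc = nn.Linear(channels, mid_ch)          # fully connected layer on W

    def forward(self, o):                              # o: (B, C, H, W)
        b, c, h, w = o.shape
        pooled = [F.adaptive_avg_pool2d(o, s).flatten(2) for s in self.scales]
        centers = torch.cat(pooled, dim=2)             # aggregation centre W: (B, C, J), J = 1+4+9+36
        f0 = self.reduce(o).flatten(2)                 # (B, mid_ch, H*W)
        sim = self.fc(centers.transpose(1, 2)) @ f0    # centre-to-pixel similarity, (B, J, H*W)
        attn = torch.softmax(sim, dim=1)               # normalise over the J centres
        s = (centers @ attn).view(b, c, h, w)          # position attention S, same shape as O
        return o * s                                   # position fusion feature E
```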
Step 2.2.2, processing of the channel attention unit:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th channel attention unit uses one convolution layer to reduce the dimension of O_t^n, multiplies the result with the transpose of O_t^n, and processes the product with a Softmax layer to obtain the n-th channel attention map U_t^n; U_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th channel fusion feature E'_t^n output by the n-th channel attention unit.
In this embodiment, the convolution layer kernel size is 1×1 with stride 1. Channel correlation is modeled with the spatial information of all corresponding positions, establishing a relationship between each input channel and the channel aggregation center and thereby enhancing the network's understanding of spatial information.
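A minimal sketch of this channel attention unit is given below. The reduced width `mid_ch` and the final 1×1 convolution that maps the output back to C channels (so it can later be added to the position fusion feature) are implementation assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention unit in step 2.2.2."""
    def __init__(self, channels: int, mid_ch: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid_ch, 1)     # 1x1 conv, stride 1
        self.restore = nn.Conv2d(mid_ch, channels, 1)    # assumption: back to C channels

    def forward(self, o):                                # o: (B, C, H, W)
        b, c, h, w = o.shape
        q = self.reduce(o).flatten(2)                    # reduced feature, (B, mid_ch, H*W)
        k = o.flatten(2)                                 # (B, C, H*W)
        u = torch.softmax(q @ k.transpose(1, 2), dim=-1) # channel attention map U, (B, mid_ch, C)
        out = (u @ k).view(b, -1, h, w)                  # U applied to O, (B, mid_ch, H, W)
        return self.restore(out)                         # channel fusion feature E'
```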
Step 2.2.3: E_t^n and E'_t^n are linearly added and fed into a convolution layer to obtain the n-th position-channel fusion feature H_t^n output by the n-th parallel attention module.
In this embodiment, the convolution layer kernel size is 3×3 with stride 1. The attention mechanism strengthens the connection between channel and spatial feature information and lowers the weight of background information, suppressing the background and focusing on the foreground, which improves the feature extraction capability.
Step 2.2.4, processing by a detail processing module:
In the detail processing module corresponding to the n-th feature processing block, the n-th edge detail branch applies the Sobel operator to the n-th multi-scale encoded feature O_t^n in the horizontal and vertical directions; the two results are linearly added and fed into a convolution layer, obtaining the n-th edge feature SB_t^n.
The context information branch in the detail processing module corresponding to the n-th feature processing block feeds the n-th multi-scale encoded feature O_t^n through a convolution layer and a Softmax layer in turn; the result is then processed by a convolution layer and a LeakyReLU activation layer, obtaining the n-th context feature CB_t^n. In a specific implementation, the convolution layers used by the edge detail branch and the context information branch all have 1×1 kernels with stride 1.
Step 2.2.5: SB_t^n and CB_t^n are concatenated along the channel dimension and fed into a convolution layer to obtain the n-th detail processing feature SC_t^n output by the n-th detail processing module; the edge detail branch enhances texture, while the context information branch captures long-range dependencies to globally enhance the information. In a specific implementation, the convolution layer kernel size is 1×1 with stride 1.
Step 2.2.6 nth position-channel fusion feature H t n With the nth detail processing feature SC t n Adding to obtain an nth information enhancement feature HSC output by an nth feature processing block t n The method comprises the steps of carrying out a first treatment on the surface of the Thereby obtaining a t-th segment information enhancement feature sequence HSC output by the feature enhancement and detail processing module t ={HSC t 1 ,HSC t 2 ,...,HSC t n ,...,HSC t N -a }; capturing remote spatial dependencies using location-channel fusion features and detail processing features while preserving accurate location information to enhance feature representations of interest;
Step 2.3: as shown in the right part of fig. 2, the feature decoding module consists of N deconvolution modules and a Sigmoid layer; each deconvolution module consists of a deconvolution layer, a BN layer, a convolution layer, a BN layer, and a ReLU layer connected in series.
The N-th information-enhanced feature HSC_t^N is sent to the N-th deconvolution module; the result is linearly added to HSC_t^{N-1} and sent to the (N-1)-th deconvolution module; that result is linearly added to HSC_t^{N-2}, and so on; the final summed result is sent to the 1st deconvolution module, and its output is processed by the Sigmoid layer to obtain the pixel-wise foreground probability Ŷ_t predicted for the t-th multi-information video sequence C_t. In a specific implementation, N = 3; the deconvolution layer kernel size is 3×3 with stride 2, and the convolution layer kernel size is 3×3 with stride 1. The feature decoding module restores the input image resolution step by step.
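A minimal sketch of one deconvolution module and of the skip-added decoding path described above is given below. The channel widths of the modules and the 1-channel output head are assumptions needed to make the sketch self-contained; they are not stated in the patent.

```python
import torch
import torch.nn as nn

class DeconvModule(nn.Module):
    """One decoder stage from step 2.3: deconvolution (stride 2) + BN,
    then convolution (stride 1) + BN + ReLU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

def decode(hsc, deconv_modules, head):
    """Skip-added decoding: the deepest enhanced feature is upsampled and
    linearly added to the next shallower one, repeated down to stage 1,
    then a 1-channel head + Sigmoid gives the foreground probability."""
    x = hsc[-1]
    for n in range(len(hsc) - 1, 0, -1):
        x = deconv_modules[n](x) + hsc[n - 1]   # upsample and add HSC^{n-1}
    x = deconv_modules[0](x)
    return torch.sigmoid(head(x))               # pixel-wise foreground probability

# Hypothetical wiring for N = 3 (channel widths are assumptions):
# decs = nn.ModuleList([DeconvModule(64, 32), DeconvModule(128, 64), DeconvModule(256, 128)])
# head = nn.Conv2d(32, 1, 1)
```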
Step 3, training a network:
Step 3.1: to address the imbalance between the numbers of background and foreground pixels, the Jaccard index is adopted for the t-th loss function; Eq. (1) establishes the loss L_t between the predicted foreground probability Ŷ_t and the pixel-level label Y_t.
In Eq. (1), e is a smoothing parameter and (m, n) denotes a spatial pixel position;
Step 3.2: based on the T multi-information video sequences {C_t | t = 1, 2, ..., T}, the T losses L_t are back-propagated through the multi-scale expansion convolution encoding-decoding network with the Adam optimizer, and the network parameters are updated continuously until the loss function converges, yielding the trained multi-scale expansion convolution encoding-decoding model. In a specific implementation, the batch size is 16; the learning rate of the Adam optimizer is set to 0.0001 and its betas to 0.9 and 0.999. After 150 training epochs the loss function converges, and the optimal multi-scale expansion convolution encoding-decoding model is saved.
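The exact form of Eq. (1) is not reproduced in this text. Below is a minimal sketch of a smoothed Jaccard (IoU) loss consistent with the description (smoothing parameter e, pixel-wise sums over positions (m, n)), together with the stated Adam settings; the loss formula here is the standard smoothed Jaccard loss, assumed rather than quoted from the patent.

```python
import torch

def jaccard_loss(pred, target, e=1.0):
    """Smoothed Jaccard (IoU) loss between the predicted foreground probability
    and the pixel-level label; `e` is the smoothing parameter. Assumed form."""
    inter = (pred * target).sum(dim=(-2, -1))
    union = (pred + target - pred * target).sum(dim=(-2, -1))
    return (1.0 - (inter + e) / (union + e)).mean()

# Training settings stated in this embodiment (model/loader are placeholders):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# for epoch in range(150):
#     for frames, labels in loader:             # batch size 16
#         loss = jaccard_loss(model(frames), labels)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
```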
Step 4, processing the intermittent object moving image to be predicted by using the trained multi-scale expansion convolution coding-decoding model to obtain the pixel-by-pixel foreground probability corresponding to the intermittent object moving image;
Set a threshold P and compare the pixel-wise foreground probability of each pixel with P: pixels greater than P are set as foreground pixels and the remaining pixels as background pixels, yielding the moving object segmentation result in the intermittent-object-motion image to be predicted.
In this example, the threshold P is set to 0.5; pixels with values greater than 0.5 are set as foreground pixels and the remaining pixels as background pixels. The resulting binarized foreground segmentation maps for part of the test set are shown in fig. 3d. The overall quantitative results on the test set are shown in Table 1.
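The thresholding in step 4 reduces to a one-line operation; a small sketch (function name is illustrative):

```python
import numpy as np

def binarize(prob_map: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Pixels whose foreground probability exceeds P become foreground (1),
    the rest background (0), as described in step 4."""
    return (prob_map > p).astype(np.uint8)
```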
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
TABLE 1. Overall average Precision, Recall, and F-Measure over the different categories of the dataset

Epoch (optimal) | Precision | Recall | F-Measure
150             | 0.924     | 0.919  | 0.920
As shown in fig. 3c, using the pre-trained HRnet network to obtain the semantic information video sequences of the dataset as input strengthens the network's attention to moving objects and reduces false detections caused by shadows and the like. Comparing the binarized foreground segmentation maps in fig. 3d with the pixel-level labels in fig. 3b shows that the proposed multi-scale expansion convolution encoding-decoding network is robust and adapts well to various real-world challenges, especially small-object motion detection.

Claims (6)

1. A moving object detection method based on multi-scale expansion convolution encoding-decoding, characterized by comprising the following steps:
step 1, constructing a multi-information video sequence set;
Step 1.1: select T video sequences from video data with pixel-level labels and normalize every image frame, obtaining a normalized video sequence set S = {S_1, S_2, ..., S_t, ..., S_T}, where S_t denotes the t-th video sequence, f_i^t is the i-th image frame of the t-th video sequence S_t, and M is the number of image frames in each video sequence; the ground-truth pixel-level label of the t-th video sequence is denoted Y_t ∈ {0, 1};
Step 1.2: process the first X frames and the first Y frames of S_t, respectively, to obtain the t-th "empty" video sequence S'_t and the t-th recent video sequence S''_t, where 1 < X < Y < M;
Step 1.3: process S_t, S'_t, and S''_t with the pre-trained HRnet network to obtain the t-th original semantic information video sequence F_t, the t-th "empty" semantic information video sequence F'_t, and the t-th recent semantic information video sequence F''_t;
Step 1.4: construct the t-th multi-information video sequence C_t = {S_t, S'_t, S''_t, F_t, F'_t, F''_t};
Step 2, constructing a multi-scale expansion convolution encoding-decoding network, which comprises the following steps: the device comprises a multi-scale expansion convolutional coding module, a characteristic enhancement and detail processing module and a characteristic decoding module;
Step 2.1: construct the multi-scale expansion convolution encoding module, which is formed by N encoding blocks connected in series; each encoding block consists of one expansion convolution block and one convolution block. The expansion convolution block contains a parallel convolution layers with different dilation rates; the outputs of the a convolution layers are summed, processed by a Dropout layer, and then fed into the subsequent convolution block. The convolution block is a convolution layer, a BN layer, and a ReLU layer connected in series;
The t-th multi-information video sequence C_t is fed into the multi-scale expansion convolution encoding module to obtain the t-th multi-scale encoded feature sequence O_t = {O_t^1, O_t^2, ..., O_t^n, ..., O_t^N} output by the N encoding blocks, where O_t^n denotes the multi-scale encoded feature output by the n-th encoding block;
Step 2.2: construct the feature enhancement and detail processing module, which is formed by N feature processing blocks connected in parallel; each feature processing block consists of a parallel attention module and a detail processing module connected in parallel; the parallel attention module consists of a position attention unit and a channel attention unit connected in parallel;
Step 2.2.1: the position attention unit of the parallel attention module in the n-th feature processing block processes O_t^n to obtain the n-th position fusion feature E_t^n;
Step 2.2.2: the n-th channel attention unit processes O_t^n to obtain the n-th channel fusion feature E'_t^n;
Step 2.2.3: E_t^n and E'_t^n are linearly added and fed into a convolution layer to obtain the n-th position-channel fusion feature H_t^n output by the n-th parallel attention module;
Step 2.2.4: the detail processing module processes O_t^n to obtain the n-th detail processing feature SC_t^n;
Step 2.2.5: the n-th position-channel fusion feature H_t^n and the n-th detail processing feature SC_t^n are added to obtain the n-th information-enhanced feature HSC_t^n output by the n-th feature processing block, thereby obtaining the t-th information-enhanced feature sequence HSC_t = {HSC_t^1, HSC_t^2, ..., HSC_t^n, ..., HSC_t^N} output by the feature enhancement and detail processing module;
Step 2.3, the feature decoding module consists of N deconvolution modules and a Sigmoid layer, wherein each deconvolution module consists of a deconvolution layer, a BN layer, a convolution layer, a BN layer and a ReLU layer which are connected in series;
The N-th information-enhanced feature HSC_t^N is sent to the N-th deconvolution module; the result is linearly added to HSC_t^{N-1} and sent to the (N-1)-th deconvolution module; that result is linearly added to HSC_t^{N-2}, and so on; the final summed result is sent to the 1st deconvolution module, and its output is processed by the Sigmoid layer, thereby obtaining the pixel-wise foreground probability Ŷ_t predicted for the t-th multi-information video sequence C_t;
Step 3, training a network:
Step 3.1: establish with Eq. (1) the t-th loss L_t between the predicted foreground probability Ŷ_t and the pixel-level label Y_t;
In Eq. (1), e is a smoothing parameter and (m, n) denotes a spatial pixel position;
Step 3.2: based on the T multi-information video sequences {C_t | t = 1, 2, ..., T}, back-propagate the T losses L_t through the multi-scale expansion convolution encoding-decoding network with the Adam optimizer and continuously update the network parameters until the loss function converges, obtaining the trained multi-scale expansion convolution encoding-decoding model;
step 4, processing the intermittent object moving image to be predicted by using the trained multi-scale expansion convolution coding-decoding model to obtain the pixel-by-pixel foreground probability corresponding to the intermittent object moving image;
Set a threshold P and compare the pixel-wise foreground probability of each pixel with P: pixels greater than P are set as foreground pixels and the remaining pixels as background pixels, yielding the moving object segmentation result in the intermittent-object-motion image to be predicted.
2. The moving object detection method based on multi-scale expansion convolutional encoding-decoding according to claim 1, wherein the step 2.2.1 comprises:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th position attention unit processes O_t^n with L adaptive pooling layers of pooling scales l_1, l_2, ..., l_r, ..., l_L and with a convolution layer, obtaining L pooled features R_t^1, R_t^2, ..., R_t^r, ..., R_t^L and a dimension-reduced feature F_0, where l_r denotes the pooling scale of the r-th adaptive pooling layer and R_t^r denotes the pooled feature of O_t^n obtained through the r-th adaptive pooling layer. R_t^1, R_t^2, ..., R_t^r, ..., R_t^L are then concatenated to obtain the n-th aggregation center W_t^n, where C denotes the number of channels of the n-th aggregation center and J denotes the sum of the pooling scales of the L adaptive pooling layers, J = l_1 + l_2 + ... + l_r + ... + l_L.
The n-th aggregation center W_t^n is fed into the fully connected layer of the n-th position attention unit, the result is multiplied with the dimension-reduced feature F_0, and the product is processed by a Softmax layer to obtain the n-th position attention map S_t^n; S_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th position fusion feature E_t^n output by the n-th parallel attention module.
3. The moving object detection method based on multi-scale expansion convolutional encoding-decoding according to claim 2, wherein the step 2.2.2 comprises:
The n-th multi-scale encoded feature O_t^n is fed into the n-th parallel attention module corresponding to the n-th feature processing block. The n-th channel attention unit uses one convolution layer to reduce the dimension of O_t^n, multiplies the result with the transpose of O_t^n, and processes the product with a Softmax layer to obtain the n-th channel attention map U_t^n; U_t^n is then multiplied with the multi-scale encoded feature O_t^n to obtain the n-th channel fusion feature E'_t^n output by the n-th channel attention unit.
4. The moving object detection method based on multi-scale expansion convolutional encoding-decoding according to claim 3, wherein the step 2.2.4 comprises:
The detail processing module is formed by an edge detail branch and a context information branch connected in parallel. In the detail processing module corresponding to the n-th feature processing block, the n-th edge detail branch applies the Sobel operator to the n-th multi-scale encoded feature O_t^n in the horizontal and vertical directions; the two results are linearly added and fed into a convolution layer, obtaining the n-th edge feature SB_t^n.
The context information branch in the detail processing module corresponding to the n-th feature processing block feeds the n-th multi-scale encoded feature O_t^n through a convolution layer and a Softmax layer in turn; the result is then processed by a convolution layer and a LeakyReLU activation layer, obtaining the n-th context feature CB_t^n.
SB_t^n and CB_t^n are concatenated along the channel dimension and fed into a convolution layer to obtain the n-th detail processing feature SC_t^n output by the n-th detail processing module.
5. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor to execute the moving object detection method according to any one of claims 1 to 4, the processor being configured to execute the program stored in the memory.
6. A computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, performs the steps of the moving object detection method according to any of claims 1-4.
CN202311179071.8A 2023-09-13 2023-09-13 Moving object detection method based on multi-scale expansion convolution encoding-decoding Pending CN117197183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311179071.8A CN117197183A (en) 2023-09-13 2023-09-13 Moving object detection method based on multi-scale expansion convolution encoding-decoding


Publications (1)

Publication Number Publication Date
CN117197183A true CN117197183A (en) 2023-12-08

Family

ID=88988359


Country Status (1)

Country Link
CN (1) CN117197183A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination