CN114494934A - Unsupervised moving object detection method based on information reduction rate - Google Patents

Unsupervised moving object detection method based on information reduction rate

Info

Publication number
CN114494934A
Authority
CN
China
Prior art keywords
image
optical flow
information
video sequence
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111510928.0A
Other languages
Chinese (zh)
Inventor
李军
刘江
付孟祥
王子文
张书恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202111510928.0A priority Critical patent/CN114494934A/en
Publication of CN114494934A publication Critical patent/CN114494934A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised moving object detection method based on the information reduction rate. The method comprises the following steps: acquiring a video sequence with a camera, preprocessing it, and constructing a database; computing the optical flow images corresponding to the video sequence with a trained PWCNet and normalizing them; training a generative adversarial network model with the video sequence and its corresponding optical flow images as input; applying the same processing to the video sequence to be detected; and extracting the generator module of the trained generative adversarial network model to detect the moving objects in the video sequence to be detected. Based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, consisting of a generator and a restorer, is constructed from the relationship between their optical flows to discriminate the background from the moving objects. The feature channels of the moving object are fused through an attention mechanism, which reduces background interference and improves moving object detection performance.

Description

Unsupervised moving object detection method based on information reduction rate
Technical Field
The invention belongs to the field of deep learning in computer vision, and particularly relates to an unsupervised moving target detection method based on an information reduction rate.
Background
Object detection is an important branch of computer vision. Its main purpose is to extract moving objects from a video sequence as the foreground, while the environment around the moving objects is treated as the background and separated from them. As a cross-disciplinary subject, object detection integrates theories and algorithms from image processing, machine learning, optimization and other fields, and it is the premise and basis for completing higher-level image understanding tasks such as object behavior recognition. Object detection technology has great research and application value and is widely used in intelligent video surveillance, intelligent human-computer interaction, intelligent transportation, visual navigation, unmanned driving, autonomous flight, battlefield reconnaissance and other fields. In recent years, with the development of computer technology and deep learning, object detection models have continuously developed and evolved, and many detection models have been created.
In the field of object detection, the average overlap ratio (IoU) between the target object and the predicted result is often used as the core evaluation criterion. In recent years, research on object detection can be divided into two categories: supervised learning methods and unsupervised learning methods. The PDB algorithm is a typical supervised algorithm: it extracts spatial features at multiple scales with a pyramid dilated convolution module, concatenates them, and feeds them into an extended DB-ConvLSTM structure to learn temporal information, thereby obtaining good detection results. The greatest characteristic of unsupervised object detection algorithms is that they do not require large numbers of labeled samples, which leaves them considerable room for development. The SAGE algorithm generates a spatio-temporal saliency map to estimate background and foreground information by computing geodesic distances between superpixels and edge pixels, but it relies mainly on the edge features and motion gradient features of images and easily produces noisy regions in scenes with complex textures. The CIS algorithm borrows the idea of generative adversarial networks and distinguishes the background from the moving object through an information reduction rate defined on optical flow information, so it can detect moving objects well. However, for existing unsupervised algorithms, the performance of many methods degrades when the target is imaged under unfavorable conditions such as viewing angle, illumination, occlusion, background interference, and noise introduced by the acquisition equipment.
Disclosure of Invention
The purpose of the invention is to provide an unsupervised moving object detection method based on the information reduction rate, which makes full use of the optical flow information of the object and the background, fuses the feature channels of the moving object through an attention mechanism, reduces background interference, and improves moving object detection performance.
The technical solution for achieving the purpose of the invention is as follows: an unsupervised moving object detection method based on the information reduction rate, comprising the following steps:
Step 1: acquire a video sequence with a camera, preprocess it, and construct a database;
Step 2: compute the optical flow images corresponding to the video sequence with the trained PWCNet and normalize them;
Step 3: train a generative adversarial network model with the video sequence and its corresponding optical flow images as input;
Step 4: process the video sequence to be detected according to steps 1 to 2;
Step 5: extract the generator module of the trained generative adversarial network model and detect the moving objects in the video sequence to be detected.
Compared with the prior art, the invention has the following notable advantages: (1) based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, comprising a generator and a restorer, is constructed from the relationship between their optical flows to discriminate the background from the moving objects; an attention mechanism is introduced, which effectively improves the robustness of the tracking algorithm and reduces the interference of background noise and similar factors on target tracking; (2) the optical flow information of the object and the background is fully exploited, the feature channels of the moving object are fused through the attention mechanism, background interference is reduced, and moving object detection performance is improved.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of a basic network architecture of the present invention.
Fig. 3 is a diagram of the detection outputs of the generator module of the network model for the moving objects in part of the video sequences.
FIG. 4 is a network architecture diagram of a generator module of the network model.
Detailed Description
The invention relates to an unsupervised moving object detection method based on the information reduction rate, comprising the following steps:
Step 1: acquire a video sequence with a camera, preprocess it, and construct a database;
Step 2: compute the optical flow images corresponding to the video sequence with the trained PWCNet and normalize them;
Step 3: train a generative adversarial network model with the video sequence and its corresponding optical flow images as input;
Step 4: process the video sequence to be detected according to steps 1 to 2;
Step 5: extract the generator module of the trained generative adversarial network model and detect the moving objects in the video sequence to be detected.
Further, training the generative adversarial network model in step 3 specifically comprises the following steps:
Step 3.1, distinguishing the moving object from the background:
based on the principle that the background image region should not contain information of the moving object foreground image region, the image of the region of interest can be interpreted as poorly as possible by learning regions outside the region of interest. Specifically, for a frame of image I of a video sequence, an image region is assumed to be D, an image region of a moving object is assumed to be Ω, and a background is assumed to be ΩcD/omega, which flows to the adjacent frame (last one)Frame or next frame) is u. The optical flow represents apparent motion of an image brightness pattern, and includes important information of the surface structure and dynamic behavior of an object. Use of
Figure BDA0003405321570000034
Representing mutual information of two random variables, given optical flows u at positions I, j in image Ii、ujThe concept of the foreground Ω is formalized as an area with mutual information 0 with the background:
Figure BDA0003405321570000031
wherein the mutual information
Figure BDA0003405321570000033
Optical flow u representing a position j in a given image IjOptical flow u provided in relation to position iiThe larger the mutual information value is, the larger the provided information amount is; shannon information entropy H (u)iI) represents uiThe larger the uncertainty of the variable is, the larger the information entropy is, and the value is always larger than 0; h (u)i|ujI) is represented in the known ujUnder the conditions of (a) uiUncertainty of (d);
Step 3.2, loss function based on the information reduction rate:
According to the foreground and background defined above, and combining Shannon's information entropy theory, an information reduction rate is defined to construct the optimization objective. Taking two subsets of D, a region x and a region y, as input, with the optical flows of region x and region y being $u_x$ and $u_y$ respectively, the information reduction rate γ is defined as follows:

$$\gamma(x \mid y; I) = \frac{\mathcal{I}(u_x; u_y \mid I)}{H(u_x \mid I)} = 1 - \frac{H(u_x \mid u_y, I)}{H(u_x \mid I)}$$

where $\mathcal{I}(u_x; u_y \mid I)$ represents the amount of information that the optical flow $u_y$ of region y in the given image I can provide about the optical flow $u_x$ of region x; the Shannon entropy $H(u_x \mid I)$ represents the uncertainty of $u_x$; and $H(u_x \mid u_y, I)$ represents the uncertainty of $u_x$ given $u_y$.
γ(x|y; I) represents the relative reduction of the uncertainty of $u_x$ when $u_y$ is known, and its value lies between 0 and 1; when $u_x$ and $u_y$ are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0. Writing the optical flow in the target image region Ω as $u^{in} = \{u_i, i \in \Omega\}$ and in the background region $\Omega^c$ as $u^{out} = \{u_j, j \in \Omega^c\}$, we have:

$$\gamma(\Omega \mid \Omega^c; I) = 1 - \frac{H(u^{in} \mid u^{out}, I)}{H(u^{in} \mid I)} = 1 - \frac{\mathbb{E}\big[-\log P(u^{in} \mid u^{out}, I)\big]}{\mathbb{E}\big[-\log P(u^{in} \mid I)\big]}$$

where $P(u^{in} \mid I)$ represents the probability that the optical flow is foreground optical flow and $P(u^{in} \mid u^{out}, I)$ represents the probability of $u^{in}$ given $u^{out}$. A loss function $\mathcal{L}(\Omega; I)$ is defined from the information reduction rate; when $\mathcal{L}(\Omega; I)$ is minimal, the optical flow of the background is sufficient to predict the foreground.
The following strict assumption is made about the model:

$$H(u^{in} \mid u^{out}, I) \simeq \frac{1}{2\sigma^2}\,\mathbb{E}\big[\, \| u^{in} - \phi(\Omega, u^{out}, I) \|_2^2 \,\big] + \mathrm{const.}$$

where $\phi(\Omega, u^{out}, I) = \int u^{in}\, \mathrm{d}P(u^{in} \mid u^{out}, I)$ is the conditional mean, $\|\cdot\|_2$ denotes the vector norm, and σ² denotes the variance.
Meanwhile, a function χ is introduced to represent D, Ω and $\Omega^c$:

$$\chi: D \to \{0, 1\}, \qquad \chi(i) = \begin{cases} 1, & i \in \Omega \\ 0, & i \in \Omega^c \end{cases}$$

Therefore, the optical flow flowing into Ω is represented by $u_i^{in} = \chi\, u_i$ and the flow flowing out by $u_i^{out} = (1 - \chi)\, u_i$.
Finally, χ and φ are selected as classes of functions parameterized by convolutional neural networks, with parameters denoted w; the corresponding functions are $\chi_{w_2}$ and $\phi_{w_1}$. To simplify the expression, the constant term of the loss function $\mathcal{L}$ is omitted and the remainder is converted into the negative of the original loss, which gives the final loss function $\mathcal{L}(w_1, w_2)$:

$$\mathcal{L}(w_1, w_2) = \mathbb{E}_I\Big[\, \big\| \chi_{w_2}\, u - \phi_{w_1}\big((1 - \chi_{w_2})\,u,\ I\big) \big\|_2^2 \,\Big]$$

where $\phi_{w_1}$ is the restorer i, which minimizes the above expression, and $w_1$ is the parameter of the restorer i; $\chi_{w_2}$ is the generator g, selected so that $u_i^{out}$ does not provide information about $u_i^{in}$, i.e. so that the above expression is maximized, and $w_2$ is the parameter of the generator g; I is the image.
Finally, the optimization objective is expressed in the following form:

$$\mathcal{L}^* = \max_{w_2} \min_{w_1} \mathcal{L}(w_1, w_2)$$
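To make the objective above concrete, the following is a minimal PyTorch sketch of the loss term under stated assumptions: the generator returns a soft mask χ in [0, 1] of shape (B, 1, H, W), the restorer takes the frame, the background flow and the mask as inputs, and the expectation is replaced by a batch mean. The normalization by the mask area is an added safeguard against the degenerate all-foreground mask and is not taken from the patent text.

```python
import torch

def information_reduction_loss(generator, restorer, image, flow):
    """Sketch of L(w1, w2): the restorer i tries to reconstruct the masked-out flow
    from the background flow (minimize), while the generator g tries to make that
    reconstruction fail (maximize). image: (B, 3, H, W), flow: (B, 2, H, W)."""
    chi = generator(image, flow)                       # soft mask chi in [0, 1], shape (B, 1, H, W)
    flow_out = (1.0 - chi) * flow                      # optical flow of the background region
    flow_pred = restorer(image, flow_out, chi)         # restorer's guess of the full flow field
    err = (chi * (flow - flow_pred) ** 2).sum(dim=(1, 2, 3))   # squared error inside the mask
    area = chi.sum(dim=(1, 2, 3)).clamp(min=1e-6)      # mask area (assumed normalization)
    return (err / area).mean()
```

During training, the restorer parameters w1 are updated to minimize this value and the generator parameters w2 to maximize it; an alternating update loop is sketched in step 3.4 of the embodiment below.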
Step 3.3, constructing the generator g and the restorer i, which together form the generative adversarial network, and solving the optimization problem in step 3.2; the generator g is used to generate the optical flow mask image (mask) of the moving object; the restorer i, with the CPN as its basic network architecture, restores the optical flow information inside the mask region from the mask image generated by the generator g and the corresponding optical flow image;
Step 3.4, training the constructed generative adversarial network with the DAVIS2016 data set to obtain the final generative adversarial network model.
Further, the generator g and the restorer i in step 3.3 together form the generative adversarial network; the specific model is as follows:
1) The generator g takes as input an RGB image I_t and its corresponding optical flow u_{t:t+δT}, and outputs the mask image (mask) of the moving object, where δT is sampled uniformly at random from U[-5, 5] with δT ≠ 0, which introduces more information about how the optical flow of image I_t changes. The generator g consists of an encoder and a decoder. The encoder consists of 5 convolutional layers, each followed by a BN layer, and each convolutional layer reduces its input to 1/4 of its size; the encoder is followed by 4 dilated convolutional layers with gradually increasing dilation rates of 2, 4, 8 and 16. The decoder consists of 5 convolutional layers and generates, through upsampling, a mask image of the same size as the input image. A code sketch of this encoder-decoder is given after item 2) below.
2) The restorer i takes as input the RGB image I_t and the mask image generated by the generator g, and outputs the optical flow image outside the predicted mask region, i.e. the optical flow image of the background. The encoder of the restorer i comprises two branches with exactly the same structure and parameters, each consisting of 9 convolutional layers with LeakyReLU as the activation function after each convolutional layer. One branch takes the normalized frame image as input, and the other branch takes the optical flow image and the mask image generated by the generator as input. The features encoded by the two branches are connected by a concatenation operation (concat) and passed to the decoder, which mainly consists of deconvolution layers and LeakyReLU activation functions; a skip structure is used to fuse upsampled deep features with shallow features. The final output is an optical flow image of the same size as the input image.
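As an illustration of the generator described in item 1), the following PyTorch sketch assembles a 5-layer strided-convolution encoder with batch normalization, 4 dilated convolutions with rates 2, 4, 8 and 16, and a 5-layer upsampling decoder that outputs a single-channel soft mask. The channel widths, kernel sizes, stride-2 downsampling and sigmoid output are assumptions; the actual layer parameters are those listed in Table 1 of the embodiment.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Sketch of the generator g: strided conv encoder with BN, dilated convolutions
    with rates 2, 4, 8, 16, and an upsampling decoder producing a soft mask."""
    def __init__(self, in_ch=5, base=32):
        super().__init__()
        enc, ch = [], in_ch
        for i in range(5):                              # encoder: 5 x (conv + BN + ReLU)
            out = base * min(2 ** i, 8)
            enc += [nn.Conv2d(ch, out, 3, stride=2, padding=1),
                    nn.BatchNorm2d(out), nn.ReLU(inplace=True)]
            ch = out
        self.encoder = nn.Sequential(*enc)
        dil = []
        for r in (2, 4, 8, 16):                         # 4 dilated convs, increasing rates
            dil += [nn.Conv2d(ch, ch, 3, padding=r, dilation=r), nn.ReLU(inplace=True)]
        self.dilated = nn.Sequential(*dil)
        dec = []
        for i in range(5):                              # decoder: 5 x (upsample + conv)
            out = 1 if i == 4 else max(ch // 2, base)
            dec += [nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                    nn.Conv2d(ch, out, 3, padding=1)]
            if i < 4:
                dec += [nn.ReLU(inplace=True)]
            ch = out
        self.decoder = nn.Sequential(*dec)

    def forward(self, image, flow):
        x = torch.cat([image, flow], dim=1)             # RGB frame (3) + flow (2) channels
        return torch.sigmoid(self.decoder(self.dilated(self.encoder(x))))
```

For example, `MaskGenerator()(torch.randn(1, 3, 256, 256), torch.randn(1, 2, 256, 256))` returns a (1, 1, 256, 256) soft mask; the sketch assumes the input height and width are divisible by 32.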
Further, the encoders of the generator g and the restorer i of the generative adversarial network model of step 3 introduce a lightweight attention mechanism; the attention module includes channel attention, spatial attention and global attention.
1) Channel attention mainly comprises three operations: squeeze, excitation and recalibration. First, for an input feature map F of size h × w × c, the squeeze operation compresses the input features along the spatial dimensions to obtain a feature vector s of size 1 × 1 × c representing the global features of the channels; each element of the vector corresponds to one channel of the feature map, and in practice this is a global pooling of each feature map. Then, the excitation operation establishes the correlations among the channels: the correlations of the c channels are learned with a weight w, yielding a channel weight e of size 1 × 1 × c, generally implemented with a 1 × 1 convolution operation. Finally, the recalibration operation multiplies the channel weights with the original input feature map to obtain the weighted output feature map F'_C.
2) For the feature map F'_C, two feature matrices F_MAX and F_AVG are generated with max pooling and average pooling operations, respectively. The two feature matrices are then fused to obtain a fused feature map F_MA, which is processed by a Sigmoid activation function to obtain the spatial attention weight W; the fusion operation generally concatenates the feature matrices along the channel dimension and then applies a convolution. Finally, the spatial attention weight matrix W is multiplied with the original input feature map F to obtain the weighted output feature map F'_S.
3) The squeeze operation of the global attention is the same as that of the channel attention, while the excitation operation is replaced by 4 consecutive operations: fc(2C/16) → ReLU → fc(1) → Sigmoid, where fc(·) denotes a fully connected operation, C is the number of channels, and ReLU and Sigmoid are activation functions. The excitation produces a scale selection factor μ. From the output F'_S of the spatial attention mechanism and the scale selection factor μ, the scale-sensitive feature F'_G is computed as shown in the following formula:
F'_G = F + (μ * F'_S)
where the identity mapping term F is added in order to avoid losing important information in regions whose attention values are close to 0.
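A compact PyTorch sketch of the lightweight attention block described in items 1) to 3) follows. The reduction ratio of 16 in the channel excitation and the 7 × 7 convolution in the spatial branch are assumptions made for the sketch; the three branches are combined as F'_G = F + μ * F'_S, as stated above.

```python
import torch
import torch.nn as nn

class LightAttention(nn.Module):
    """Sketch of the lightweight attention module: channel attention (squeeze,
    excitation, recalibration), spatial attention (channel-wise max/average maps),
    and global attention producing a scalar scale selection factor mu."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # channel attention: squeeze (global average pool) + two-layer excitation
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # spatial attention: fuse the max and average maps with a convolution
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        # global attention excitation: fc(2C/16) -> ReLU -> fc(1) -> Sigmoid
        self.global_fc = nn.Sequential(
            nn.Linear(channels, 2 * channels // 16), nn.ReLU(inplace=True),
            nn.Linear(2 * channels // 16, 1), nn.Sigmoid())

    def forward(self, f):                                   # f: (B, C, H, W)
        b, c, _, _ = f.shape
        s = f.mean(dim=(2, 3))                              # squeeze -> (B, C)
        e = self.excite(s).view(b, c, 1, 1)                 # channel weight e
        f_c = f * e                                         # recalibrated feature F'_C
        f_max = f_c.max(dim=1, keepdim=True).values         # channel-wise max map F_MAX
        f_avg = f_c.mean(dim=1, keepdim=True)               # channel-wise average map F_AVG
        w = self.spatial(torch.cat([f_max, f_avg], dim=1))  # spatial weight W
        f_s = f * w                                         # weighted feature F'_S
        mu = self.global_fc(s).view(b, 1, 1, 1)             # scale selection factor mu
        return f + mu * f_s                                 # F'_G = F + mu * F'_S
```

A module like this would be inserted after the convolutional layer pairs named in Note 4 of Table 1.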
Further, in steps 4 to 5, the generator module of the trained generative adversarial network model is extracted to detect the moving objects in the video sequence to be detected; the specific steps are as follows:
First, the preprocessing of step 1 is applied to the video sequence to be detected;
then, the corresponding optical flow images are computed according to the method of step 2;
finally, the preprocessed video sequence images and the corresponding optical flow images are fed into the generator g obtained in step 3, and the output images are the prediction results for the moving objects.
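The following sketch shows this inference stage, assuming a generator with the interface sketched earlier (frame plus flow in, soft mask out); the binarization threshold of 0.5 is an assumed post-processing choice that is not specified in the patent.

```python
import torch

@torch.no_grad()
def detect_moving_objects(generator, frames, flows, threshold=0.5):
    """Run only the trained generator g on a preprocessed sequence.
    frames: list of (3, H, W) tensors; flows: list of (2, H, W) normalized flow tensors.
    Returns one binary moving-object mask per frame."""
    generator.eval()
    masks = []
    for img, flow in zip(frames, flows):
        soft = generator(img.unsqueeze(0), flow.unsqueeze(0))[0, 0]  # soft mask in [0, 1]
        masks.append((soft > threshold).float())
    return masks
```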
The present invention is further described below with reference to the accompanying drawings and an embodiment, which are included to provide a further understanding of the invention and are not intended to limit its scope; various equivalent modifications of the invention will become apparent to those skilled in the art after reading the present specification and the appended claims.
Examples
The invention provides an unsupervised moving object detection method with an attention mechanism. Based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, comprising a generator and a restorer, is constructed from the relationship between their optical flows to discriminate the background from the moving objects; an attention mechanism is introduced, which effectively improves the robustness of the tracking algorithm and reduces the interference of background noise and similar factors on target tracking. The basic idea is as follows: first, a video database is constructed and the videos are preprocessed; then, optical flow information is computed for adjacent frames of each video with PWCNet; next, the preprocessed videos and the corresponding optical flow information are used as the input of the attention-based generative adversarial network to train the network model; finally, for the video sequence to be detected, the generator module of the network model is used to obtain the moving object detection results.
As shown in Fig. 1, the implementation of the present invention mainly comprises four steps: (1) preprocessing the video sequence; (2) obtaining the optical flow images of the video sequence through PWCNet; (3) training the generative adversarial network with the video sequence and its corresponding optical flow images as input; (4) detecting the moving objects in the video sequence with the generator module of the trained network model and outputting the detection results.
Step 1: acquire a video sequence with a camera, preprocess it, and construct a database;
Because a video sequence collected in a natural scene may suffer from interference such as uneven illumination, the video sequence is preprocessed; the preprocessing mainly includes histogram equalization and normalization of the video sequence.
Step 2: obtain the optical flow images of the video sequence through PWCNet;
given the optical flow u: D of the image I to be measured for one frame down (up)1→R2Is that
Figure BDA0003405321570000071
To
Figure BDA0003405321570000072
To (3) is performed. PWCNet is a high-performance optical flow learning network, which can efficiently acquire optical flow information of video sequences. The invention adopts PWCNet to calculate optical flow information and carries out normalization, wherein the normalization operation mainly comprises the steps of adjusting the optical flow image to be the same as the video sequence, and then dividing the optical flow image by a constant, namely, reducing the value of the optical flow image in equal proportion to accelerate the training of the network.
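A minimal sketch of this flow post-processing is given below, assuming the PWCNet output is available as an (H, W, 2) float array; the scaling constant of 20.0 and the bilinear resizing are illustrative assumptions.

```python
import cv2
import numpy as np

def normalize_flow(flow, frame_size, scale=20.0):
    """Resize a PWCNet flow field to the frame size and shrink its values.
    flow: (H, W, 2) float32 array; frame_size: (height, width) of the video frames."""
    h, w = frame_size
    fh, fw = flow.shape[:2]
    resized = cv2.resize(flow, (w, h), interpolation=cv2.INTER_LINEAR)
    resized[..., 0] *= w / float(fw)     # rescale horizontal displacements to the new size
    resized[..., 1] *= h / float(fh)     # rescale vertical displacements to the new size
    return resized / scale               # divide by a constant to speed up training
```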
Step 3: train the generative adversarial network with the video sequence and its corresponding optical flow images as input;
and 3.1, distinguishing the moving object from the background. Based on the principle that the background image region should not contain information of the moving object foreground image region, the image of the region of interest can be interpreted as poorly as possible by learning regions outside the region of interest. Specifically, for a frame of image I of a video sequence, an image region is assumed to be D, an image region of a moving object is assumed to be Ω, and a background is assumed to be ΩcD/Ω, and the optical flow to an adjacent frame (the previous frame or the next frame) is u. The optical flow represents the apparent motion of the image brightness pattern, and contains important information of the surface structure and dynamic behavior of the object. Use of
Figure BDA0003405321570000073
Representing mutual information of two random variables, given optical flows u at two locations in an image Ii、ujThe concept of foreground Ω can be formalized as an area with 0 mutual information with the background: :
Figure BDA0003405321570000074
wherein the mutual information
Figure BDA0003405321570000075
Representing the optical flow ujProvided with respect to predicted optical flow uiThe larger the value, the larger the amount of information provided;
Figure BDA0003405321570000076
the information entropy is represented and used for quantifying the size of the information quantity, the larger the uncertainty of the variable is, the larger the information entropy is, and the value is always larger than 0.
Step 3.2, loss function based on the information reduction rate. According to the foreground and background defined above, and combining Shannon's information entropy theory, an information reduction rate is defined to construct the optimization objective. Taking two subsets (regions) x, y of D as input, the information reduction rate γ is defined as follows:

$$\gamma(x \mid y; I) = \frac{\mathcal{I}(u_x; u_y \mid I)}{H(u_x \mid I)} = 1 - \frac{H(u_x \mid u_y, I)}{H(u_x \mid I)}$$

where $\mathcal{I}(u_x; u_y \mid I)$ represents the amount of information that the optical flow $u_y$ provides about the predicted optical flow $u_x$; the Shannon entropy $H(u_x \mid I)$ represents the uncertainty of $u_x$; $H(u_x \mid u_y, I)$ represents the uncertainty of $u_x$ given $u_y$; and γ(x|y; I) represents the relative reduction of the uncertainty of $u_x$ when $u_y$ is known, with a value between 0 and 1. In particular, when $u_x$ and $u_y$ are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0. Writing the optical flow in the target image region Ω as $u^{in} = \{u_i, i \in \Omega\}$ and in the background region $\Omega^c$ as $u^{out} = \{u_j, j \in \Omega^c\}$, we have:

$$\gamma(\Omega \mid \Omega^c; I) = 1 - \frac{H(u^{in} \mid u^{out}, I)}{H(u^{in} \mid I)} = 1 - \frac{\mathbb{E}\big[-\log P(u^{in} \mid u^{out}, I)\big]}{\mathbb{E}\big[-\log P(u^{in} \mid I)\big]}$$

where $P(u^{in} \mid I)$ represents the probability that the optical flow is foreground optical flow and $P(u^{in} \mid u^{out}, I)$ represents the probability of $u^{in}$ given $u^{out}$. A loss function $\mathcal{L}(\Omega; I)$ is defined from the information reduction rate; when $\mathcal{L}(\Omega; I)$ is minimal, the optical flow of the background is sufficient to predict the foreground. The following strict assumption is made about the model:

$$H(u^{in} \mid u^{out}, I) \simeq \frac{1}{2\sigma^2}\,\mathbb{E}\big[\, \| u^{in} - \phi(\Omega, u^{out}, I) \|_2^2 \,\big] + \mathrm{const.}$$

where $\phi(\Omega, u^{out}, I) = \int u^{in}\, \mathrm{d}P(u^{in} \mid u^{out}, I)$ is the conditional mean, $\|\cdot\|_2$ denotes the vector norm, and σ² denotes the variance. Meanwhile, a function χ is introduced to represent D, Ω and $\Omega^c$:

$$\chi: D \to \{0, 1\}, \qquad \chi(i) = \begin{cases} 1, & i \in \Omega \\ 0, & i \in \Omega^c \end{cases}$$

Therefore, the optical flow flowing into Ω is represented by $u_i^{in} = \chi\, u_i$ and the flow flowing out by $u_i^{out} = (1 - \chi)\, u_i$.
Finally, χ and φ are selected as classes of functions parameterized by convolutional neural networks, with parameters denoted w; the corresponding functions are $\chi_{w_2}$ and $\phi_{w_1}$. To simplify the expression, the constant term of the loss function $\mathcal{L}$ is omitted and the remainder is converted into the negative of the original loss, which gives the final loss function $\mathcal{L}(w_1, w_2)$:

$$\mathcal{L}(w_1, w_2) = \mathbb{E}_I\Big[\, \big\| \chi_{w_2}\, u - \phi_{w_1}\big((1 - \chi_{w_2})\,u,\ I\big) \big\|_2^2 \,\Big]$$

where $\phi_{w_1}$ is the restorer i, which minimizes the above expression, and $w_1$ is its parameter; $\chi_{w_2}$ is the generator g, for which a suitable $\chi_{w_2}$ is selected so that $u_i^{out}$ does not provide information about $u_i^{in}$, i.e. so that the above expression is maximized, and $w_2$ is its parameter; I is the image.
Finally, the optimization objective is expressed in the following form:

$$\mathcal{L}^* = \max_{w_2} \min_{w_1} \mathcal{L}(w_1, w_2)$$
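As a small self-contained illustration of the information reduction rate γ defined above, the sketch below evaluates γ for two discrete toy joint distributions with numpy; the 2 × 2 probability tables are purely illustrative and are not part of the patented method, which works with continuous optical flow values.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def information_reduction_rate(joint):
    """gamma(x|y) = I(u_x; u_y) / H(u_x) from a discrete joint pmf P(u_x, u_y);
    rows index u_x, columns index u_y."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mutual_info = entropy(px) + entropy(py) - entropy(joint.ravel())
    return mutual_info / entropy(px)

independent = np.outer([0.5, 0.5], [0.5, 0.5])   # foreground and background move independently
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])   # the two regions share the same motion
print(information_reduction_rate(independent))    # ~0.0: u_y tells us nothing about u_x
print(information_reduction_rate(dependent))      # ~1.0: u_y fully determines u_x
```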
and 3.3, constructing a generator g and a restorer i which jointly form a generative countermeasure network, and effectively solving the optimization problem in the step 3.2. The generator g includes encoder and decoder sections for generating an optical flow mask image of the moving object, the network structure and parameters of which are shown in table 1. The restorer i includes an encoder and a decoder part, and optical flow information other than the mask image can be restored from the mask image generated by the generator g, and its network structure and parameters are shown in table 2.
1) Generator g inputs RGB image ItAnd its corresponding optical flow ut:t+δTOutputting a mask image mask of the moving target, wherein the delta T is uniformly distributed in U [ -5,5 [ ]]With random sampling and δ T ≠ 0, which introduces more about the image ItChange information of optical flow; the generator g consists of an encoder and a decoder; the encoder part consists of 5 convolutional layers eachEach convolution layer is followed by a BatchNormalization layer, and each convolution layer reduces the original image to 1/4 of the input image; 4 cavity convolution layers with gradually increased radiuses are arranged behind the encoder, and the radiuses are 2, 4, 8 and 16 in sequence; the decoder part is composed of 5 convolution layers and generates a mask image with the same size as the input image through up-sampling;
2) restorer I input as RGB image ItAnd a mask image mask generated by the generator g, which is output as an optical flow image other than the predicted mask image, that is, an optical flow image of the background; the encoder part of restorer i comprises two branches, and the structure and parameters of the two branches are completely the same, and are respectively composed of 9 convolutional layers, and LeakyReLu is used as an activation function after each convolutional layer. One of the network branches takes as input the normalized frame image and the other branch takes as input the optical flow image and the mask image generated by the generator. The features of the two network branch codes are connected by using a splicing operation (concat) and then transmitted to a decoder, the decoder mainly comprises an deconvolution layer and a LeakyReLu activation function, and meanwhile, a jump structure is used for performing feature fusion on the deep features and the shallow features after up-sampling. And finally outputting the optical flow image with the same size as the input image.
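A corresponding PyTorch sketch of the restorer described in item 2) is given below: two encoder branches of identical structure (9 convolutions each, LeakyReLU after every convolution), concatenation of the encoded features, and a deconvolution decoder. The downsampling schedule, channel widths and the omission of the skip connections are simplifying assumptions; the actual layer parameters are those listed in Table 2.

```python
import torch
import torch.nn as nn

def conv_branch(in_ch, width=32, layers=9):
    """One encoder branch: 9 conv layers, each followed by LeakyReLU;
    every third layer downsamples by 2 (an assumed schedule)."""
    blocks, ch = [], in_ch
    for i in range(layers):
        stride = 2 if i % 3 == 2 else 1
        out = width * (2 ** (i // 3))
        blocks += [nn.Conv2d(ch, out, 3, stride=stride, padding=1),
                   nn.LeakyReLU(0.1, inplace=True)]
        ch = out
    return nn.Sequential(*blocks), ch

class FlowRestorer(nn.Module):
    """Sketch of the restorer i: a frame branch and a flow+mask branch with the same
    structure, feature concatenation, and a deconvolution decoder (skips omitted)."""
    def __init__(self):
        super().__init__()
        self.image_branch, c1 = conv_branch(in_ch=3)     # normalized RGB frame
        self.flow_branch, c2 = conv_branch(in_ch=3)      # 2 flow channels + 1 mask channel
        dec, ch = [], c1 + c2
        for i in range(3):                               # undo the three downsamplings
            out = 2 if i == 2 else ch // 2
            dec += [nn.ConvTranspose2d(ch, out, 4, stride=2, padding=1)]
            if i < 2:
                dec += [nn.LeakyReLU(0.1, inplace=True)]
            ch = out
        self.decoder = nn.Sequential(*dec)

    def forward(self, image, flow, mask):
        a = self.image_branch(image)                     # branch 1: frame features
        b = self.flow_branch(torch.cat([flow, mask], 1)) # branch 2: flow + generator mask
        return self.decoder(torch.cat([a, b], dim=1))    # restored flow field (B, 2, H, W)
```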
TABLE 1 Generator network parameters
(Table 1 is reproduced as an image in the original publication.)
Note 1: Each convolutional layer is followed by Batch Normalization, which is not shown.
Note 2: The dilated convolution inserts rate-1 zeros between the elements of the convolution kernel, which enlarges the receptive field and captures multi-scale context information.
Note 3: The deconvolution layers perform signal restoration and upsampling.
Note 4: Attention modules are added to convolutional layers 2-3, 4-5, 7-10 and 11-12 to reduce the interference of background noise.
Table 2 restorer network parameters
(Table 2 is reproduced as an image in the original publication.)
Step 3.4, training the constructed generative adversarial network with the training data set to obtain the final network model.
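A minimal PyTorch training-loop sketch for this step is given below, using the `information_reduction_loss` sketched in step 3.2 and alternating one restorer (minimization) update with one generator (maximization) update per batch. The Adam optimizer, learning rate, batch size and epoch count are assumptions, not values from the patent.

```python
import torch
from torch.utils.data import DataLoader

def train_adversarial(generator, restorer, dataset, epochs=30, lr=1e-4, device=None):
    """Alternating min-max training: the restorer i minimizes the information
    reduction loss, the generator g maximizes it. dataset yields (image, flow) pairs
    of preprocessed frames and normalized optical flow."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    generator.to(device); restorer.to(device)
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_i = torch.optim.Adam(restorer.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    for _ in range(epochs):
        for image, flow in loader:
            image, flow = image.to(device), flow.to(device)
            # (1) restorer step: minimize the reconstruction error (parameter w1)
            loss_i = information_reduction_loss(generator, restorer, image, flow)
            opt_i.zero_grad(); loss_i.backward(); opt_i.step()
            # (2) generator step: maximize the same loss (parameter w2)
            loss_g = -information_reduction_loss(generator, restorer, image, flow)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return generator, restorer
```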
Step 4: detect the moving objects in the video sequence with the generator g of the trained network model;
First, the preprocessing of step 1 is applied to the video sequence to be detected; then, the corresponding optical flow images are computed according to the method of step 2; finally, the preprocessed video sequence images and the corresponding optical flow images are fed into the generator g obtained in step 3, and the output images are the mask images of the moving objects.
The invention relates to an unsupervised moving object detection method based on the information reduction rate. Based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, comprising a generator and a restorer, is constructed from the relationship between their optical flows to discriminate the background from the moving objects; an attention mechanism is introduced, which effectively improves the robustness of the tracking algorithm and reduces the interference of background noise and similar factors on target tracking. The basic idea is as follows: first, a video database is constructed and the videos are preprocessed; then, optical flow information is computed for adjacent frames of each video with PWCNet; next, the preprocessed videos and the corresponding optical flow information are used as the input of the attention-based generative adversarial network to train the network model; finally, for the video sequence to be detected, the generator module of the network model is used to obtain the moving object detection results. Compared with existing unsupervised moving object detection algorithms, the method makes full use of the optical flow information of the object and the background, fuses the feature channels of the moving object through the attention mechanism, reduces background interference, and improves moving object detection performance.

Claims (5)

1. An unsupervised moving object detection method based on the information reduction rate, characterized by comprising the following steps:
Step 1: acquiring a video sequence with a camera, preprocessing it, and constructing a database;
Step 2: computing the optical flow images corresponding to the video sequence with the trained PWCNet and normalizing them;
Step 3: training a generative adversarial network model with the video sequence and its corresponding optical flow images as input;
Step 4: processing the video sequence to be detected according to steps 1 to 2;
Step 5: extracting the generator module of the trained generative adversarial network model and detecting the moving objects in the video sequence to be detected.
2. The unsupervised moving object detection method based on the information reduction rate according to claim 1, characterized in that training the generative adversarial network model in step 3 specifically comprises the following steps:
Step 3.1, distinguishing the moving object from the background:
for a frame image I of the video sequence, let the image domain be D, the image region of the moving object be Ω, and the background be $\Omega^c = D \setminus \Omega$; the optical flow from the current frame to an adjacent frame is u, the adjacent frame being the previous frame or the next frame; the optical flow represents the apparent motion of the image brightness pattern and contains information about the surface structure and dynamic behavior of objects; using $\mathcal{I}(\cdot;\cdot \mid I)$ to denote the mutual information of two random variables, and given the optical flows $u_i$, $u_j$ at position i and position j in image I, the foreground Ω is formalized as the region whose mutual information with the background is 0:

$$\Omega = \{\, i \in D : \mathcal{I}(u_i; u_j \mid I) = 0,\ \forall j \in \Omega^c \,\}$$

where the mutual information $\mathcal{I}(u_i; u_j \mid I)$ represents the amount of information that the optical flow $u_j$ at position j in the given image I provides about the optical flow $u_i$ at position i, and the larger the mutual information, the more information is provided;
the Shannon entropy $H(u_i \mid I)$ represents the uncertainty of $u_i$; the greater the uncertainty of the variable, the larger the entropy, and its value is always greater than 0; $H(u_i \mid u_j, I)$ represents the uncertainty of $u_i$ given $u_j$;
Step 3.2, loss function based on the information reduction rate:
according to the foreground and background defined above, and combining Shannon's information entropy theory, an information reduction rate is defined to construct the optimization objective; taking two subsets of D, a region x and a region y, as input, with the optical flows of region x and region y being $u_x$ and $u_y$ respectively, the information reduction rate γ is defined as follows:

$$\gamma(x \mid y; I) = \frac{\mathcal{I}(u_x; u_y \mid I)}{H(u_x \mid I)} = 1 - \frac{H(u_x \mid u_y, I)}{H(u_x \mid I)}$$

where $\mathcal{I}(u_x; u_y \mid I)$ represents the amount of information that the optical flow $u_y$ of region y in the given image I can provide about the optical flow $u_x$ of region x; the Shannon entropy $H(u_x \mid I)$ represents the uncertainty of $u_x$; $H(u_x \mid u_y, I)$ represents the uncertainty of $u_x$ given $u_y$;
γ(x|y; I) represents the relative reduction of the uncertainty of $u_x$ when $u_y$ is known, and its value lies between 0 and 1; when $u_x$ and $u_y$ are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0; writing the optical flow in the target image region Ω as $u^{in} = \{u_i, i \in \Omega\}$ and in the background region $\Omega^c$ as $u^{out} = \{u_j, j \in \Omega^c\}$, we have:

$$\gamma(\Omega \mid \Omega^c; I) = 1 - \frac{H(u^{in} \mid u^{out}, I)}{H(u^{in} \mid I)} = 1 - \frac{\mathbb{E}\big[-\log P(u^{in} \mid u^{out}, I)\big]}{\mathbb{E}\big[-\log P(u^{in} \mid I)\big]}$$

where $P(u^{in} \mid I)$ represents the probability that the optical flow is foreground optical flow and $P(u^{in} \mid u^{out}, I)$ represents the probability of $u^{in}$ given $u^{out}$; a loss function $\mathcal{L}(\Omega; I)$ is defined from the information reduction rate; when $\mathcal{L}(\Omega; I)$ is minimal, the optical flow of the background is sufficient to predict the foreground;
the following strict assumption is made about the model:

$$H(u^{in} \mid u^{out}, I) \simeq \frac{1}{2\sigma^2}\,\mathbb{E}\big[\, \| u^{in} - \phi(\Omega, u^{out}, I) \|_2^2 \,\big] + \mathrm{const.}$$

where $\phi(\Omega, u^{out}, I) = \int u^{in}\, \mathrm{d}P(u^{in} \mid u^{out}, I)$, $\|\cdot\|_2$ denotes the vector norm, and σ² denotes the variance;
meanwhile, a function χ is introduced to represent D, Ω and $\Omega^c$:

$$\chi: D \to \{0, 1\}, \qquad \chi(i) = \begin{cases} 1, & i \in \Omega \\ 0, & i \in \Omega^c \end{cases}$$

so the optical flow flowing into Ω is $u_i^{in} = \chi\, u_i$ and the flow flowing out is $u_i^{out} = (1 - \chi)\, u_i$;
finally, χ and φ are selected as classes of functions parameterized by convolutional neural networks, with parameters denoted w and corresponding functions $\chi_{w_2}$ and $\phi_{w_1}$; the constant term of the loss function $\mathcal{L}$ is omitted and the remainder is converted into the negative of the original loss, which gives the final loss function $\mathcal{L}(w_1, w_2)$:

$$\mathcal{L}(w_1, w_2) = \mathbb{E}_I\Big[\, \big\| \chi_{w_2}\, u - \phi_{w_1}\big((1 - \chi_{w_2})\,u,\ I\big) \big\|_2^2 \,\Big]$$

where $\phi_{w_1}$ is the restorer i, which minimizes the above expression, and $w_1$ is the parameter of the restorer i; $\chi_{w_2}$ is the generator g, selected so that $u_i^{out}$ does not provide information about $u_i^{in}$, i.e. so that the above expression is maximized, and $w_2$ is the parameter of the generator g; I is the image;
the final optimization objective is expressed in the following form:

$$\mathcal{L}^* = \max_{w_2} \min_{w_1} \mathcal{L}(w_1, w_2)$$
Step 3.3, constructing the generator g and the restorer i, which together form the generative adversarial network, and solving the optimization problem in step 3.2; the generator g is used to generate the optical flow mask image of the moving object; the restorer i, with the CPN as its basic network architecture, restores the optical flow information inside the mask region from the mask image generated by the generator g and the corresponding optical flow image;
Step 3.4, training the constructed generative adversarial network with the DAVIS2016 data set to obtain the final generative adversarial network model.
3. The unsupervised moving object detection method based on the information reduction rate according to claim 2, characterized in that the generator g and the restorer i in step 3.3 together form the generative adversarial network, and the specific model is as follows:
1) the generator g takes as input an RGB image I_t and its corresponding optical flow u_{t:t+δT}, and outputs the mask image of the moving object, where δT is sampled uniformly at random from U[-5, 5] with δT ≠ 0, which introduces more information about how the optical flow of image I_t changes; the generator g consists of an encoder and a decoder; the encoder consists of 5 convolutional layers, each followed by a BN layer, and each convolutional layer reduces its input to 1/4 of its size; the encoder is followed by 4 dilated convolutional layers with gradually increasing dilation rates of 2, 4, 8 and 16; the decoder consists of 5 convolutional layers and generates, through upsampling, a mask image of the same size as the input image;
2) the restorer i takes as input the RGB image I_t and the mask image generated by the generator g, and outputs the optical flow image outside the predicted mask region, i.e. the optical flow image of the background; the encoder of the restorer i comprises two branches with exactly the same structure and parameters, each consisting of 9 convolutional layers with LeakyReLU as the activation function after each convolutional layer; one branch takes the normalized frame image as input, and the other branch takes the optical flow image and the mask image generated by the generator as input; the features encoded by the two branches are connected by the concatenation operation concat and passed to the decoder, which mainly consists of deconvolution layers and LeakyReLU activation functions, and a skip structure is used to fuse upsampled deep features with shallow features; the final output is an optical flow image of the same size as the input image.
4. The unsupervised moving object detection method based on the information reduction rate according to claim 3, characterized in that the encoders of the generator g and the restorer i of the generative adversarial network model in step 3 introduce a lightweight attention mechanism, the attention module comprising channel attention, spatial attention and global attention:
1) channel attention comprises three operations: squeeze, excitation and recalibration; first, for an input feature map F of size h × w × c, the squeeze operation compresses the input features along the spatial dimensions to obtain a feature vector s of size 1 × 1 × c representing the global features of the channels, each element of which corresponds to one channel of the feature map; in practice this is a global pooling of each feature map; then, the excitation operation establishes the correlations among the channels, the correlations of the c channels are learned with a weight w, and a channel weight e of size 1 × 1 × c is obtained, implemented by a 1 × 1 convolution operation; finally, the recalibration operation multiplies the channel weights with the original input feature map to obtain the weighted output feature map F'_C;
2) for the feature map F'_C, two feature matrices F_MAX and F_AVG are generated with max pooling and average pooling operations, respectively; the two feature matrices are then fused to obtain a fused feature map F_MA, which is processed by a Sigmoid activation function to obtain the spatial attention weight W, the fusion operation comprising concatenating the feature matrices along the channel dimension and then applying a convolution; finally, the spatial attention weight matrix W is multiplied with the original input feature map F to obtain the weighted output feature map F'_S;
3) the squeeze operation of the global attention is the same as that of the channel attention, while the excitation operation is replaced by 4 consecutive operations: fc(2C/16) → ReLU → fc(1) → Sigmoid, where fc(·) denotes a fully connected operation, C is the number of channels, and ReLU and Sigmoid are activation functions; the excitation produces a scale selection factor μ; from the output F'_S produced by the spatial attention mechanism and the scale selection factor μ, the scale-sensitive feature F'_G is computed as shown in the following formula:
F'_G = F + (μ * F'_S)
where the identity mapping term F is added in order to avoid losing important information in regions whose attention values are close to 0.
5. The unsupervised moving object detection method based on the information reduction rate according to claim 4, characterized in that in steps 4 to 5, the generator module of the trained generative adversarial network model is extracted to detect the moving objects in the video sequence to be detected, with the following specific steps:
first, the preprocessing of step 1 is applied to the video sequence to be detected;
then, the corresponding optical flow images are computed according to the method of step 2;
finally, the preprocessed video sequence images and the corresponding optical flow images are fed into the generator g obtained in step 3, and the output images are the prediction results for the moving objects.
CN202111510928.0A 2021-12-10 2021-12-10 Unsupervised moving object detection method based on information reduction rate Pending CN114494934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111510928.0A CN114494934A (en) 2021-12-10 2021-12-10 Unsupervised moving object detection method based on information reduction rate

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111510928.0A CN114494934A (en) 2021-12-10 2021-12-10 Unsupervised moving object detection method based on information reduction rate

Publications (1)

Publication Number Publication Date
CN114494934A true CN114494934A (en) 2022-05-13

Family

ID=81492078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111510928.0A Pending CN114494934A (en) 2021-12-10 2021-12-10 Unsupervised moving object detection method based on information reduction rate

Country Status (1)

Country Link
CN (1) CN114494934A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229336A (en) * 2023-05-10 2023-06-06 江西云眼视界科技股份有限公司 Video moving target identification method, system, storage medium and computer
CN116229336B (en) * 2023-05-10 2023-08-18 江西云眼视界科技股份有限公司 Video moving target identification method, system, storage medium and computer

Similar Documents

Publication Publication Date Title
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN108256562B (en) Salient target detection method and system based on weak supervision time-space cascade neural network
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN111523410B (en) Video saliency target detection method based on attention mechanism
CN113628249B (en) RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN107239730B (en) Quaternion deep neural network model method for intelligent automobile traffic sign recognition
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN111444924B (en) Method and system for detecting plant diseases and insect pests and analyzing disaster grade
CN111461129B (en) Context prior-based scene segmentation method and system
Maslov et al. Online supervised attention-based recurrent depth estimation from monocular video
CN114638836A (en) Urban street view segmentation method based on highly effective drive and multi-level feature fusion
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN116012395A (en) Multi-scale fusion smoke segmentation method based on depth separable convolution
CN116402851A (en) Infrared dim target tracking method under complex background
CN110688966B (en) Semantic guidance pedestrian re-recognition method
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN113536977B (en) 360-degree panoramic image-oriented saliency target detection method
CN114494934A (en) Unsupervised moving object detection method based on information reduction rate
Ma et al. A lightweight neural network for crowd analysis of images with congested scenes
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN117522923A (en) Target tracking system and method integrating multi-mode characteristics
CN117197438A (en) Target detection method based on visual saliency
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination