CN114494934A - Unsupervised moving object detection method based on information reduction rate - Google Patents

Unsupervised moving object detection method based on information reduction rate

Info

Publication number
CN114494934A
Authority
CN
China
Prior art keywords
image
optical flow
information
video sequence
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111510928.0A
Other languages
Chinese (zh)
Inventor
李军
刘江
付孟祥
王子文
张书恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202111510928.0A priority Critical patent/CN114494934A/en
Publication of CN114494934A publication Critical patent/CN114494934A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised moving object detection method based on the information reduction rate. The method comprises the following steps: acquiring a video sequence with a camera, preprocessing it, and constructing a database; computing the optical flow images corresponding to the video sequence with a trained PWCNet and normalizing them; training a generative adversarial network model with the video sequence and its corresponding optical flow images as input; applying the same processing to the video sequence to be detected; and extracting the generator module of the trained generative adversarial network model to detect the moving objects in the video sequence to be detected. Based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, consisting of a generator and a restorer, is constructed from the relationship between their optical flows to discriminate the background from the moving objects. The feature channels of the moving object are fused through an attention mechanism, which reduces background interference and improves moving object detection performance.

Description

Unsupervised moving object detection method based on information reduction rate
Technical Field
The invention belongs to the field of deep learning in computer vision, and particularly relates to an unsupervised moving target detection method based on an information reduction rate.
Background
Object detection is an important branch of computer vision. Its main purpose is to extract moving objects from a video sequence as the foreground, while the environment around the moving objects is treated as the background and separated from them. As a cross-disciplinary subject, object detection integrates theories and algorithms from image processing, machine learning, optimization and other fields, and it is the premise and basis for completing higher-level image understanding tasks such as object behavior recognition. Object detection technology has great research and application value and is widely used in intelligent video surveillance, intelligent human-computer interaction, intelligent transportation, visual navigation, unmanned driving, autonomous flight, battlefield reconnaissance and other fields. In recent years, with the development of computer technology and deep learning, object detection models have continuously developed and evolved, and many detection models have been created.
In the field of object detection, the average overlap ratio (IoU) between the target object and the predicted result is often used as the core evaluation criterion. In recent years, research on object detection can be divided into two categories: supervised learning methods and unsupervised learning methods. The PDB algorithm is a typical supervised algorithm: it extracts spatial features at multiple scales with a pyramid dilated convolution module, concatenates them, and feeds them into an extended DB-ConvLSTM structure to learn temporal information, thereby obtaining good detection results. The greatest characteristic of unsupervised object detection algorithms is that they do not require large numbers of labeled samples, which leaves them considerable room for development. The SAGE algorithm generates a spatio-temporal saliency map to estimate background and foreground information by computing geodesic distances between superpixels and edge pixels, but it relies mainly on the edge features and motion gradient features of images and easily produces noisy regions in scenes with complex textures. The CIS algorithm borrows the idea of generative adversarial networks and distinguishes the background from the moving object through an information reduction rate defined on optical flow information, so it can detect moving objects well. However, for existing unsupervised algorithms, the performance of many methods degrades when the target is imaged under unfavorable conditions such as viewing angle, illumination, occlusion, background interference, and noise introduced by the acquisition equipment.
Disclosure of Invention
The purpose of the invention is to provide an unsupervised moving object detection method based on the information reduction rate, which makes full use of the optical flow information of the object and the background, fuses the feature channels of the moving object through an attention mechanism, reduces background interference, and improves moving object detection performance.
The technical solution for achieving the purpose of the invention is as follows: an unsupervised moving object detection method based on the information reduction rate, comprising the following steps:
Step 1: acquire a video sequence with a camera, preprocess it, and construct a database;
Step 2: compute the optical flow images corresponding to the video sequence with the trained PWCNet and normalize them;
Step 3: train a generative adversarial network model with the video sequence and its corresponding optical flow images as input;
Step 4: process the video sequence to be detected according to steps 1 to 2;
Step 5: extract the generator module of the trained generative adversarial network model and detect the moving objects in the video sequence to be detected.
Compared with the prior art, the invention has the following notable advantages: (1) based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, comprising a generator and a restorer, is constructed from the relationship between their optical flows to discriminate the background from the moving objects; an attention mechanism is introduced, which effectively improves the robustness of the tracking algorithm and reduces the interference of background noise and similar factors on target tracking; (2) the optical flow information of the object and the background is fully exploited, the feature channels of the moving object are fused through the attention mechanism, background interference is reduced, and moving object detection performance is improved.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of a basic network architecture of the present invention.
Fig. 3 is a diagram of the detection outputs of the generator module of the network model for the moving objects in part of the video sequences.
FIG. 4 is a network architecture diagram of a generator module of the network model.
Detailed Description
The invention relates to an unsupervised moving object detection method based on the information reduction rate, comprising the following steps:
Step 1: acquire a video sequence with a camera, preprocess it, and construct a database;
Step 2: compute the optical flow images corresponding to the video sequence with the trained PWCNet and normalize them;
Step 3: train a generative adversarial network model with the video sequence and its corresponding optical flow images as input;
Step 4: process the video sequence to be detected according to steps 1 to 2;
Step 5: extract the generator module of the trained generative adversarial network model and detect the moving objects in the video sequence to be detected.
Further, training the generative adversarial network model in step 3 specifically comprises the following steps:
Step 3.1, distinguishing the moving object from the background:
based on the principle that the background image region should not contain information of the moving object foreground image region, the image of the region of interest can be interpreted as poorly as possible by learning regions outside the region of interest. Specifically, for a frame of image I of a video sequence, an image region is assumed to be D, an image region of a moving object is assumed to be Ω, and a background is assumed to be ΩcD/omega, which flows to the adjacent frame (last one)Frame or next frame) is u. The optical flow represents apparent motion of an image brightness pattern, and includes important information of the surface structure and dynamic behavior of an object. Use of
Figure BDA0003405321570000034
Representing mutual information of two random variables, given optical flows u at positions I, j in image Ii、ujThe concept of the foreground Ω is formalized as an area with mutual information 0 with the background:
Figure BDA0003405321570000031
wherein the mutual information
Figure BDA0003405321570000033
Optical flow u representing a position j in a given image IjOptical flow u provided in relation to position iiThe larger the mutual information value is, the larger the provided information amount is; shannon information entropy H (u)iI) represents uiThe larger the uncertainty of the variable is, the larger the information entropy is, and the value is always larger than 0; h (u)i|ujI) is represented in the known ujUnder the conditions of (a) uiUncertainty of (d);
Step 3.2, loss function based on the information reduction rate:
According to the foreground and background defined above, and combining Shannon's information entropy theory, an information reduction rate is defined to construct the optimization objective. Taking two subsets of D, a region x and a region y, as input, with the optical flows of region x and region y being $u_x$ and $u_y$ respectively, the information reduction rate γ is defined as follows:

$$\gamma(x \mid y; I) = \frac{\mathcal{I}(u_x; u_y \mid I)}{H(u_x \mid I)} = 1 - \frac{H(u_x \mid u_y, I)}{H(u_x \mid I)}$$

where $\mathcal{I}(u_x; u_y \mid I)$ represents the amount of information that the optical flow $u_y$ of region y in the given image I can provide about the optical flow $u_x$ of region x; the Shannon entropy $H(u_x \mid I)$ represents the uncertainty of $u_x$; and $H(u_x \mid u_y, I)$ represents the uncertainty of $u_x$ given $u_y$.
γ(x|y; I) represents the relative reduction of the uncertainty of $u_x$ when $u_y$ is known, and its value lies between 0 and 1; when $u_x$ and $u_y$ are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0. Writing the optical flow in the target image region Ω as $u^{in} = \{u_i, i \in \Omega\}$ and in the background region $\Omega^c$ as $u^{out} = \{u_j, j \in \Omega^c\}$, we have:

$$\gamma(\Omega \mid \Omega^c; I) = 1 - \frac{H(u^{in} \mid u^{out}, I)}{H(u^{in} \mid I)} = 1 - \frac{\mathbb{E}\big[-\log P(u^{in} \mid u^{out}, I)\big]}{\mathbb{E}\big[-\log P(u^{in} \mid I)\big]}$$

where $P(u^{in} \mid I)$ represents the probability that the optical flow is foreground optical flow and $P(u^{in} \mid u^{out}, I)$ represents the probability of $u^{in}$ given $u^{out}$. A loss function $\mathcal{L}(\Omega; I)$ is defined from the information reduction rate; when $\mathcal{L}(\Omega; I)$ is minimal, the optical flow of the background is sufficient to predict the foreground.
The following strict assumption is made about the model:

$$H(u^{in} \mid u^{out}, I) \simeq \frac{1}{2\sigma^2}\,\mathbb{E}\big[\, \| u^{in} - \phi(\Omega, u^{out}, I) \|_2^2 \,\big] + \mathrm{const.}$$

where $\phi(\Omega, u^{out}, I) = \int u^{in}\, \mathrm{d}P(u^{in} \mid u^{out}, I)$ is the conditional mean, $\|\cdot\|_2$ denotes the vector norm, and σ² denotes the variance.
Meanwhile, a function χ is introduced to represent D, Ω and $\Omega^c$:

$$\chi: D \to \{0, 1\}, \qquad \chi(i) = \begin{cases} 1, & i \in \Omega \\ 0, & i \in \Omega^c \end{cases}$$

Therefore, the optical flow flowing into Ω is represented by $u_i^{in} = \chi\, u_i$ and the flow flowing out by $u_i^{out} = (1 - \chi)\, u_i$.
Finally, χ and φ are selected as classes of functions parameterized by convolutional neural networks, with parameters denoted w; the corresponding functions are $\chi_{w_2}$ and $\phi_{w_1}$. To simplify the expression, the constant term of the loss function $\mathcal{L}$ is omitted and the remainder is converted into the negative of the original loss, which gives the final loss function $\mathcal{L}(w_1, w_2)$:

$$\mathcal{L}(w_1, w_2) = \mathbb{E}_I\Big[\, \big\| \chi_{w_2}\, u - \phi_{w_1}\big((1 - \chi_{w_2})\,u,\ I\big) \big\|_2^2 \,\Big]$$

where $\phi_{w_1}$ is the restorer i, which minimizes the above expression, and $w_1$ is the parameter of the restorer i; $\chi_{w_2}$ is the generator g, selected so that $u_i^{out}$ does not provide information about $u_i^{in}$, i.e. so that the above expression is maximized, and $w_2$ is the parameter of the generator g; I is the image.
Finally, the optimization objective is expressed in the following form:

$$\mathcal{L}^* = \max_{w_2} \min_{w_1} \mathcal{L}(w_1, w_2)$$
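To make the objective above concrete, the following is a minimal PyTorch sketch of the loss term under stated assumptions: the generator returns a soft mask χ in [0, 1] of shape (B, 1, H, W), the restorer takes the frame, the background flow and the mask as inputs, and the expectation is replaced by a batch mean. The normalization by the mask area is an added safeguard against the degenerate all-foreground mask and is not taken from the patent text.

```python
import torch

def information_reduction_loss(generator, restorer, image, flow):
    """Sketch of L(w1, w2): the restorer i tries to reconstruct the masked-out flow
    from the background flow (minimize), while the generator g tries to make that
    reconstruction fail (maximize). image: (B, 3, H, W), flow: (B, 2, H, W)."""
    chi = generator(image, flow)                       # soft mask chi in [0, 1], shape (B, 1, H, W)
    flow_out = (1.0 - chi) * flow                      # optical flow of the background region
    flow_pred = restorer(image, flow_out, chi)         # restorer's guess of the full flow field
    err = (chi * (flow - flow_pred) ** 2).sum(dim=(1, 2, 3))   # squared error inside the mask
    area = chi.sum(dim=(1, 2, 3)).clamp(min=1e-6)      # mask area (assumed normalization)
    return (err / area).mean()
```

During training, the restorer parameters w1 are updated to minimize this value and the generator parameters w2 to maximize it; an alternating update loop is sketched in step 3.4 of the embodiment below.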
Step 3.3, constructing the generator g and the restorer i, which together form the generative adversarial network, and solving the optimization problem in step 3.2; the generator g is used to generate the optical flow mask image (mask) of the moving object; the restorer i, with the CPN as its basic network architecture, restores the optical flow information inside the mask region from the mask image generated by the generator g and the corresponding optical flow image;
Step 3.4, training the constructed generative adversarial network with the DAVIS2016 data set to obtain the final generative adversarial network model.
Further, the generator g and the restorer i in step 3.3 together form the generative adversarial network; the specific model is as follows:
1) The generator g takes as input an RGB image I_t and its corresponding optical flow u_{t:t+δT}, and outputs the mask image (mask) of the moving object, where δT is sampled uniformly at random from U[-5, 5] with δT ≠ 0, which introduces more information about how the optical flow of image I_t changes. The generator g consists of an encoder and a decoder. The encoder consists of 5 convolutional layers, each followed by a BN layer, and each convolutional layer reduces its input to 1/4 of its size; the encoder is followed by 4 dilated convolutional layers with gradually increasing dilation rates of 2, 4, 8 and 16. The decoder consists of 5 convolutional layers and generates, through upsampling, a mask image of the same size as the input image. A code sketch of this encoder-decoder is given after item 2) below.
2) The restorer i takes as input the RGB image I_t and the mask image generated by the generator g, and outputs the optical flow image outside the predicted mask region, i.e. the optical flow image of the background. The encoder of the restorer i comprises two branches with exactly the same structure and parameters, each consisting of 9 convolutional layers with LeakyReLU as the activation function after each convolutional layer. One branch takes the normalized frame image as input, and the other branch takes the optical flow image and the mask image generated by the generator as input. The features encoded by the two branches are connected by a concatenation operation (concat) and passed to the decoder, which mainly consists of deconvolution layers and LeakyReLU activation functions; a skip structure is used to fuse upsampled deep features with shallow features. The final output is an optical flow image of the same size as the input image.
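As an illustration of the generator described in item 1), the following PyTorch sketch assembles a 5-layer strided-convolution encoder with batch normalization, 4 dilated convolutions with rates 2, 4, 8 and 16, and a 5-layer upsampling decoder that outputs a single-channel soft mask. The channel widths, kernel sizes, stride-2 downsampling and sigmoid output are assumptions; the actual layer parameters are those listed in Table 1 of the embodiment.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Sketch of the generator g: strided conv encoder with BN, dilated convolutions
    with rates 2, 4, 8, 16, and an upsampling decoder producing a soft mask."""
    def __init__(self, in_ch=5, base=32):
        super().__init__()
        enc, ch = [], in_ch
        for i in range(5):                              # encoder: 5 x (conv + BN + ReLU)
            out = base * min(2 ** i, 8)
            enc += [nn.Conv2d(ch, out, 3, stride=2, padding=1),
                    nn.BatchNorm2d(out), nn.ReLU(inplace=True)]
            ch = out
        self.encoder = nn.Sequential(*enc)
        dil = []
        for r in (2, 4, 8, 16):                         # 4 dilated convs, increasing rates
            dil += [nn.Conv2d(ch, ch, 3, padding=r, dilation=r), nn.ReLU(inplace=True)]
        self.dilated = nn.Sequential(*dil)
        dec = []
        for i in range(5):                              # decoder: 5 x (upsample + conv)
            out = 1 if i == 4 else max(ch // 2, base)
            dec += [nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                    nn.Conv2d(ch, out, 3, padding=1)]
            if i < 4:
                dec += [nn.ReLU(inplace=True)]
            ch = out
        self.decoder = nn.Sequential(*dec)

    def forward(self, image, flow):
        x = torch.cat([image, flow], dim=1)             # RGB frame (3) + flow (2) channels
        return torch.sigmoid(self.decoder(self.dilated(self.encoder(x))))
```

For example, `MaskGenerator()(torch.randn(1, 3, 256, 256), torch.randn(1, 2, 256, 256))` returns a (1, 1, 256, 256) soft mask; the sketch assumes the input height and width are divisible by 32.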
Further, the encoders of the generator g and the restorer i of the generative adversarial network model of step 3 introduce a lightweight attention mechanism; the attention module includes channel attention, spatial attention and global attention.
1) Channel attention mainly comprises three operations: squeeze, excitation and recalibration. First, for an input feature map F of size h × w × c, the squeeze operation compresses the input features along the spatial dimensions to obtain a feature vector s of size 1 × 1 × c representing the global features of the channels; each element of the vector corresponds to one channel of the feature map, and in practice this is a global pooling of each feature map. Then, the excitation operation establishes the correlations among the channels: the correlations of the c channels are learned with a weight w, yielding a channel weight e of size 1 × 1 × c, generally implemented with a 1 × 1 convolution operation. Finally, the recalibration operation multiplies the channel weights with the original input feature map to obtain the weighted output feature map F'_C.
2) For the feature map F'_C, two feature matrices F_MAX and F_AVG are generated with max pooling and average pooling operations, respectively. The two feature matrices are then fused to obtain a fused feature map F_MA, which is processed by a Sigmoid activation function to obtain the spatial attention weight W; the fusion operation generally concatenates the feature matrices along the channel dimension and then applies a convolution. Finally, the spatial attention weight matrix W is multiplied with the original input feature map F to obtain the weighted output feature map F'_S.
3) The squeeze operation of the global attention is the same as that of the channel attention, while the excitation operation is replaced by 4 consecutive operations: fc(2C/16) → ReLU → fc(1) → Sigmoid, where fc(·) denotes a fully connected operation, C is the number of channels, and ReLU and Sigmoid are activation functions. The excitation produces a scale selection factor μ. From the output F'_S of the spatial attention mechanism and the scale selection factor μ, the scale-sensitive feature F'_G is computed as shown in the following formula:
F'_G = F + (μ * F'_S)
where the identity mapping term F is added in order to avoid losing important information in regions whose attention values are close to 0.
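A compact PyTorch sketch of the lightweight attention block described in items 1) to 3) follows. The reduction ratio of 16 in the channel excitation and the 7 × 7 convolution in the spatial branch are assumptions made for the sketch; the three branches are combined as F'_G = F + μ * F'_S, as stated above.

```python
import torch
import torch.nn as nn

class LightAttention(nn.Module):
    """Sketch of the lightweight attention module: channel attention (squeeze,
    excitation, recalibration), spatial attention (channel-wise max/average maps),
    and global attention producing a scalar scale selection factor mu."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # channel attention: squeeze (global average pool) + two-layer excitation
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # spatial attention: fuse the max and average maps with a convolution
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        # global attention excitation: fc(2C/16) -> ReLU -> fc(1) -> Sigmoid
        self.global_fc = nn.Sequential(
            nn.Linear(channels, 2 * channels // 16), nn.ReLU(inplace=True),
            nn.Linear(2 * channels // 16, 1), nn.Sigmoid())

    def forward(self, f):                                   # f: (B, C, H, W)
        b, c, _, _ = f.shape
        s = f.mean(dim=(2, 3))                              # squeeze -> (B, C)
        e = self.excite(s).view(b, c, 1, 1)                 # channel weight e
        f_c = f * e                                         # recalibrated feature F'_C
        f_max = f_c.max(dim=1, keepdim=True).values         # channel-wise max map F_MAX
        f_avg = f_c.mean(dim=1, keepdim=True)               # channel-wise average map F_AVG
        w = self.spatial(torch.cat([f_max, f_avg], dim=1))  # spatial weight W
        f_s = f * w                                         # weighted feature F'_S
        mu = self.global_fc(s).view(b, 1, 1, 1)             # scale selection factor mu
        return f + mu * f_s                                 # F'_G = F + mu * F'_S
```

A module like this would be inserted after the convolutional layer pairs named in Note 4 of Table 1.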
Further, in steps 4 to 5, the generator module of the trained generative adversarial network model is extracted to detect the moving objects in the video sequence to be detected; the specific steps are as follows:
First, the preprocessing of step 1 is applied to the video sequence to be detected;
then, the corresponding optical flow images are computed according to the method of step 2;
finally, the preprocessed video sequence images and the corresponding optical flow images are fed into the generator g obtained in step 3, and the output images are the prediction results for the moving objects.
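The following sketch shows this inference stage, assuming a generator with the interface sketched earlier (frame plus flow in, soft mask out); the binarization threshold of 0.5 is an assumed post-processing choice that is not specified in the patent.

```python
import torch

@torch.no_grad()
def detect_moving_objects(generator, frames, flows, threshold=0.5):
    """Run only the trained generator g on a preprocessed sequence.
    frames: list of (3, H, W) tensors; flows: list of (2, H, W) normalized flow tensors.
    Returns one binary moving-object mask per frame."""
    generator.eval()
    masks = []
    for img, flow in zip(frames, flows):
        soft = generator(img.unsqueeze(0), flow.unsqueeze(0))[0, 0]  # soft mask in [0, 1]
        masks.append((soft > threshold).float())
    return masks
```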
The present invention is further described below with reference to the accompanying drawings and an embodiment, which are included to provide a further understanding of the invention and are not intended to limit its scope; various equivalent modifications of the invention will become apparent to those skilled in the art after reading the present specification and the appended claims.
Examples
The invention provides an unsupervised moving object detection method with an attention mechanism. Based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, comprising a generator and a restorer, is constructed from the relationship between their optical flows to discriminate the background from the moving objects; an attention mechanism is introduced, which effectively improves the robustness of the tracking algorithm and reduces the interference of background noise and similar factors on target tracking. The basic idea is as follows: first, a video database is constructed and the videos are preprocessed; then, optical flow information is computed for adjacent frames of each video with PWCNet; next, the preprocessed videos and the corresponding optical flow information are used as the input of the attention-based generative adversarial network to train the network model; finally, for the video sequence to be detected, the generator module of the network model is used to obtain the moving object detection results.
As shown in Fig. 1, the implementation of the present invention mainly comprises four steps: (1) preprocessing the video sequence; (2) obtaining the optical flow images of the video sequence through PWCNet; (3) training the generative adversarial network with the video sequence and its corresponding optical flow images as input; (4) detecting the moving objects in the video sequence with the generator module of the trained network model and outputting the detection results.
Step 1: acquire a video sequence with a camera, preprocess it, and construct a database;
Because a video sequence collected in a natural scene may suffer from interference such as uneven illumination, the video sequence is preprocessed; the preprocessing mainly includes histogram equalization and normalization of the video sequence.
Step 2: obtain the optical flow images of the video sequence through PWCNet;
given the optical flow u: D of the image I to be measured for one frame down (up)1→R2Is that
Figure BDA0003405321570000071
To
Figure BDA0003405321570000072
To (3) is performed. PWCNet is a high-performance optical flow learning network, which can efficiently acquire optical flow information of video sequences. The invention adopts PWCNet to calculate optical flow information and carries out normalization, wherein the normalization operation mainly comprises the steps of adjusting the optical flow image to be the same as the video sequence, and then dividing the optical flow image by a constant, namely, reducing the value of the optical flow image in equal proportion to accelerate the training of the network.
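A minimal sketch of this flow post-processing is given below, assuming the PWCNet output is available as an (H, W, 2) float array; the scaling constant of 20.0 and the bilinear resizing are illustrative assumptions.

```python
import cv2
import numpy as np

def normalize_flow(flow, frame_size, scale=20.0):
    """Resize a PWCNet flow field to the frame size and shrink its values.
    flow: (H, W, 2) float32 array; frame_size: (height, width) of the video frames."""
    h, w = frame_size
    fh, fw = flow.shape[:2]
    resized = cv2.resize(flow, (w, h), interpolation=cv2.INTER_LINEAR)
    resized[..., 0] *= w / float(fw)     # rescale horizontal displacements to the new size
    resized[..., 1] *= h / float(fh)     # rescale vertical displacements to the new size
    return resized / scale               # divide by a constant to speed up training
```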
Step 3: train the generative adversarial network with the video sequence and its corresponding optical flow images as input;
and 3.1, distinguishing the moving object from the background. Based on the principle that the background image region should not contain information of the moving object foreground image region, the image of the region of interest can be interpreted as poorly as possible by learning regions outside the region of interest. Specifically, for a frame of image I of a video sequence, an image region is assumed to be D, an image region of a moving object is assumed to be Ω, and a background is assumed to be ΩcD/Ω, and the optical flow to an adjacent frame (the previous frame or the next frame) is u. The optical flow represents the apparent motion of the image brightness pattern, and contains important information of the surface structure and dynamic behavior of the object. Use of
Figure BDA0003405321570000073
Representing mutual information of two random variables, given optical flows u at two locations in an image Ii、ujThe concept of foreground Ω can be formalized as an area with 0 mutual information with the background: :
Figure BDA0003405321570000074
wherein the mutual information
Figure BDA0003405321570000075
Representing the optical flow ujProvided with respect to predicted optical flow uiThe larger the value, the larger the amount of information provided;
Figure BDA0003405321570000076
the information entropy is represented and used for quantifying the size of the information quantity, the larger the uncertainty of the variable is, the larger the information entropy is, and the value is always larger than 0.
Step 3.2, loss function based on the information reduction rate. According to the foreground and background defined above, and combining Shannon's information entropy theory, an information reduction rate is defined to construct the optimization objective. Taking two subsets (regions) x, y of D as input, the information reduction rate γ is defined as follows:

$$\gamma(x \mid y; I) = \frac{\mathcal{I}(u_x; u_y \mid I)}{H(u_x \mid I)} = 1 - \frac{H(u_x \mid u_y, I)}{H(u_x \mid I)}$$

where $\mathcal{I}(u_x; u_y \mid I)$ represents the amount of information that the optical flow $u_y$ provides about the predicted optical flow $u_x$; the Shannon entropy $H(u_x \mid I)$ represents the uncertainty of $u_x$; $H(u_x \mid u_y, I)$ represents the uncertainty of $u_x$ given $u_y$; and γ(x|y; I) represents the relative reduction of the uncertainty of $u_x$ when $u_y$ is known, with a value between 0 and 1. In particular, when $u_x$ and $u_y$ are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0. Writing the optical flow in the target image region Ω as $u^{in} = \{u_i, i \in \Omega\}$ and in the background region $\Omega^c$ as $u^{out} = \{u_j, j \in \Omega^c\}$, we have:

$$\gamma(\Omega \mid \Omega^c; I) = 1 - \frac{H(u^{in} \mid u^{out}, I)}{H(u^{in} \mid I)} = 1 - \frac{\mathbb{E}\big[-\log P(u^{in} \mid u^{out}, I)\big]}{\mathbb{E}\big[-\log P(u^{in} \mid I)\big]}$$

where $P(u^{in} \mid I)$ represents the probability that the optical flow is foreground optical flow and $P(u^{in} \mid u^{out}, I)$ represents the probability of $u^{in}$ given $u^{out}$. A loss function $\mathcal{L}(\Omega; I)$ is defined from the information reduction rate; when $\mathcal{L}(\Omega; I)$ is minimal, the optical flow of the background is sufficient to predict the foreground. The following strict assumption is made about the model:

$$H(u^{in} \mid u^{out}, I) \simeq \frac{1}{2\sigma^2}\,\mathbb{E}\big[\, \| u^{in} - \phi(\Omega, u^{out}, I) \|_2^2 \,\big] + \mathrm{const.}$$

where $\phi(\Omega, u^{out}, I) = \int u^{in}\, \mathrm{d}P(u^{in} \mid u^{out}, I)$ is the conditional mean, $\|\cdot\|_2$ denotes the vector norm, and σ² denotes the variance. Meanwhile, a function χ is introduced to represent D, Ω and $\Omega^c$:

$$\chi: D \to \{0, 1\}, \qquad \chi(i) = \begin{cases} 1, & i \in \Omega \\ 0, & i \in \Omega^c \end{cases}$$

Therefore, the optical flow flowing into Ω is represented by $u_i^{in} = \chi\, u_i$ and the flow flowing out by $u_i^{out} = (1 - \chi)\, u_i$.
Finally, χ and φ are selected as classes of functions parameterized by convolutional neural networks, with parameters denoted w; the corresponding functions are $\chi_{w_2}$ and $\phi_{w_1}$. To simplify the expression, the constant term of the loss function $\mathcal{L}$ is omitted and the remainder is converted into the negative of the original loss, which gives the final loss function $\mathcal{L}(w_1, w_2)$:

$$\mathcal{L}(w_1, w_2) = \mathbb{E}_I\Big[\, \big\| \chi_{w_2}\, u - \phi_{w_1}\big((1 - \chi_{w_2})\,u,\ I\big) \big\|_2^2 \,\Big]$$

where $\phi_{w_1}$ is the restorer i, which minimizes the above expression, and $w_1$ is its parameter; $\chi_{w_2}$ is the generator g, for which a suitable $\chi_{w_2}$ is selected so that $u_i^{out}$ does not provide information about $u_i^{in}$, i.e. so that the above expression is maximized, and $w_2$ is its parameter; I is the image.
Finally, the optimization objective is expressed in the following form:

$$\mathcal{L}^* = \max_{w_2} \min_{w_1} \mathcal{L}(w_1, w_2)$$
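As a small self-contained illustration of the information reduction rate γ defined above, the sketch below evaluates γ for two discrete toy joint distributions with numpy; the 2 × 2 probability tables are purely illustrative and are not part of the patented method, which works with continuous optical flow values.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def information_reduction_rate(joint):
    """gamma(x|y) = I(u_x; u_y) / H(u_x) from a discrete joint pmf P(u_x, u_y);
    rows index u_x, columns index u_y."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mutual_info = entropy(px) + entropy(py) - entropy(joint.ravel())
    return mutual_info / entropy(px)

independent = np.outer([0.5, 0.5], [0.5, 0.5])   # foreground and background move independently
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])   # the two regions share the same motion
print(information_reduction_rate(independent))    # ~0.0: u_y tells us nothing about u_x
print(information_reduction_rate(dependent))      # ~1.0: u_y fully determines u_x
```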
and 3.3, constructing a generator g and a restorer i which jointly form a generative countermeasure network, and effectively solving the optimization problem in the step 3.2. The generator g includes encoder and decoder sections for generating an optical flow mask image of the moving object, the network structure and parameters of which are shown in table 1. The restorer i includes an encoder and a decoder part, and optical flow information other than the mask image can be restored from the mask image generated by the generator g, and its network structure and parameters are shown in table 2.
1) Generator g inputs RGB image ItAnd its corresponding optical flow ut:t+δTOutputting a mask image mask of the moving target, wherein the delta T is uniformly distributed in U [ -5,5 [ ]]With random sampling and δ T ≠ 0, which introduces more about the image ItChange information of optical flow; the generator g consists of an encoder and a decoder; the encoder part consists of 5 convolutional layers eachEach convolution layer is followed by a BatchNormalization layer, and each convolution layer reduces the original image to 1/4 of the input image; 4 cavity convolution layers with gradually increased radiuses are arranged behind the encoder, and the radiuses are 2, 4, 8 and 16 in sequence; the decoder part is composed of 5 convolution layers and generates a mask image with the same size as the input image through up-sampling;
2) restorer I input as RGB image ItAnd a mask image mask generated by the generator g, which is output as an optical flow image other than the predicted mask image, that is, an optical flow image of the background; the encoder part of restorer i comprises two branches, and the structure and parameters of the two branches are completely the same, and are respectively composed of 9 convolutional layers, and LeakyReLu is used as an activation function after each convolutional layer. One of the network branches takes as input the normalized frame image and the other branch takes as input the optical flow image and the mask image generated by the generator. The features of the two network branch codes are connected by using a splicing operation (concat) and then transmitted to a decoder, the decoder mainly comprises an deconvolution layer and a LeakyReLu activation function, and meanwhile, a jump structure is used for performing feature fusion on the deep features and the shallow features after up-sampling. And finally outputting the optical flow image with the same size as the input image.
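A corresponding PyTorch sketch of the restorer described in item 2) is given below: two encoder branches of identical structure (9 convolutions each, LeakyReLU after every convolution), concatenation of the encoded features, and a deconvolution decoder. The downsampling schedule, channel widths and the omission of the skip connections are simplifying assumptions; the actual layer parameters are those listed in Table 2.

```python
import torch
import torch.nn as nn

def conv_branch(in_ch, width=32, layers=9):
    """One encoder branch: 9 conv layers, each followed by LeakyReLU;
    every third layer downsamples by 2 (an assumed schedule)."""
    blocks, ch = [], in_ch
    for i in range(layers):
        stride = 2 if i % 3 == 2 else 1
        out = width * (2 ** (i // 3))
        blocks += [nn.Conv2d(ch, out, 3, stride=stride, padding=1),
                   nn.LeakyReLU(0.1, inplace=True)]
        ch = out
    return nn.Sequential(*blocks), ch

class FlowRestorer(nn.Module):
    """Sketch of the restorer i: a frame branch and a flow+mask branch with the same
    structure, feature concatenation, and a deconvolution decoder (skips omitted)."""
    def __init__(self):
        super().__init__()
        self.image_branch, c1 = conv_branch(in_ch=3)     # normalized RGB frame
        self.flow_branch, c2 = conv_branch(in_ch=3)      # 2 flow channels + 1 mask channel
        dec, ch = [], c1 + c2
        for i in range(3):                               # undo the three downsamplings
            out = 2 if i == 2 else ch // 2
            dec += [nn.ConvTranspose2d(ch, out, 4, stride=2, padding=1)]
            if i < 2:
                dec += [nn.LeakyReLU(0.1, inplace=True)]
            ch = out
        self.decoder = nn.Sequential(*dec)

    def forward(self, image, flow, mask):
        a = self.image_branch(image)                     # branch 1: frame features
        b = self.flow_branch(torch.cat([flow, mask], 1)) # branch 2: flow + generator mask
        return self.decoder(torch.cat([a, b], dim=1))    # restored flow field (B, 2, H, W)
```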
TABLE 1 Generator network parameters
(Table 1 is reproduced as an image in the original publication.)
Note 1: Each convolutional layer is followed by Batch Normalization, which is not shown.
Note 2: The dilated convolution inserts rate-1 zeros between the elements of the convolution kernel, which enlarges the receptive field and captures multi-scale context information.
Note 3: The deconvolution layers perform signal restoration and upsampling.
Note 4: Attention modules are added to convolutional layers 2-3, 4-5, 7-10 and 11-12 to reduce the interference of background noise.
Table 2 restorer network parameters
(Table 2 is reproduced as an image in the original publication.)
Step 3.4, training the constructed generative adversarial network with the training data set to obtain the final network model.
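A minimal PyTorch training-loop sketch for this step is given below, using the `information_reduction_loss` sketched in step 3.2 and alternating one restorer (minimization) update with one generator (maximization) update per batch. The Adam optimizer, learning rate, batch size and epoch count are assumptions, not values from the patent.

```python
import torch
from torch.utils.data import DataLoader

def train_adversarial(generator, restorer, dataset, epochs=30, lr=1e-4, device=None):
    """Alternating min-max training: the restorer i minimizes the information
    reduction loss, the generator g maximizes it. dataset yields (image, flow) pairs
    of preprocessed frames and normalized optical flow."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    generator.to(device); restorer.to(device)
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_i = torch.optim.Adam(restorer.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    for _ in range(epochs):
        for image, flow in loader:
            image, flow = image.to(device), flow.to(device)
            # (1) restorer step: minimize the reconstruction error (parameter w1)
            loss_i = information_reduction_loss(generator, restorer, image, flow)
            opt_i.zero_grad(); loss_i.backward(); opt_i.step()
            # (2) generator step: maximize the same loss (parameter w2)
            loss_g = -information_reduction_loss(generator, restorer, image, flow)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return generator, restorer
```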
Step 4: detect the moving objects in the video sequence with the generator g of the trained network model;
First, the preprocessing of step 1 is applied to the video sequence to be detected; then, the corresponding optical flow images are computed according to the method of step 2; finally, the preprocessed video sequence images and the corresponding optical flow images are fed into the generator g obtained in step 3, and the output images are the mask images of the moving objects.
The invention relates to an unsupervised moving object detection method based on the information reduction rate. Based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, comprising a generator and a restorer, is constructed from the relationship between their optical flows to discriminate the background from the moving objects; an attention mechanism is introduced, which effectively improves the robustness of the tracking algorithm and reduces the interference of background noise and similar factors on target tracking. The basic idea is as follows: first, a video database is constructed and the videos are preprocessed; then, optical flow information is computed for adjacent frames of each video with PWCNet; next, the preprocessed videos and the corresponding optical flow information are used as the input of the attention-based generative adversarial network to train the network model; finally, for the video sequence to be detected, the generator module of the network model is used to obtain the moving object detection results. Compared with existing unsupervised moving object detection algorithms, the method makes full use of the optical flow information of the object and the background, fuses the feature channels of the moving object through the attention mechanism, reduces background interference, and improves moving object detection performance.

Claims (5)

1. An unsupervised moving object detection method based on the information reduction rate, characterized by comprising the following steps:
Step 1: acquiring a video sequence with a camera, preprocessing it, and constructing a database;
Step 2: computing the optical flow images corresponding to the video sequence with the trained PWCNet and normalizing them;
Step 3: training a generative adversarial network model with the video sequence and its corresponding optical flow images as input;
Step 4: processing the video sequence to be detected according to steps 1 to 2;
Step 5: extracting the generator module of the trained generative adversarial network model and detecting the moving objects in the video sequence to be detected.
2. The unsupervised moving object detection method based on the information reduction rate according to claim 1, characterized in that training the generative adversarial network model in step 3 specifically comprises the following steps:
Step 3.1, distinguishing the moving object from the background:
for a frame image I of the video sequence, let the image domain be D, the image region of the moving object be Ω, and the background be $\Omega^c = D \setminus \Omega$; the optical flow from the current frame to an adjacent frame is u, the adjacent frame being the previous frame or the next frame; the optical flow represents the apparent motion of the image brightness pattern and contains information about the surface structure and dynamic behavior of objects; using $\mathcal{I}(\cdot;\cdot \mid I)$ to denote the mutual information of two random variables, and given the optical flows $u_i$, $u_j$ at position i and position j in image I, the foreground Ω is formalized as the region whose mutual information with the background is 0:

$$\Omega = \{\, i \in D : \mathcal{I}(u_i; u_j \mid I) = 0,\ \forall j \in \Omega^c \,\}$$

where the mutual information $\mathcal{I}(u_i; u_j \mid I)$ represents the amount of information that the optical flow $u_j$ at position j in the given image I provides about the optical flow $u_i$ at position i, and the larger the mutual information, the more information is provided;
the Shannon entropy $H(u_i \mid I)$ represents the uncertainty of $u_i$; the greater the uncertainty of the variable, the larger the entropy, and its value is always greater than 0; $H(u_i \mid u_j, I)$ represents the uncertainty of $u_i$ given $u_j$;
Step 3.2, loss function based on the information reduction rate:
according to the foreground and background defined above, and combining Shannon's information entropy theory, an information reduction rate is defined to construct the optimization objective; taking two subsets of D, a region x and a region y, as input, with the optical flows of region x and region y being $u_x$ and $u_y$ respectively, the information reduction rate γ is defined as follows:

$$\gamma(x \mid y; I) = \frac{\mathcal{I}(u_x; u_y \mid I)}{H(u_x \mid I)} = 1 - \frac{H(u_x \mid u_y, I)}{H(u_x \mid I)}$$

where $\mathcal{I}(u_x; u_y \mid I)$ represents the amount of information that the optical flow $u_y$ of region y in the given image I can provide about the optical flow $u_x$ of region x; the Shannon entropy $H(u_x \mid I)$ represents the uncertainty of $u_x$; $H(u_x \mid u_y, I)$ represents the uncertainty of $u_x$ given $u_y$;
γ(x|y; I) represents the relative reduction of the uncertainty of $u_x$ when $u_y$ is known, and its value lies between 0 and 1; when $u_x$ and $u_y$ are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0; writing the optical flow in the target image region Ω as $u^{in} = \{u_i, i \in \Omega\}$ and in the background region $\Omega^c$ as $u^{out} = \{u_j, j \in \Omega^c\}$, we have:

$$\gamma(\Omega \mid \Omega^c; I) = 1 - \frac{H(u^{in} \mid u^{out}, I)}{H(u^{in} \mid I)} = 1 - \frac{\mathbb{E}\big[-\log P(u^{in} \mid u^{out}, I)\big]}{\mathbb{E}\big[-\log P(u^{in} \mid I)\big]}$$

where $P(u^{in} \mid I)$ represents the probability that the optical flow is foreground optical flow and $P(u^{in} \mid u^{out}, I)$ represents the probability of $u^{in}$ given $u^{out}$; a loss function $\mathcal{L}(\Omega; I)$ is defined from the information reduction rate; when $\mathcal{L}(\Omega; I)$ is minimal, the optical flow of the background is sufficient to predict the foreground;
the following strict assumption is made about the model:

$$H(u^{in} \mid u^{out}, I) \simeq \frac{1}{2\sigma^2}\,\mathbb{E}\big[\, \| u^{in} - \phi(\Omega, u^{out}, I) \|_2^2 \,\big] + \mathrm{const.}$$

where $\phi(\Omega, u^{out}, I) = \int u^{in}\, \mathrm{d}P(u^{in} \mid u^{out}, I)$, $\|\cdot\|_2$ denotes the vector norm, and σ² denotes the variance;
meanwhile, a function χ is introduced to represent D, Ω and $\Omega^c$:

$$\chi: D \to \{0, 1\}, \qquad \chi(i) = \begin{cases} 1, & i \in \Omega \\ 0, & i \in \Omega^c \end{cases}$$

so the optical flow flowing into Ω is $u_i^{in} = \chi\, u_i$ and the flow flowing out is $u_i^{out} = (1 - \chi)\, u_i$;
finally, χ and φ are selected as classes of functions parameterized by convolutional neural networks, with parameters denoted w and corresponding functions $\chi_{w_2}$ and $\phi_{w_1}$; the constant term of the loss function $\mathcal{L}$ is omitted and the remainder is converted into the negative of the original loss, which gives the final loss function $\mathcal{L}(w_1, w_2)$:

$$\mathcal{L}(w_1, w_2) = \mathbb{E}_I\Big[\, \big\| \chi_{w_2}\, u - \phi_{w_1}\big((1 - \chi_{w_2})\,u,\ I\big) \big\|_2^2 \,\Big]$$

where $\phi_{w_1}$ is the restorer i, which minimizes the above expression, and $w_1$ is the parameter of the restorer i; $\chi_{w_2}$ is the generator g, selected so that $u_i^{out}$ does not provide information about $u_i^{in}$, i.e. so that the above expression is maximized, and $w_2$ is the parameter of the generator g; I is the image;
the final optimization objective is expressed in the following form:

$$\mathcal{L}^* = \max_{w_2} \min_{w_1} \mathcal{L}(w_1, w_2)$$
Step 3.3, constructing the generator g and the restorer i, which together form the generative adversarial network, and solving the optimization problem in step 3.2; the generator g is used to generate the optical flow mask image of the moving object; the restorer i, with the CPN as its basic network architecture, restores the optical flow information inside the mask region from the mask image generated by the generator g and the corresponding optical flow image;
Step 3.4, training the constructed generative adversarial network with the DAVIS2016 data set to obtain the final generative adversarial network model.
3. The unsupervised moving object detection method based on the information reduction rate according to claim 2, characterized in that the generator g and the restorer i in step 3.3 together form the generative adversarial network, and the specific model is as follows:
1) the generator g takes as input an RGB image I_t and its corresponding optical flow u_{t:t+δT}, and outputs the mask image of the moving object, where δT is sampled uniformly at random from U[-5, 5] with δT ≠ 0, which introduces more information about how the optical flow of image I_t changes; the generator g consists of an encoder and a decoder; the encoder consists of 5 convolutional layers, each followed by a BN layer, and each convolutional layer reduces its input to 1/4 of its size; the encoder is followed by 4 dilated convolutional layers with gradually increasing dilation rates of 2, 4, 8 and 16; the decoder consists of 5 convolutional layers and generates, through upsampling, a mask image of the same size as the input image;
2) the restorer i takes as input the RGB image I_t and the mask image generated by the generator g, and outputs the optical flow image outside the predicted mask region, i.e. the optical flow image of the background; the encoder of the restorer i comprises two branches with exactly the same structure and parameters, each consisting of 9 convolutional layers with LeakyReLU as the activation function after each convolutional layer; one branch takes the normalized frame image as input, and the other branch takes the optical flow image and the mask image generated by the generator as input; the features encoded by the two branches are connected by the concatenation operation concat and passed to the decoder, which mainly consists of deconvolution layers and LeakyReLU activation functions, and a skip structure is used to fuse upsampled deep features with shallow features; the final output is an optical flow image of the same size as the input image.
4. The unsupervised moving object detection method based on the information reduction rate according to claim 3, characterized in that the encoders of the generator g and the restorer i of the generative adversarial network model in step 3 introduce a lightweight attention mechanism, the attention module comprising channel attention, spatial attention and global attention:
1) channel attention comprises three operations: squeeze, excitation and recalibration; first, for an input feature map F of size h × w × c, the squeeze operation compresses the input features along the spatial dimensions to obtain a feature vector s of size 1 × 1 × c representing the global features of the channels, each element of which corresponds to one channel of the feature map; in practice this is a global pooling of each feature map; then, the excitation operation establishes the correlations among the channels, the correlations of the c channels are learned with a weight w, and a channel weight e of size 1 × 1 × c is obtained, implemented by a 1 × 1 convolution operation; finally, the recalibration operation multiplies the channel weights with the original input feature map to obtain the weighted output feature map F'_C;
2) for the feature map F'_C, two feature matrices F_MAX and F_AVG are generated with max pooling and average pooling operations, respectively; the two feature matrices are then fused to obtain a fused feature map F_MA, which is processed by a Sigmoid activation function to obtain the spatial attention weight W, the fusion operation comprising concatenating the feature matrices along the channel dimension and then applying a convolution; finally, the spatial attention weight matrix W is multiplied with the original input feature map F to obtain the weighted output feature map F'_S;
3) the squeeze operation of the global attention is the same as that of the channel attention, while the excitation operation is replaced by 4 consecutive operations: fc(2C/16) → ReLU → fc(1) → Sigmoid, where fc(·) denotes a fully connected operation, C is the number of channels, and ReLU and Sigmoid are activation functions; the excitation produces a scale selection factor μ; from the output F'_S produced by the spatial attention mechanism and the scale selection factor μ, the scale-sensitive feature F'_G is computed as shown in the following formula:
F'_G = F + (μ * F'_S)
where the identity mapping term F is added in order to avoid losing important information in regions whose attention values are close to 0.
5. The unsupervised moving object detection method based on the information reduction rate according to claim 4, characterized in that in steps 4 to 5, the generator module of the trained generative adversarial network model is extracted to detect the moving objects in the video sequence to be detected, with the following specific steps:
first, the preprocessing of step 1 is applied to the video sequence to be detected;
then, the corresponding optical flow images are computed according to the method of step 2;
finally, the preprocessed video sequence images and the corresponding optical flow images are fed into the generator g obtained in step 3, and the output images are the prediction results for the moving objects.
CN202111510928.0A 2021-12-10 2021-12-10 Unsupervised moving object detection method based on information reduction rate Pending CN114494934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111510928.0A CN114494934A (en) 2021-12-10 2021-12-10 Unsupervised moving object detection method based on information reduction rate

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111510928.0A CN114494934A (en) 2021-12-10 2021-12-10 Unsupervised moving object detection method based on information reduction rate

Publications (1)

Publication Number Publication Date
CN114494934A true CN114494934A (en) 2022-05-13

Family

ID=81492078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111510928.0A Pending CN114494934A (en) 2021-12-10 2021-12-10 Unsupervised moving object detection method based on information reduction rate

Country Status (1)

Country Link
CN (1) CN114494934A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229336A (en) * 2023-05-10 2023-06-06 江西云眼视界科技股份有限公司 Video moving target identification method, system, storage medium and computer
CN116229336B (en) * 2023-05-10 2023-08-18 江西云眼视界科技股份有限公司 Video moving target identification method, system, storage medium and computer

Similar Documents

Publication Publication Date Title
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN108256562B (en) Salient target detection method and system based on weak supervision time-space cascade neural network
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN111523410B (en) Video saliency target detection method based on attention mechanism
CN113628249B (en) RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN107239730B (en) Quaternion deep neural network model method for intelligent automobile traffic sign recognition
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN111444924B (en) Method and system for detecting plant diseases and insect pests and analyzing disaster grade
CN111461129B (en) Context prior-based scene segmentation method and system
Maslov et al. Online supervised attention-based recurrent depth estimation from monocular video
CN114638836A (en) Urban street view segmentation method based on highly effective drive and multi-level feature fusion
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN116012395A (en) Multi-scale fusion smoke segmentation method based on depth separable convolution
CN116402851A (en) Infrared dim target tracking method under complex background
CN110688966B (en) Semantic guidance pedestrian re-recognition method
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN113536977B (en) 360-degree panoramic image-oriented saliency target detection method
CN114494934A (en) Unsupervised moving object detection method based on information reduction rate
Ma et al. A lightweight neural network for crowd analysis of images with congested scenes
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN117522923A (en) Target tracking system and method integrating multi-mode characteristics
CN117197438A (en) Target detection method based on visual saliency
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination