CN114494934A - Unsupervised moving object detection method based on information reduction rate - Google Patents
- Publication number: CN114494934A (application CN202111510928.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- optical flow
- information
- video sequence
- generator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
- G06T7/11 — Image analysis; segmentation; region-based segmentation
- G06T7/269 — Image analysis; analysis of motion using gradient-based methods
- G06T2207/10016 — Indexing scheme for image analysis; image acquisition modality: video; image sequence
Abstract
The invention discloses an unsupervised moving object detection method based on the information reduction rate. The method comprises the following steps: acquire a video sequence with a camera, preprocess it, and build a database; compute the optical flow images corresponding to the video sequence with a trained PWC-Net and normalize them; train a generative adversarial network (GAN) model with the video sequence and its optical flow images as input; apply the same processing to the video sequence to be detected; and extract the generator module of the trained GAN model to detect the moving objects in that sequence. Based on the property that the background image region contains no information about the foreground image region, a GAN model is built from the optical-flow relationship to separate the background from the moving object; the network consists of a generator and a restorer. The feature channels of the moving object are fused through an attention mechanism, which reduces background interference and improves the detection performance for moving objects.
Description
Technical Field
The invention belongs to the field of deep learning in computer vision, and specifically relates to an unsupervised moving object detection method based on the information reduction rate.
Background
Object detection is an important branch of computer vision. Its main purpose is to extract moving objects from a video sequence as the foreground, while the surrounding environment is treated as the background and separated from them. As a cross-disciplinary subject, object detection integrates theory and algorithms from image processing, machine learning, optimization, and other fields, and is a prerequisite and basis for higher-level image-understanding tasks such as behavior recognition. Object detection has great research and application value and is widely used in intelligent video surveillance, intelligent human-computer interaction, intelligent transportation, visual navigation, autonomous driving, autonomous flight, battlefield reconnaissance, and other fields. In recent years, with advances in computing and deep learning, detection models have evolved rapidly and a wide variety of them has emerged.
In object detection, the average overlap ratio (IoU) between the target object and the predicted result is commonly used as the core evaluation criterion. Recent work can be divided into two categories: supervised methods and unsupervised methods. The PDB algorithm is a typical supervised method: it extracts spatial features at multiple scales with a pyramid dilated convolution module, then concatenates them and feeds them to an extended DB-ConvLSTM structure to learn temporal information, obtaining good detection results. Unsupervised methods, by contrast, need no large sets of labeled samples and therefore have great room for development. The SAGE algorithm estimates background and foreground information by generating a spatio-temporal saliency map from geodesic distances between superpixels and edge pixels; however, it relies mainly on image edge features and motion gradients and easily produces noisy regions in scenes with complex texture. The CIS algorithm borrows the idea of generative adversarial networks and distinguishes the background from the moving object through an information reduction rate defined on optical flow, detecting moving objects well. However, for many existing unsupervised algorithms, performance degrades when the target images are affected by viewpoint, illumination, occlusion, background clutter, or sensor noise.
Disclosure of Invention
The invention aims to provide an unsupervised moving object detection method based on the information reduction rate that fully exploits the optical-flow information of the object and the background, fuses the feature channels of the moving object through an attention mechanism, reduces background interference, and improves the detection performance for moving objects.
The technical solution realizing the purpose of the invention is as follows: an unsupervised moving object detection method based on the information reduction rate, comprising the following steps:
Step 1: acquire a video sequence with a camera, preprocess it, and build a database;
Step 2: compute the optical flow images corresponding to the video sequence with a trained PWC-Net and normalize them;
Step 3: train a generative adversarial network model with the video sequence and its corresponding optical flow images as input;
Step 4: process the video sequence to be detected as in steps 1 to 2;
Step 5: extract the generator module of the trained generative adversarial network model and detect the moving objects in the video sequence to be detected.
Compared with the prior art, the invention has the following notable advantages: (1) based on the property that the background image region contains no information about the foreground image region, a generative adversarial network model, consisting of a generator and a restorer, is built from the optical-flow relationship to separate the background from the moving object; the attention mechanism effectively improves robustness and reduces the interference of background noise and similar factors on the target; (2) the optical-flow information of the object and the background is fully exploited, and the feature channels of the moving object are fused through the attention mechanism, which reduces background interference and improves detection performance.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of a basic network architecture of the present invention.
Fig. 3 shows the detection results output by the generator module of the network model for moving objects in part of the video sequences.
FIG. 4 is a network architecture diagram of a generator module of the network model.
Detailed Description
The invention relates to an unsupervised moving object detection method based on the information reduction rate, comprising the following steps:
Step 1: acquire a video sequence with a camera, preprocess it, and build a database;
Step 2: compute the optical flow images corresponding to the video sequence with a trained PWC-Net and normalize them;
Step 3: train a generative adversarial network model with the video sequence and its corresponding optical flow images as input;
Step 4: process the video sequence to be detected as in steps 1 to 2;
Step 5: extract the generator module of the trained generative adversarial network model and detect the moving objects in the video sequence to be detected.
Further, training the generative adversarial network model in step 3 comprises the following steps:
Step 3.1: distinguish the moving object from the background:
based on the principle that the background image region should not contain information of the moving object foreground image region, the image of the region of interest can be interpreted as poorly as possible by learning regions outside the region of interest. Specifically, for a frame of image I of a video sequence, an image region is assumed to be D, an image region of a moving object is assumed to be Ω, and a background is assumed to be ΩcD/omega, which flows to the adjacent frame (last one)Frame or next frame) is u. The optical flow represents apparent motion of an image brightness pattern, and includes important information of the surface structure and dynamic behavior of an object. Use ofRepresenting mutual information of two random variables, given optical flows u at positions I, j in image Ii、ujThe concept of the foreground Ω is formalized as an area with mutual information 0 with the background:
wherein the mutual informationOptical flow u representing a position j in a given image IjOptical flow u provided in relation to position iiThe larger the mutual information value is, the larger the provided information amount is; shannon information entropy H (u)iI) represents uiThe larger the uncertainty of the variable is, the larger the information entropy is, and the value is always larger than 0; h (u)i|ujI) is represented in the known ujUnder the conditions of (a) uiUncertainty of (d);
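The zero-mutual-information criterion above can be illustrated on discretized flows. The following is a toy sketch, not part of the patented method: flows are quantized into integer bins and the entropies are plug-in estimates from samples.

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in estimate of the Shannon entropy H(X) in bits."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def conditional_entropy(xs, ys):
    """Plug-in estimate of H(X | Y) from paired samples."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y); zero iff X and Y are (empirically) independent."""
    return entropy(xs) - conditional_entropy(xs, ys)

# Quantized flow bins: a rigid background predicts itself perfectly, while an
# independently moving foreground tells it nothing.
bg = [0, 1, 0, 1, 0, 1, 0, 1] * 4
fg = [3, 3, 0, 1, 2, 2, 1, 0] * 4
print(mutual_information(bg, bg))  # 1.0 bit: the flow fully predicts itself
print(mutual_information(bg, fg))  # 0.0: independent regions share no information
```

With real flows the distributions are continuous and unknown, which is why the patent replaces these estimates with a learned inpainting network; the toy bins here only make the defining property γ-compatible and checkable.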
Step 3.2: loss function based on the information reduction rate:
According to the foreground and background defined above, and combining the Shannon entropy theory, an information reduction rate is defined to construct the optimization objective. Take two subsets of D, regions x and y, with optical flows u_x and u_y respectively, as input; the information reduction rate γ is defined as
γ(x | y; I) = I(u_x; u_y | I) / H(u_x | I) = 1 − H(u_x | u_y, I) / H(u_x | I),
where I(u_x; u_y | I) is the amount of information the flow u_y of region y in image I provides about the flow u_x of region x, the Shannon entropy H(u_x | I) is the uncertainty of u_x, and H(u_x | u_y, I) is the uncertainty of u_x once u_y is known.
γ(x | y; I) is thus the fraction by which knowing u_y reduces the uncertainty of u_x, and its value lies between 0 and 1; when u_x and u_y are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0. Writing u_in = {u_i, i ∈ Ω} for the flow in the object region Ω and u_out = {u_j, j ∈ Ω^c} for the flow in the background region Ω^c, we obtain
γ(Ω | Ω^c; I) = 1 − log P(u_in | u_out, I) / log P(u_in | I),
where P(u_in | I) is the probability of the foreground flow and P(u_in | u_out, I) the probability of u_in given u_out. The loss is defined from the conditional likelihood, ℓ(Ω; I) = −log P(u_in | u_out, I); when ℓ is minimal, the optical flow of the background suffices to predict the foreground.
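As a toy illustration of the information reduction rate (not the patented training procedure), γ can be estimated on quantized flow samples as 1 − H(u_x|u_y)/H(u_x); the integer bins and plug-in entropy estimators below are assumptions made only for the sketch.

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in estimate of the Shannon entropy H(X) in bits."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def conditional_entropy(xs, ys):
    """Plug-in estimate of H(X | Y) from paired samples."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def info_reduction_rate(xs, ys):
    """gamma(x|y) = I(u_x; u_y) / H(u_x) = 1 - H(u_x|u_y) / H(u_x), in [0, 1]."""
    h = entropy(xs)
    return 0.0 if h == 0 else 1.0 - conditional_entropy(xs, ys) / h

# Quantized flow bins for two regions of an image:
region_x = [0, 1, 0, 1, 0, 1, 0, 1] * 4
region_y_same = region_x                       # same motion: y determines x
region_y_indep = [3, 3, 0, 1, 2, 2, 1, 0] * 4  # independent motion

print(info_reduction_rate(region_x, region_y_same))   # 1.0: fully predictable
print(info_reduction_rate(region_x, region_y_indep))  # 0.0: foreground vs. background
```

γ = 1 means the second region's flow fully determines the first region's flow (same rigid motion); γ = 0 means it tells nothing, which is exactly the foreground/background split the method searches for.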
A strict (Gaussian) assumption is made on the model:
−log P(u_in | u_out, I) = ||u_in − φ(Ω, u_out, I)||₂² / (2σ²) + const, where φ(Ω, y, I) = ∫ u_in dP(u_in | u_out, I), || · ||₂ denotes the vector norm, and σ² the variance.
Meanwhile, a function χ is introduced to represent D, Ω and Ω^c:
χ: D → {0, 1}, with χ(i) = 1 for i ∈ Ω and χ(i) = 0 for i ∈ Ω^c,
so that the flow into Ω is u_i^in = χ u_i and the flow out of it is u_i^out = (1 − χ) u_i.
Finally, χ and φ are taken from classes of functions parameterized by convolutional neural networks; writing w for the parameters, the corresponding functions are χ_{w2} and φ_{w1}. To simplify the representation, the constant term of the loss is dropped, yielding the final loss
L(w1, w2; I) = ||χ_{w2} ⊙ u − φ_{w1}(χ_{w2}, (1 − χ_{w2}) ⊙ u, I)||₂²,
where w1 are the parameters of the restorer i, which minimizes the loss (it tries to recover u_in from the outside flow), and w2 are the parameters of the generator g, which maximizes the loss by choosing Ω so that u^out provides no information about u^in; I is the image.
Step 3.3: construct a generator g and a restorer i, which together form the generative adversarial network and solve the optimization problem of step 3.2; the generator g generates the optical-flow mask image (mask) of the moving object, and the restorer i, with CPN as its basic network architecture, restores the optical flow inside the mask from the mask image generated by g and the corresponding optical flow image;
Step 3.4: train the constructed generative adversarial network on the DAVIS2016 dataset to obtain the final generative adversarial network model.
Further, the generator g and the restorer i in step 3.3 jointly form the generative adversarial network; the specific model is as follows:
1) The generator g takes as input an RGB image I_t and its corresponding optical flow u_{t:t+δT}, and outputs the mask image of the moving object, where δT is sampled uniformly from U[−5, 5] with δT ≠ 0, thereby introducing more varied flow information about image I_t. The generator consists of an encoder and a decoder: the encoder is made of 5 convolutional layers, each followed by a batch-normalization (BN) layer and each reducing its input to 1/4 of its size; behind the encoder are 4 dilated convolution layers with gradually increasing dilation rates of 2, 4, 8 and 16; the decoder consists of 5 convolutional layers and generates, by upsampling, a mask image of the same size as the input image;
2) The restorer i takes as input the RGB image I_t and the mask image generated by the generator g, and outputs the predicted optical flow outside the mask, i.e. the optical flow of the background. Its encoder comprises two branches of identical structure and parameters, each made of 9 convolutional layers with a LeakyReLU activation after each layer; one branch takes the normalized frame image as input, and the other takes the optical flow image together with the mask image generated by the generator. The features encoded by the two branches are connected by a concatenation operation (concat) and passed to the decoder, which consists mainly of deconvolution layers and LeakyReLU activations; skip connections fuse upsampled deep features with shallow features. The final output is an optical flow image of the same size as the input image.
Further, the encoders of the generator g and the restorer i of the generative adversarial network model of step 3 introduce a lightweight attention mechanism; the attention module comprises channel attention, spatial attention, and global attention.
1) Channel attention mainly consists of three operations: squeeze, excite, and recalibrate. First, for an input feature map F of size h × w × c, the squeeze operation compresses the input features along the spatial dimensions to obtain a feature vector s of size 1 × 1 × c representing the global feature of each channel; each element of the vector corresponds to one channel of the feature map, so the operation is in effect a global pooling of each feature map. Then the excite operation establishes the dependencies among the c channels, learning their interactions with a weight w to obtain a channel weight e of size 1 × 1 × c, usually implemented with 1 × 1 convolutions. Finally, the recalibrate operation multiplies the channel weights with the original input feature map to obtain the weighted output feature map F_C'.
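A minimal squeeze–excite–recalibrate sketch in NumPy; the two small matrices standing in for the learned 1 × 1 convolutions and the reduction ratio r are assumptions made for the example, not values from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Squeeze-Excite-Recalibrate on a feature map F of shape (h, w, c).
    Squeeze: global average pool each channel into a c-vector s.
    Excite:  FC -> ReLU -> FC -> sigmoid produces channel weights e in (0, 1).
    Recalibrate: rescale each channel of F by its weight."""
    s = F.mean(axis=(0, 1))                    # (c,) global channel descriptor
    e = sigmoid(np.maximum(s @ W1, 0.0) @ W2)  # (c,) channel weights
    return F * e                               # broadcast over h and w

rng = np.random.default_rng(0)
h, w, c, r = 4, 4, 8, 2                        # r: assumed reduction ratio
F = rng.normal(size=(h, w, c))
W1 = rng.normal(size=(c, c // r))
W2 = rng.normal(size=(c // r, c))
Fc = channel_attention(F, W1, W2)
print(Fc.shape)  # (4, 4, 8): same shape, each channel uniformly rescaled
```

Every spatial position of a given channel is multiplied by the same weight, which is what lets the module emphasize flow-bearing channels and suppress background-dominated ones.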
2) From the feature map F_C', max pooling and average pooling generate two feature matrices F_MAX and F_AVG; the two matrices are then fused into a feature map F_MA, the fusion usually being a simple channel-wise concatenation followed by a convolution, and a Sigmoid activation turns F_MA into the spatial attention weight W. Finally, the spatial attention weight matrix W is multiplied with the original input feature map F to obtain the weighted output feature map F_S'.
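A matching sketch of the spatial branch; to stay minimal, the fusion of the two pooled maps is reduced to a weighted sum standing in for the convolution over the concatenated maps (an assumption of the sketch, not the patent's layer).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(F, fuse_w):
    """Spatial attention on F of shape (h, w, c): pool over channels with max
    and mean, fuse the two (h, w) maps, squash with a sigmoid into per-pixel
    weights in (0, 1), and reweight F spatially."""
    f_max = F.max(axis=2)                      # (h, w) channel-wise max pool
    f_avg = F.mean(axis=2)                     # (h, w) channel-wise average pool
    fused = fuse_w[0] * f_max + fuse_w[1] * f_avg
    W = sigmoid(fused)                         # spatial attention weights
    return F * W[:, :, None]                   # broadcast over channels

rng = np.random.default_rng(1)
F = rng.normal(size=(4, 4, 8))
Fs = spatial_attention(F, np.array([0.5, 0.5]))
print(Fs.shape)  # (4, 4, 8)
```

Dual to the channel branch: here every channel at a given pixel is multiplied by the same weight, so pixels likely to belong to the moving object are amplified across all channels.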
3) The squeeze operation of global attention is the same as in channel attention, while the excitation is replaced by four consecutive operations, fc(2C/16) → ReLU → fc(1) → Sigmoid, which produce a size-selection factor μ; here fc(·) denotes a fully connected layer, C is the number of channels, and ReLU and Sigmoid are activation functions. From the output F_S' of the spatial attention mechanism and the factor μ, the size-sensitive feature F_G' is computed as
F_G' = F + (μ · F_S'),
where the identity term F is added to avoid losing important information in regions whose attention value is close to 0.
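The size-selection factor and the residual combination F_G' = F + μ · F_S' can be sketched likewise; the fully connected weights below are random stand-ins for learned parameters, and the tensor sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_attention(F, Fs, W1, w2):
    """F_G' = F + mu * F_S': a scalar size-selection factor mu in (0, 1) is
    produced by fc(2C/16) -> ReLU -> fc(1) -> sigmoid applied to the squeezed
    (globally pooled) channel descriptor; the identity term F preserves
    information from regions whose attention value is near zero."""
    s = F.mean(axis=(0, 1))                            # squeeze: (C,)
    mu = sigmoid(np.maximum(s @ W1, 0.0) @ w2).item()  # scalar factor
    return F + mu * Fs, mu

rng = np.random.default_rng(2)
C = 16
F = rng.normal(size=(4, 4, C))
Fs = rng.normal(size=(4, 4, C))         # stand-in for the spatial-attention output
W1 = rng.normal(size=(C, 2 * C // 16))  # fc(2C/16)
w2 = rng.normal(size=(2 * C // 16, 1))  # fc(1)
Fg, mu = global_attention(F, Fs, W1, w2)
print(Fg.shape, 0.0 < mu < 1.0)  # (4, 4, 16) True
```

Because μ comes from a sigmoid, the module can only blend between the identity path F (μ → 0) and the fully attended features F + F_S' (μ → 1), never discard F entirely.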
Further, in steps 4 to 5, the generator module of the trained generative adversarial network model detects the moving objects in the video sequence to be detected; the specific steps are as follows:
First, the preprocessing of step 1 is applied to the video sequence to be detected;
then the corresponding optical flow images are computed as in step 2;
finally, the preprocessed video frames and their corresponding optical flow images are fed to the generator g obtained in step 3, and the output images are the predictions of the moving objects.
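The three inference steps above can be sketched as a plain pipeline; `preprocess`, `compute_flow`, and `generator` are hypothetical callables standing in for the preprocessing of step 1, PWC-Net, and the trained generator g.

```python
def detect_moving_objects(frames, preprocess, compute_flow, generator):
    """Steps 4-5 as a pipeline: preprocess each frame, compute its optical
    flow to the next frame, and feed both to the generator, whose output is
    the predicted moving-object mask for that frame."""
    masks = []
    for t in range(len(frames) - 1):
        img = preprocess(frames[t])
        flow = compute_flow(frames[t], frames[t + 1])
        masks.append(generator(img, flow))
    return masks

# Toy stand-ins to show the data flow only (not real models):
frames = [1, 2, 4]
masks = detect_moving_objects(
    frames,
    preprocess=lambda f: f * 10,      # e.g. equalization + normalization
    compute_flow=lambda a, b: b - a,  # e.g. PWC-Net between adjacent frames
    generator=lambda img, flow: (img, flow),
)
print(masks)  # [(10, 1), (20, 2)]
```

Only the generator g is needed at inference time; the restorer i exists solely to provide the adversarial training signal.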
The invention is further illustrated in the accompanying drawings, which are provided to aid understanding and are not intended to limit the scope of the invention; various equivalent modifications will occur to those skilled in the art after reading this specification, and all of them fall within the scope defined by the appended claims.
Examples
The invention provides an unsupervised moving object detection method with an attention mechanism. Based on the property that the background image region contains no information about the foreground image region, a generative adversarial network, consisting of a generator and a restorer, is built from the optical-flow relationship to separate the background from the moving object; the attention mechanism effectively improves robustness and reduces the interference of background noise and similar factors on the target. The basic idea is as follows: first, build a video database and preprocess the videos; then compute the optical flow between adjacent frames of each video with PWC-Net; next, train the attention-based generative adversarial network with the preprocessed videos and their corresponding optical flow as input; finally, for a video sequence to be detected, obtain the detection result of the moving objects with the generator module of the network model.
As shown in fig. 1, the implementation of the invention mainly comprises four steps: (1) preprocess the video sequence; (2) obtain the optical flow images of the video sequence with PWC-Net; (3) train the generative adversarial network with the video sequence and its corresponding optical flow images as input; (4) detect the moving objects in the video sequence with the generator module of the trained network model and output the detection results.
Step one: acquire a video sequence with a camera, preprocess it, and build a database.
Because video sequences captured in natural scenes may suffer from uneven illumination and similar disturbances, the video sequence is preprocessed, mainly by histogram equalization and normalization.
step two: obtaining an optical flow image of the video sequence through PWCNet;
given the optical flow u: D of the image I to be measured for one frame down (up)1→R2Is thatToTo (3) is performed. PWCNet is a high-performance optical flow learning network, which can efficiently acquire optical flow information of video sequences. The invention adopts PWCNet to calculate optical flow information and carries out normalization, wherein the normalization operation mainly comprises the steps of adjusting the optical flow image to be the same as the video sequence, and then dividing the optical flow image by a constant, namely, reducing the value of the optical flow image in equal proportion to accelerate the training of the network.
Step three: train the generative adversarial network with the video sequence and its corresponding optical flow images as input.
and 3.1, distinguishing the moving object from the background. Based on the principle that the background image region should not contain information of the moving object foreground image region, the image of the region of interest can be interpreted as poorly as possible by learning regions outside the region of interest. Specifically, for a frame of image I of a video sequence, an image region is assumed to be D, an image region of a moving object is assumed to be Ω, and a background is assumed to be ΩcD/Ω, and the optical flow to an adjacent frame (the previous frame or the next frame) is u. The optical flow represents the apparent motion of the image brightness pattern, and contains important information of the surface structure and dynamic behavior of the object. Use ofRepresenting mutual information of two random variables, given optical flows u at two locations in an image Ii、ujThe concept of foreground Ω can be formalized as an area with 0 mutual information with the background: :
wherein the mutual informationRepresenting the optical flow ujProvided with respect to predicted optical flow uiThe larger the value, the larger the amount of information provided;the information entropy is represented and used for quantifying the size of the information quantity, the larger the uncertainty of the variable is, the larger the information entropy is, and the value is always larger than 0.
Step 3.2: loss function based on the information reduction rate. According to the foreground and background defined above, and combining the Shannon entropy theory, an information reduction rate is defined to construct the optimization objective. With two subsets (regions) x, y of D as input, the information reduction rate γ is defined as
γ(x | y; I) = I(u_x; u_y | I) / H(u_x | I) = 1 − H(u_x | u_y, I) / H(u_x | I),
where I(u_x; u_y | I) is the amount of information the flow u_y provides about the predicted flow u_x, the Shannon entropy H(u_x | I) is the uncertainty of u_x, and H(u_x | u_y, I) is the uncertainty of u_x once u_y is known; γ(x | y; I), the fraction by which knowing u_y reduces the uncertainty of u_x, lies between 0 and 1. In particular, when u_x and u_y are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0. Writing u_in = {u_i, i ∈ Ω} for the flow in the object region Ω and u_out = {u_j, j ∈ Ω^c} for the flow in the background region Ω^c, we obtain
γ(Ω | Ω^c; I) = 1 − log P(u_in | u_out, I) / log P(u_in | I),
where P(u_in | I) is the probability of the foreground flow and P(u_in | u_out, I) the probability of u_in given u_out. The loss is defined from the conditional likelihood, ℓ(Ω; I) = −log P(u_in | u_out, I); when ℓ is minimal, the optical flow of the background suffices to predict the foreground. A strict (Gaussian) assumption is made on the model:
−log P(u_in | u_out, I) = ||u_in − φ(Ω, u_out, I)||₂² / (2σ²) + const, where φ(Ω, y, I) = ∫ u_in dP(u_in | u_out, I), || · ||₂ denotes the vector norm, and σ² the variance. Meanwhile, a function χ is introduced to represent D, Ω and Ω^c: χ: D → {0, 1}, with χ(i) = 1 for i ∈ Ω and χ(i) = 0 for i ∈ Ω^c, so that the flow into Ω is u_i^in = χ u_i and the flow out of it is u_i^out = (1 − χ) u_i.
Finally, χ and φ are taken from classes of functions parameterized by convolutional neural networks; writing w for the parameters, the corresponding functions are χ_{w2} and φ_{w1}. To simplify the representation, the constant term of the loss is dropped, yielding the final loss
L(w1, w2; I) = ||χ_{w2} ⊙ u − φ_{w1}(χ_{w2}, (1 − χ_{w2}) ⊙ u, I)||₂²,
which the restorer i minimizes over its parameters w1 (it tries to recover u_in from the outside flow), while the generator g maximizes it over its parameters w2 by choosing a mask whose outside flow u^out provides no information about the inside flow u^in; I is the image.
Step 3.3: construct a generator g and a restorer i, which together form the generative adversarial network and effectively solve the optimization problem of step 3.2. The generator g comprises an encoder and a decoder and generates the optical-flow mask image of the moving object; its network structure and parameters are shown in Table 1. The restorer i also comprises an encoder and a decoder and restores, from the mask image generated by g, the optical flow outside the mask; its network structure and parameters are shown in Table 2.
1) The generator g takes as input an RGB image I_t and its corresponding optical flow u_{t:t+δT}, and outputs the mask image of the moving object, where δT is sampled uniformly from U[−5, 5] with δT ≠ 0, which introduces more varied flow information about image I_t. The generator consists of an encoder and a decoder: the encoder is made of 5 convolutional layers, each followed by a batch-normalization layer and each reducing its input to 1/4 of its size; behind the encoder are 4 dilated convolution layers with gradually increasing dilation rates of 2, 4, 8 and 16; the decoder consists of 5 convolutional layers and generates, by upsampling, a mask image of the same size as the input image;
2) The restorer i takes as input the RGB image I_t and the mask image generated by the generator g, and outputs the predicted optical flow outside the mask, i.e. the optical flow of the background. Its encoder comprises two branches of identical structure and parameters, each made of 9 convolutional layers with a LeakyReLU activation after each layer; one branch takes the normalized frame image as input, and the other takes the optical flow image together with the mask image generated by the generator. The features encoded by the two branches are connected by a concatenation operation (concat) and passed to the decoder, which consists mainly of deconvolution layers and LeakyReLU activations; skip connections fuse upsampled deep features with shallow features. The final output is an optical flow image of the same size as the input image.
TABLE 1 Generator network parameters
Note 1: every convolutional layer is followed by a batch-normalization layer (not shown).
Note 2: a dilated convolution inserts rate − 1 zeros between adjacent kernel elements, which enlarges the receptive field and captures multi-scale context information.
Note 3: the deconvolution layers restore the signal and perform upsampling.
Note 4: attention modules are added after convolutional layers 2-3, 4-5, 7-10 and 11-12 to reduce the interference of background noise.
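Note 2's receptive-field effect can be checked with a small calculator: a dilated convolution behaves like a kernel of effective size d·(k − 1) + 1. The 3 × 3 kernels and stride 1 assumed below are illustrative, not values stated in the source.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers given as tuples of
    (kernel, stride, dilation). A dilated conv inserts d-1 zeros between
    taps, so its effective kernel size is d*(k-1)+1."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1
        rf += (k_eff - 1) * jump
        jump *= s
    return rf

# Four stride-1 convolutions with dilation rates 2, 4, 8, 16, as placed after
# the generator's encoder (kernel size 3 assumed):
dilated_stack = [(3, 1, 2), (3, 1, 4), (3, 1, 8), (3, 1, 16)]
plain_stack   = [(3, 1, 1)] * 4
print(receptive_field(dilated_stack))  # 61
print(receptive_field(plain_stack))    # 9
```

The dilated stack sees a 61-pixel-wide context with the same parameter count as the plain stack's 9, which is the "multi-scale context" the note refers to.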
Table 2 restorer network parameters
Step 3.4: train the constructed generative adversarial network on the training dataset to obtain the final network model.
Step four: detect the moving objects in the video sequence with the generator g of the trained network model.
First, the preprocessing of step one is applied to the video sequence to be detected; then the corresponding optical flow images are computed as in step two; finally, the preprocessed video frames and their corresponding optical flow images are fed to the generator g obtained in step three, and the output images are the mask images of the moving objects.
The invention relates to an unsupervised moving object detection method based on the information reduction rate. Based on the property that the background image region contains no information about the foreground image region, a generative adversarial network, consisting of a generator and a restorer, is built from the optical-flow relationship to separate the background from the moving object; the attention mechanism effectively improves robustness and reduces the interference of background noise and similar factors on the target. The basic idea is as follows: first, build a video database and preprocess the videos; then compute the optical flow between adjacent frames of each video with PWC-Net; next, train the attention-based generative adversarial network with the preprocessed videos and their corresponding optical flow as input; finally, for a video sequence to be detected, obtain the detection result of the moving objects with the generator module of the network model. Compared with existing unsupervised moving object detection algorithms, the method fully exploits the optical-flow information of the object and the background and fuses the feature channels of the moving object through the attention mechanism, reducing background interference and improving the detection performance for moving objects.
Claims (5)
1. An unsupervised moving object detection method based on information reduction rate is characterized by comprising the following steps:
Step 1: acquiring video sequences through a camera, preprocessing them, and constructing a database;
Step 2: computing the optical flow images corresponding to the video sequences with a trained PWCNet, and normalizing them;
Step 3: training a generative adversarial network model, taking the video sequences and their corresponding optical flow images as input;
Step 4: applying steps 1 to 2 to the video sequence to be detected;
Step 5: extracting the generator module of the trained generative adversarial network model, and detecting the moving targets in the video sequence to be detected.
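The normalization in Step 2 is not specified further in the claims; the following is a minimal sketch assuming per-channel min-max scaling of the 2-channel flow field (a common choice, not confirmed by the source):

```python
import numpy as np

def normalize_flow(flow):
    """Scale a 2-channel optical-flow field of shape (H, W, 2) to [0, 1],
    channel by channel. Min-max scaling is an assumption; the claim only
    says the flow is 'normalized'."""
    flow = flow.astype(np.float64)
    mins = flow.min(axis=(0, 1), keepdims=True)
    maxs = flow.max(axis=(0, 1), keepdims=True)
    # Guard against constant channels (zero range) to avoid divide-by-zero.
    span = np.where(maxs - mins > 0, maxs - mins, 1.0)
    return (flow - mins) / span
```

A constant channel (no motion in that direction) maps to all zeros rather than producing NaNs.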
2. The unsupervised moving object detection method based on information reduction rate as claimed in claim 1, wherein training the generative adversarial network model in step 3 specifically comprises the following steps:
Step 3.1: distinguishing the moving target from the background:
For one frame image I of the video sequence, let the image domain be D, the image region of the moving target be Ω, and the background be Ω^c = D \ Ω; let u be the optical flow from the current frame to an adjacent frame, the adjacent frame being the previous or the next frame. The optical flow represents the apparent motion of the image brightness pattern and carries information about the surface structure and dynamic behavior of objects. Using I(u_i; u_j | I) to denote the mutual information of two random variables, and given the optical flows u_i, u_j at positions i and j in image I, the concept of the foreground Ω is formalized as a region with zero mutual information with the background:

Ω = { i ∈ D : I(u_i; u_j | I) = 0, for all j ∈ Ω^c }

wherein the mutual information

I(u_i; u_j | I) = H(u_i | I) − H(u_i | u_j, I)

represents the amount of information that the optical flow u_j at position j in image I provides about the optical flow u_i at position i; the larger the mutual information, the larger the amount of information provided;
the Shannon entropy H(u_i | I) represents the uncertainty of u_i: the larger the uncertainty of the variable, the larger the entropy, and its value is always nonnegative; H(u_i | u_j, I) represents the uncertainty of u_i when u_j is known;
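The quantities of step 3.1 can be illustrated with a small discrete example; this is only a sketch, in which a discrete joint probability table stands in for the continuous flow statistics of the claim:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution; always >= 0."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mutual_information(joint):
    """I(X; Y) = H(X) - H(X|Y), computed from a joint table p(x, y)."""
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    h_x = entropy(px)
    # H(X|Y) = sum_y p(y) * H(X | Y=y)
    h_x_given_y = sum(
        py[j] * entropy(joint[:, j] / py[j])
        for j in range(joint.shape[1]) if py[j] > 0
    )
    return h_x - h_x_given_y
```

For an independent pair (foreground vs. background in the claim's formalization) the mutual information is 0; for a deterministic relation it equals the full entropy H(X).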
Step 3.2: loss function based on the information reduction rate:
Based on the foreground and background defined above, combined with Shannon entropy theory, the information reduction rate is defined to construct the optimization objective. Taking two subsets of D, region x and region y, whose optical flows are u_x and u_y respectively, the information reduction rate γ is defined as follows:

γ(x | y; I) = I(u_x; u_y | I) / H(u_x | I) = 1 − H(u_x | u_y, I) / H(u_x | I)

wherein I(u_x; u_y | I) is the amount of information that the optical flow u_y of region y in image I can provide about the optical flow u_x of region x; the Shannon entropy H(u_x | I) represents the uncertainty of u_x; H(u_x | u_y, I) represents the uncertainty of u_x when u_y is known;
γ(x | y; I) thus represents the fraction by which the uncertainty of u_x is reduced when u_y is known, and its value lies between 0 and 1; when u_x and u_y are independent, i.e. one belongs to the foreground and the other to the background image region, γ = 0. Denoting the optical flow in the target image region Ω by u_in = {u_i, i ∈ Ω} and that of the background region Ω^c by u_out = {u_j, j ∈ Ω^c}, there is therefore:

γ(Ω | Ω^c; I) = 1 − H(u_in | u_out, I) / H(u_in | I) = 1 − E[log P(u_in | u_out, I)] / E[log P(u_in | I)]
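As an illustrative sketch of the definition above, the information reduction rate for a discrete joint distribution (a stand-in for the continuous flow statistics) can be computed directly from the two entropies:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def reduction_rate(joint):
    """gamma(x|y) = 1 - H(x|y)/H(x): 0 when x and y are independent
    (one foreground, one background), 1 when y fully determines x."""
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    h_x = entropy(px)
    h_x_given_y = sum(
        py[j] * entropy(joint[:, j] / py[j])
        for j in range(joint.shape[1]) if py[j] > 0
    )
    return 1.0 - h_x_given_y / h_x
```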
wherein P(u_in | I) represents the probability of the foreground optical flow, and P(u_in | u_out, I) represents the probability of u_in when u_out is known; the loss function is defined as

ℓ(Ω; I) = −log P(u_in | u_out, I)

when ℓ(Ω; I) is at a minimum, the optical flow of the background is sufficient to predict that of the foreground;
A strict assumption is made about the model as follows:

P(u_in | u_out, I) ∝ exp( −‖u_in − φ(Ω, u_out, I)‖₂² / (2σ²) )

wherein φ(Ω, u_out, I) = ∫ u_in dP(u_in | u_out, I); ‖·‖₂ represents the vector norm and σ represents the variance;
Meanwhile, an indicator function χ is introduced to express D, Ω and Ω^c:

χ_i = 1 for i ∈ Ω;  χ_i = 0 for i ∈ Ω^c

Therefore, the optical flow into Ω is represented by u_i^in = χ_i · u_i, and the flow out of it by u_i^out = (1 − χ_i) · u_i;
Finally, χ and φ are selected from classes of parametric functions realized by convolutional neural networks, with w representing the parameters and the corresponding functions being χ_w and φ_w;
Substituting the above Gaussian assumption into the loss function ℓ(Ω; I) and omitting the constant terms, the final loss function is obtained:

L(w1, w2) = ‖u_in − φ_{w1}(Ω, u_out, I)‖₂²,  with u_in = χ_{w2} ⊙ u, u_out = (1 − χ_{w2}) ⊙ u

wherein the restorer i minimizes the above expression, w1 being the parameters of the restorer i; the generator g selects Ω (through χ_{w2}) so that u_out provides no information about u_in, which maximizes the above expression, w2 being the parameters of the generator g; I is an image;
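The following is a minimal numerical sketch of one evaluation of this min-max objective, with a caller-supplied stand-in for the restorer φ_{w1} (the real restorer is the CPN-based network of step 3.3):

```python
import numpy as np

def adversarial_flow_loss(u, chi, restorer):
    """One evaluation of the masked-inpainting objective sketched above.

    u        : (H, W, 2) optical flow field
    chi      : (H, W) soft mask in [0, 1] produced by the generator (chi_w2)
    restorer : callable mapping the background flow (1 - chi) * u to a
               full-frame flow estimate; a placeholder for phi_w1.

    The restorer is trained to minimize this value, the generator to
    maximize it.
    """
    u_out = (1.0 - chi)[..., None] * u       # flow visible to the restorer
    u_hat = restorer(u_out)                  # inpainted flow, shape (H, W, 2)
    residual = chi[..., None] * (u - u_hat)  # error counted only inside the mask
    return float((residual ** 2).sum())
```

With an empty mask the loss is trivially zero; with a full mask the restorer sees nothing and pays the full reconstruction error.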
Step 3.3: constructing the generator g and the restorer i, which together form the generative adversarial network and solve the optimization problem of step 3.2; the generator g is used to generate the optical-flow mask image (mask) of the moving target; the restorer i, taking the CPN as its basic network architecture, restores the optical flow information inside the mask according to the mask image generated by the generator g and the corresponding optical flow image;
Step 3.4: training the constructed generative adversarial network with the DAVIS 2016 dataset to obtain the final generative adversarial network model.
3. The unsupervised moving object detection method based on information reduction rate as claimed in claim 2, wherein the generator g and the restorer i in step 3.3 together form a generative adversarial network, the concrete model being as follows:
1) The generator g takes as input an RGB image I_t and its corresponding optical flow u_{t:t+δT}, and outputs the mask image of the moving target, where δT is uniformly distributed in U[−5, 5] with δT ≠ 0, thereby introducing more information about the optical-flow changes of image I_t; the generator g consists of an encoder and a decoder; the encoder part consists of 5 convolutional layers, each followed by a BN layer, and each convolutional layer reduces its input feature map to 1/4 of its size; the encoder is followed by 4 dilated convolution layers with gradually increasing dilation rates of 2, 4, 8 and 16; the decoder part consists of 5 convolutional layers and generates, through upsampling, a mask image of the same size as the input image;
2) The restorer i takes as input the RGB image I_t and the mask image generated by the generator g, and outputs the optical-flow image outside the predicted mask, i.e. the optical flow of the background; the encoder part of the restorer i contains two branches whose structures and parameters are identical, each consisting of 9 convolutional layers with LeakyReLU as the activation function after each convolutional layer; one network branch takes the normalized frame image as input, the other takes the optical flow image and the mask image generated by the generator as input; the features encoded by the two network branches are joined by a concatenation operation (concat) and then passed to the decoder, which consists mainly of deconvolution layers and LeakyReLU activation functions, while a skip structure upsamples deep features and fuses them with shallow features; the final output is an optical-flow image of the same size as the input image.
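The four dilated convolutions in the generator grow the receptive field rapidly without further downsampling; the following small sketch computes that growth, assuming 3 × 3 kernels with stride 1 (the kernel size is not stated in the claim):

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of stacked stride-1 dilated convolutions.

    Each layer adds dilation * (kernel - 1) pixels to the field, so the
    rates 2, 4, 8, 16 of the claim add context cheaply, with no extra
    downsampling. The 3x3 kernel is an assumption for illustration.
    """
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)
    return rf
```

With rates 2, 4, 8, 16 the receptive field reaches 61 × 61 pixels after only four layers, versus 9 × 9 for four plain 3 × 3 convolutions.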
4. The unsupervised moving object detection method based on information reduction rate as claimed in claim 3, wherein the encoder parts of the generator g and the restorer i of the generative adversarial network model in step 3 introduce a lightweight attention mechanism, the attention module comprising channel attention, spatial attention and global attention:
1) Channel attention comprises three operations: squeeze, excitation and recalibration; first, for an input feature map F of size h × w × c, the squeeze operation compresses the input features along the spatial dimensions to obtain a feature vector s of size 1 × 1 × c representing the global features of the channels, each element of the vector corresponding to one channel of the feature map; this is in effect a global pooling of each feature map; then, the excitation operation establishes the correlations between channels, learning the direct correlations of the c channels with the weights w to obtain a channel weight e of size 1 × 1 × c, realized by a 1 × 1 convolution operation; finally, the recalibration operation multiplies the channel weights with the original input feature map to obtain the weighted output feature map F′_C;
2) For the feature map F′_C, max pooling and average pooling operations generate two feature matrices F_MAX and F_AVG respectively; then the two feature matrices are fused to obtain the fused feature map F_MA, the fusion operation consisting of concatenating the feature matrices along the channel dimension followed by a convolution operation; a Sigmoid activation then yields the spatial attention weight W; finally, the spatial attention weight matrix W is multiplied with the original input feature map F to obtain the weighted output feature map F′_S;
3) The squeeze operation of global attention is the same as that of channel attention; the excitation operation is replaced by 4 consecutive operations: fc(2C/16) → ReLU → fc(1) → Sigmoid, the excitation generating a size selection factor μ, wherein fc(·) represents a fully connected operation, C is the number of channels, and ReLU and Sigmoid are activation functions; from the output F′_S produced by the spatial attention mechanism and the size selection factor μ, the size-sensitive feature F′_G is computed as shown in the following formula:

F′_G = F + (μ * F′_S)

wherein the identity mapping term F is added in order to avoid losing important information in regions whose attention values are close to 0.
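A NumPy sketch of the three attention operations of claim 4, with plain arrays standing in for the learned layers (the weight matrix `w` and the fusion callable `fuse` are placeholders for the trained 1 × 1 convolution and fusion convolution, which are assumptions of this sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, w):
    """Squeeze -> excite -> recalibrate (item 1).

    F: (h, w, c) feature map; w: (c, c) placeholder for the learned
    1x1-convolution weights of the excitation step.
    """
    s = F.mean(axis=(0, 1))   # squeeze: global pooling, shape (c,)
    e = sigmoid(s @ w)        # excite: per-channel weights, shape (c,)
    return F * e              # recalibrate: reweight each channel

def spatial_attention(Fc, fuse):
    """Channel-wise max/avg pooling, fusion, Sigmoid -> spatial weight W (item 2)."""
    f_max = Fc.max(axis=2)
    f_avg = Fc.mean(axis=2)
    W = sigmoid(fuse(np.stack([f_max, f_avg], axis=2)))  # (h, w)
    return Fc * W[..., None]

def global_attention(F, Fs, mu):
    """F'_G = F + mu * F'_S; the identity term F preserves regions whose
    attention values are close to 0 (item 3)."""
    return F + mu * Fs
```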
5. The unsupervised moving object detection method based on information reduction rate as claimed in claim 4, wherein in steps 4-5 the generator module of the trained generative adversarial network model is extracted to detect the moving targets in the video sequence to be detected, the concrete steps being as follows:
First, the preprocessing operation of step 1 is performed on the video sequence to be detected;
then, the corresponding optical flow images are computed according to the method of step 2;
finally, the preprocessed video sequence images and the corresponding optical flow images are input into the generator g obtained in step 3, and the obtained output images are the prediction results of the moving targets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111510928.0A CN114494934A (en) | 2021-12-10 | 2021-12-10 | Unsupervised moving object detection method based on information reduction rate |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114494934A true CN114494934A (en) | 2022-05-13 |
Family
ID=81492078
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111510928.0A Pending CN114494934A (en) | 2021-12-10 | 2021-12-10 | Unsupervised moving object detection method based on information reduction rate |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114494934A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116229336A (en) * | 2023-05-10 | 2023-06-06 | 江西云眼视界科技股份有限公司 | Video moving target identification method, system, storage medium and computer |
CN116229336B (en) * | 2023-05-10 | 2023-08-18 | 江西云眼视界科技股份有限公司 | Video moving target identification method, system, storage medium and computer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |