CN115565108A - Video camouflage and salient object detection method based on decoupling self-supervision - Google Patents
- Publication number
- CN115565108A (application CN202211232708.0A)
- Authority
- CN
- China
- Prior art keywords
- video
- supervision
- self
- training
- object detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention provides a video camouflaged and salient object detection method based on decoupled self-supervision, which comprises the following steps. 1. A frame routing mechanism in self-supervised form is constructed, which helps the network identify which video frames in a video contain abundant motion information and which contain insufficient motion information. 2. A motion segmentation network and an image segmentation network are constructed in a self-supervised manner; when the motion information of a video frame is sufficient, the motion segmentation network is used to detect the camouflaged/salient object, and when the motion information of a frame is insufficient, the image segmentation network is used instead. 3. The decoupled self-supervised network framework built by the method can complete video camouflaged and salient object detection simultaneously, without relying on any data annotation.
Description
Technical Field
The invention relates to a video camouflage and salient object detection method, in particular to a video camouflage and salient object detection method based on decoupling self-supervision.
Background
In recent years, with the rapid development of deep convolutional networks, camouflaged and salient object detection has made great breakthroughs. Compared with traditional camouflaged and salient object detection algorithms, the accuracy of deep-learning-based methods is greatly improved: a deep neural network can acquire high-level semantic information of an image, and this information allows camouflaged and salient objects in a video to be detected more accurately. Representative works include Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli, "See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks", in CVPR, 2019; Hala Lamdouar, Charig Yang, Weidi Xie, and Andrew Zisserman, "Betrayed by Motion: Camouflaged Object Discovery via Motion Segmentation", in ACCV, 2020; and Miao Zhang, Jie Liu, Yifei Wang, Yongri Piao, Shunyu Yao, Wei Ji, Jingjing Li, Huchuan Lu, and Zhongxuan Luo, "Dynamic Context-Sensitive Filtering Network for Video Salient Object Detection", in ICCV, 2021. These methods design deep networks that directly fuse the motion (optical flow) information and the appearance (context) information of the video frames to obtain robust detection results.
Although these methods further improve the accuracy of camouflaged and salient object detection by improving the network structure, they share a defect: they cannot correctly identify which frames in a video sequence have insufficient motion information, and directly fusing insufficient optical flow information with context information degrades the detection performance of the network. Therefore, to solve this problem, a decoupling idea is proposed herein: instead of directly fusing context information and motion information to complete detection, two independent networks are designed that use the optical flow information and the context information of the video sequence, respectively, to complete detection. Meanwhile, to further widen the applicable scenarios of the network, a self-supervised network model is designed, so that the proposed model can complete the detection task without annotated data.
Disclosure of Invention
Purpose of the invention: the invention aims to solve the above technical problems of the prior art and provides a video camouflaged and salient object detection method based on decoupled self-supervision.
In order to solve the technical problems, the invention discloses a video camouflage and salient object detection method based on decoupling self-supervision.
The disclosed method first designs a frame routing mechanism that correctly identifies which frames in a video sequence have insufficient motion information and which have sufficient motion information. Meanwhile, two independent networks are designed: a motion segmentation network and an image segmentation network. The motion segmentation network processes video frames with sufficient optical flow information; the optical flow of such a frame is input to obtain the corresponding detection result. The image segmentation network processes video frames with insufficient optical flow information; the RGB image of such a frame is input to obtain the corresponding detection result.
The method comprises the following specific steps:
step 1, constructing a decoupling self-supervision video camouflage and salient object detection model; the model comprises: an adaptive frame routing mechanism, a motion segmentation network and an image segmentation network;
the self-adaptive frame routing mechanism is used for carrying out sufficiency judgment on motion information of video frames in a target video;
sending the video frame with sufficient motion information selected by the self-adaptive frame routing mechanism into the motion segmentation network for processing; sending the video frames with insufficient motion information selected by the self-adaptive frame routing mechanism into the image segmentation network for processing;
combining the processing results of the motion segmentation network and the image segmentation network together to obtain the corresponding detection result of the video frame in the target video;
step 2, training the decoupled self-supervised video camouflaged and salient object detection model: inputting the camouflaged and salient object training video set into the model; training the adaptive frame routing mechanism, the motion segmentation network, and the image segmentation network; and iteratively optimizing the decoupled self-supervised video camouflaged and salient object detection model;
and 3, inputting the target video to be detected into the trained decoupling self-monitoring video camouflage and salient object detection model for detection, and completing the decoupling self-monitoring-based video camouflage and salient object detection.
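The routing-based inference of steps 1-3 can be sketched as follows. This is a minimal illustration under assumed interfaces: `afr`, `motion_net`, `image_net`, and `estimate_flow` are placeholder callables (the threshold of 0.5 is likewise an assumption), not the patent's actual modules.

```python
import numpy as np

def detect(frames, afr, motion_net, image_net, estimate_flow):
    """Route each frame of a (T, 3, H, W) clip to the motion or image branch.

    afr(flow) -> score; above the threshold the frame is judged to carry
    sufficient motion information and goes to the motion segmentation branch.
    """
    flows = estimate_flow(frames)            # e.g. RAFT in the text; assumed here
    masks = []
    for t in range(frames.shape[0]):
        if afr(flows[t]) > 0.5:              # sufficient motion -> MS branch
            masks.append(motion_net(flows[t]))
        else:                                # insufficient motion -> CS branch
            masks.append(image_net(frames[t]))
    return np.stack(masks, axis=0)           # per-frame segmentation masks
```

Combining the per-frame outputs with `np.stack` mirrors step 1's final assembly of the two branches' results into one detection result for the whole clip.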
The training and construction method of the adaptive frame routing mechanism AFR in the step 1 comprises the following steps:
step 1-1, generating a training sample for training an adaptive frame routing mechanism (AFR);
step 1-2, training an adaptive frame routing mechanism AFR;
and 1-3, identifying whether the motion information of each frame in the target video is sufficient by using a trained adaptive frame routing mechanism AFR.
The training samples in step 1-1 comprise: easily decomposed optical flow map (EDP, Easily Decomposed) frames, whose optical flow information is relatively clear, so that the frames are relatively easy to decompose; and hardly decomposed optical flow map (HDP, Hardly Decomposed) frames, whose optical flow information is relatively disordered and hard to decompose.
The EDP frames are obtained directly from the optical flow maps corresponding to the videos in the training set, because the motion information of most video frames in the training set is sufficient and easy to decompose. The HDP frames are generated by a pseudo motion generation module (PMG, Pseudo Motion Generation); the generation process is as follows:
A static picture is selected as the input image, and a sequence u' ∈ R^{N×L×L} is cropped out, where N and L are the number of cropped frames and the crop size, respectively. A speed parameter s = (v_x, v_y) determines the moving distance of the cropped frame in the horizontal and vertical directions; the horizontal moving speed v_x and the vertical moving speed v_y are selected from the set S = {−K, …, −1, 0, 1, …, K}, where K represents the maximum speed.
The accumulated displacement is denoted D = (D_x, D_y), where D_x represents the displacement in the horizontal x-direction and D_y the displacement in the vertical y-direction.
A cropping start point p_start is randomly selected on the input image, and the cropping end point is p_end = p_start + D; the image sequence u' is obtained by cropping. Finally, the image sequence u' is converted into an optical flow sequence u'_f using an existing optical flow detection algorithm, obtaining the hardly decomposed HDP optical flow frames.
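The cropping step of the PMG above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `pseudo_motion` and its parameters are illustrative names, non-negative speeds are assumed for brevity (the full method samples v_x, v_y from {−K, …, K} and composites the crop back onto the source image), and the output sequence would still need to be passed through an optical flow estimator (e.g. RAFT) to obtain the HDP frames.

```python
import numpy as np

def pseudo_motion(image, n_frames=5, vx=2, vy=1, crop=32, seed=0):
    """image: (H, W, 3) array.

    Slides a crop window by (vx, vy) pixels per frame, producing an
    (n_frames, crop, crop, 3) sequence whose only 'motion' is the synthetic
    translation. Sketch assumes vx, vy >= 0.
    """
    rng = np.random.default_rng(seed)
    H, W, _ = image.shape
    # random cropping start point p_start; the end point is p_start plus the
    # accumulated displacement D = (n_frames - 1) * (vy, vx)
    y0 = int(rng.integers(0, H - crop - vy * (n_frames - 1)))
    x0 = int(rng.integers(0, W - crop - vx * (n_frames - 1)))
    seq = [image[y0 + vy * t: y0 + vy * t + crop,
                 x0 + vx * t: x0 + vx * t + crop] for t in range(n_frames)]
    return np.stack(seq)
```

Consecutive frames of the returned sequence overlap except for a (vy, vx) shift, which is exactly the uniform translation the AFR should learn to flag as synthetic.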
Through the above steps, a series of HDP frames and EDP frames is obtained for training the AFR. Note that if the motion information of a video frame is sufficient, its corresponding optical flow map is easy to decompose; if the motion information is insufficient, its corresponding optical flow map is difficult to decompose. Therefore, after the AFR is trained with the HDP and EDP frames, it can judge whether the motion information of a given frame is sufficient.
The method for training the adaptive frame routing mechanism AFR described in step 1-2 comprises:
The adaptive frame routing mechanism AFR is trained using an asymmetric loss, defined as:
L_q(u) = [(a+1)^b − (a+u)^b] / b
where the first parameter a = 1 and the second parameter b = 2; u is the cross entropy loss, expressed as:
u = −[y·log(ŷ) + (1−y)·log(1−ŷ)]
where y represents the true label of a training sample and ŷ represents the predicted label of the training sample.
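The two formulas above can be written out directly. The helper names `cross_entropy` and `asymmetric_loss` are illustrative, and the asymmetric loss is transcribed exactly as stated in the text:

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """Binary cross entropy u = -[y*log(p) + (1-y)*log(1-p)] for y in {0,1}."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def asymmetric_loss(u, a=1.0, b=2.0):
    """L_q(u) = [(a+1)^b - (a+u)^b] / b, with a=1, b=2 as in the text."""
    return ((a + 1) ** b - (a + u) ** b) / b
```

With a = 1 and b = 2 the loss reduces to [4 − (1+u)^2] / 2, so it depends on the cross entropy u only through the quadratic term.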
The identification method in the steps 1-3 comprises the following steps:
the input of the decoupling self-supervision video camouflage and salient object detection model is a video sequence And corresponding optical flow sequenceWherein, T a For the number of input frames, H and W are image sizes of input frames,representing a resolution size of a video frame;
selecting a frame containing sufficient motion information using the adaptive frame routing mechanism AFRAnd frames with insufficient motion informationWherein T is a =T m +T c ;T m Indicating the number of sufficient frames of motion information, T c Indicating the number of frames for which motion information is insufficient.
The construction method of the motion segmentation network and the image segmentation network in the step 1 comprises the following steps:
step 1-4, constructing a motion segmentation network MS; for segmenting foreground objects from the motion representation;
step 1-5, constructing an image segmentation network CS; for segmenting foreground objects from RGB images.
The motion segmentation network MS described in steps 1-4 comprises three components: a convolutional neural network (CNN) encoder, for extracting the feature representation; a generative model, for generating the foreground and background representations; and a CNN decoder, for decoding the foreground and background representations to the final output.
Let X_f be a single optical flow map. First, the optical flow map X_f is sent to a CNN encoder Φ_enc, which outputs a low-resolution feature:
F = Φ_enc(X_f) ∈ R^{H_0×W_0×D}
where H_0 and W_0 respectively represent the spatial dimensions of the output feature and D represents the channel size.
For this feature F, the generative model updates the query vectors Z_q^(t) ∈ R^d a total of T times, where Z_q^(t) represents the query vector after the t-th update and q ∈ {0, 1} is the category associated with the query embedding: 0 represents the background and 1 represents the foreground. The query vectors are learnable and are initialized with random weights:
Z_q^(0) ~ N(μ, σ)
where μ and σ are the mean and variance of the Gaussian distribution and d is the size of the weight vector. Z^(t) ∈ R^{2×d} denotes the query vectors of all categories; in the subsequent process, Z_0^(t) and Z_1^(t) are updated simultaneously as a whole Z^(t).
The query vector Z^(t+1) is updated using the features F and Z^(t).
First, a 1×1 convolutional layer is used to reduce the channels of F and flatten its spatial dimensions, yielding a feature F' ∈ R^{L×d}, where the feature length L = H_0×W_0. Meanwhile, a position vector is added to F' to enhance the extraction of spatial information, i.e. F'' = F' + PE, where PE is the position vector and F'' is the new feature representation after adding the position vector to F'. Then two multi-layer perceptron (MLP) layers f_q and f_k are used, each comprising three fully connected layers and a rectified linear unit (ReLU) layer, to calculate the query value and key value of the self-attention mechanism:
query = f_q(Z^(t)), key = f_k(F'')
obtaining the attention A through a normalized exponential function Softmax function (t) :
The Attention mechanism calculates a weighted sum of features in the spatial dimension by:
The query vector Z^(t) is finally updated by a gated recurrent unit (GRU) to:
Z (t+1) =GRU(U (t) ,Z (t) )
Here U^(t) and Z^(t) are the input state and the hidden state. The generative model is iterated 3 times, and the output is O = {O_f, O_b}, where O_f represents the foreground query vector and O_b represents the background query vector. During decoding, the two vectors are broadcast onto a two-dimensional grid with spatial position encoding.
finally, the CNN decoder phi dec Separately decode { O f ,O b To original resolution:
wherein, the first and the second end of the pipe are connected with each other,is the optical flow field that is reconstructed,is a reconstructed foreground optical flow field,is the reconstructed background light flow field. Alpha is alpha fore Is the MASK image corresponding to the foreground optical flow field, alpha back Is the MASK image corresponding to the background light flow field. The final reconstructed graph can be expressed as:
wherein, for { alpha fore ,α back Use Softmax to ensure α fore +α back =1; the MS branch completes training in a self-supervision mode, and the loss function comprises reconstruction loss L rec And entropy regularization loss L ent :
The purpose of L_ent is to make the mask binary, so that the final segmentation result can be obtained. L_ent is defined as:
L_ent = −[α_fore·log(α_fore) + α_back·log(α_back)]
i.e. the entropy of the soft mask, which drives α toward a one-hot form. When α_fore and α_back clearly represent the foreground and background, L_ent is zero; when α_fore and α_back cannot separate the foreground and background and their values are close, L_ent is at its maximum.
finally, the sequence X is obtained through the training mode F Corresponding result O F 。
The input to the image segmentation network CS described in steps 1-5 is the video sequence X_R; the output O_R is obtained through a single-image camouflaged object detection method (see: Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan, "Simultaneously localize, segment and rank the camouflaged objects," in CVPR, 2021) or a single-image salient object detection method (see: Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip Torr, "Deeply supervised salient object detection with short connections," TPAMI, 2019).
The method for training the decoupling self-supervision video camouflage and salient object detection model in the step 2 comprises the following steps:
step 2-1, data preprocessing: data enhancement, including random flipping and random cropping, is performed on the camouflaged and salient object training set to be input into the decoupled self-supervised video camouflaged and salient object detection model;
step 2-2, the AFR classifier is trained using data generated by the pseudo motion generation module PMG, so that it can distinguish whether the motion information contained in a video is sufficient;
the motion segmentation network MS is trained in a self-supervised manner, so that it can perform complete and accurate object detection from the optical flow map;
the results generated by the motion segmentation network MS are used to supervise the training of the image segmentation network CS, so that the CS can perform complete and accurate object detection from RGB images;
the results of the motion segmentation network MS and the image segmentation network CS cross-supervise each other, so that the networks gradually generate complete and accurate camouflaged and salient object maps; after several rounds of training, the final network model parameters are saved.
The method of step 3 for inputting the target video to be detected into the trained decoupled self-supervised video camouflaged and salient object detection model comprises: inputting the target video to be detected into the trained model for inference, obtaining the corresponding camouflaged and salient object segmentation images.
Beneficial effects:
A decoupling idea is proposed: instead of directly fusing context information and motion information to complete detection, two independent networks are designed that use the optical flow information and the context information of the video sequence, respectively, to complete detection. Meanwhile, to further widen the applicable scenarios of the network, a self-supervised network model is designed, so that the proposed model can complete the detection task without annotated data.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic process flow diagram of the present invention.
Fig. 2 is a schematic diagram of a pseudo motion generation process.
FIG. 3 is a schematic diagram of the detection results of the present invention.
Detailed Description
A video camouflage and salient object detection method based on decoupling self-supervision is disclosed, as shown in figure 1, and comprises the following steps:
step 1, constructing an adaptive frame routing mechanism (AFR): an adaptive frame routing mechanism is built to distinguish which video frames in a video have insufficient motion information and which have sufficient motion information;
step 2, constructing a motion segmentation network and an image segmentation network: the AFR mechanism of step 1 selects the video frames with sufficient motion information and sends them into the motion segmentation network; meanwhile, the AFR selects the video frames with insufficient motion information and sends them into the image segmentation network. Finally, the results of the motion segmentation network and the image segmentation network are stitched together to obtain the detection result of the video;
wherein, step 1 includes the following steps:
step 1-1, generating a training sample for training AFR.
Optical flow information is used herein to represent the motion information of each sample: when the motion information of a frame is sufficient, its corresponding optical flow map is easily decomposed (EDP); if the motion information of a frame is insufficient, the corresponding optical flow map is hardly decomposed (HDP). Therefore, to train the AFR, the corresponding training samples need to be created. For EDP frames, the optical flow maps corresponding to the training videos are taken directly; HDP frames are generated by a completely new pseudo motion generation module (PMG), whose process is shown in Fig. 2. The specific flow of the PMG is as follows:
A static picture and a speed parameter s are selected, and a sequence u ∈ R^{N×L×L} is cropped from the static picture, where N and L are the number of cropped frames and the image size, respectively. The speed parameter s = (v_x, v_y) determines the moving distance of the cropped frames in the horizontal and vertical directions; v_x and v_y are selected from the set S = {−K, …, −1, 0, 1, …, K}, where K represents the maximum speed.
Then, a cropping start point p_start is randomly selected on the image; the cropping end point is p_end = p_start + D. The sequence u is cropped out and overlaid on the original image at the successive positions starting from p_start, obtaining the sequence u' ∈ R^{N×H×W}. Finally, the image sequence u' is converted into an optical flow sequence u'_f using RAFT, thereby obtaining the HDP frames. Finally, the AFR is trained with an asymmetric loss, defined as:
L_q(u) = [(a+1)^b − (a+u)^b] / b
where a = 1 and b = 2; u is the cross entropy loss, which can be expressed as u = −[y·log(ŷ) + (1−y)·log(1−ŷ)], with y the true label and ŷ the predicted label.
Step 1-2, the trained AFR is used to identify whether the motion information of each frame in a video is sufficient.
The input of the decoupled self-supervised network model is a video sequence X_R and the corresponding optical flow sequence X_F, where T_a is the number of input frames and H, W are the image sizes of the input frames. The AFR is used to select the frames containing sufficient motion information, X_F^m, and the frames with insufficient motion information, X_R^c, where T_a = T_m + T_c.
The step 2 comprises the following steps:
step 2-1, constructing a motion segmentation network MS for segmenting foreground objects from motion representations;
The motion segmentation network MS comprises three components: 1. a CNN encoder, which extracts the feature representation; 2. a generative model, which generates the foreground and background representations; 3. a CNN decoder, which decodes the foreground and background representations, respectively, to the final output. To simplify the explanation, a single optical flow map X_f is taken as an example. First, X_f is sent to the CNN encoder Φ_enc, which outputs a low-resolution feature:
F = Φ_enc(X_f) ∈ R^{H_0×W_0×D}
where H_0 and W_0 respectively represent the spatial dimensions of the output feature and D denotes the channel size.
For this feature F, the proposed generative model updates the query vectors Z_q^(t) ∈ R^d a total of T times, where Z_q^(t) represents the query vector after the t-th update and q ∈ {0, 1} is the category associated with the query embedding: "0" represents the background and "1" represents the foreground. The query vectors are learnable and are initialized with random weights:
Z_q^(0) ~ N(μ, σ)
where μ and σ are the mean and variance of the Gaussian distribution and d is the size of the weight vector. Z^(t) ∈ R^{2×d} denotes the query vectors of all categories; in the subsequent process, Z_0^(t) and Z_1^(t) are updated simultaneously as a whole Z^(t).
The query vector Z^(t+1) is updated using the features F and Z^(t). First, a 1×1 convolutional layer is used to reduce the channels of F and flatten its spatial dimensions, resulting in a feature F' ∈ R^{L×d}, where L = H_0×W_0. At the same time, the position vector is added to F' to enhance the extraction of spatial information, i.e. F'' = F' + PE, where PE is the position vector. Then two MLP layers f_q and f_k are used, each with three FC layers and a ReLU layer; this design gives the self-attention mechanism more flexibility when calculating the query and the key:
query = f_q(Z^(t)), key = f_k(F'')
The attention A^(t) is obtained through the Softmax function:
A^(t) = Softmax(query · key^T / √d)
The attention mechanism calculates a weighted sum of the features over the spatial dimension:
U^(t) = A^(t) · F''
The query vector Z^(t) is finally updated by the GRU as:
Z^(t+1) = GRU(U^(t), Z^(t))
Here U^(t) and Z^(t) are the input state and the hidden state. The generative model is iterated 3 times, with the output O:
O = {O_f, O_b}, where O_f represents the foreground query vector and O_b represents the background query vector. During decoding, the two vectors are broadcast onto a two-dimensional grid with a learnable spatial position code.
Finally, the CNN decoder Φ_dec separately decodes {O_f, O_b} to the original resolution:
(F̂_f, α_fore) = Φ_dec(O_f), (F̂_b, α_back) = Φ_dec(O_b)
where F̂_f and F̂_b are the reconstructed foreground and background optical flow fields and α_fore, α_back are the corresponding MASK images. Thus, the final reconstructed image is:
F̂ = α_fore ⊙ F̂_f + α_back ⊙ F̂_b
where Softmax is applied over {α_fore, α_back} to ensure α_fore + α_back = 1. So that the MS branch can complete training in a self-supervised manner, the loss function comprises the reconstruction loss L_rec and the entropy regularization loss L_ent.
The purpose of L_ent is to make the mask binary, from which the final segmentation result is obtained. L_ent is defined as:
L_ent = −[α_fore·log(α_fore) + α_back·log(α_back)]
As can be seen from this loss, when the α masks are in one-hot form, L_ent is zero; when their probabilities are equal, L_ent is maximal. Finally, in this way, the result O_F corresponding to the sequence X_F can be obtained.
And 2-2, constructing an image segmentation network CS for segmenting the foreground object from the RGB image.
The input to the CS is the video sequence X_R; the output O_R is obtained by an existing single-image camouflaged object detection method or single-image salient object detection method.
step 3-1, data preprocessing: data enhancement, such as random flipping and random cropping, is performed on the camouflaged object training set and the salient object training set to be input into the decoupled self-supervised network;
step 3-2, first, the AFR classifier is trained with data generated by the PMG module, so that it can distinguish whether the motion information contained in a video is sufficient. Second, the MS module is trained in a self-supervised manner, so that it can complete accurate object detection from the optical flow map. Third, the CS module is trained under the supervision of the results generated by the MS module, so that it can complete accurate object detection from RGB images. Fourth, the results of the MS and the CS cross-supervise each other, so that the networks gradually generate complete and accurate camouflaged and salient object maps; after several rounds of training, the final network model parameters are saved;
step 3-3, testing the model framework: the images are input into the trained decoupled self-supervised network for inference, obtaining the corresponding camouflaged and salient object segmentation images.
Example:
A video camouflaged and salient object detection method based on decoupled self-supervision is implemented as shown in Fig. 1 according to the following steps:
1. Constructing the decoupled self-supervised network G.
Input: a video collection of camouflaged or salient objects.
Output: the corresponding camouflaged or salient object segmentation images and the loss function.
1.1 Constructing the decoupled self-supervised network model framework and extracting optical flow:
The input of the decoupled self-supervised network model is a video sequence X_R and the corresponding optical flow sequence X_F; X_F is extracted by the optical flow estimation algorithm RAFT. T_a is the number of input frames and H, W are the image sizes of the input frames.
1.2 An adaptive frame routing mechanism AFR is designed to distinguish which frames have sufficient optical flow information and which have insufficient optical flow information. Frames with sufficient optical flow information are sent to the motion segmentation network (MS), and the optical flow information is used to obtain the corresponding segmentation results. Frames with insufficient optical flow information are sent to the image segmentation network (CS), and the RGB image information is used to obtain the corresponding segmentation results. The loss function is then calculated from the segmentation results, and the parameters are optimized.
2. Training the overall framework.
The training of the dual-branch deep convolutional neural network comprises data preprocessing, model framework training, and testing stages.
3.1 Data preprocessing:
The input video sets of camouflaged objects and salient objects are adjusted by stretching, flipping, and similar operations, and then input into the decoupled self-supervised network.
Input: video collections of camouflaged and salient objects.
Output: data-enhanced video collections of camouflaged and salient objects.
Geometric enhancement: changing the image geometry by translation, rotation, and shearing can enhance the generalization ability of the model.
3.2 Model framework training.
Input: data-enhanced video collections of camouflaged and salient objects.
Output: segmentation results for the video sets of camouflaged and salient objects, and the loss function.
During training, a mini-batch stochastic gradient descent (SGD) optimization algorithm may be used, with a batch size of 32, a momentum of 0.9, and a weight decay of 1e-5. The learning rate is set to 1e-4 and the maximum number of epochs to 100. The training images are resized to 352 × 352 as input to the entire network.
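As a worked sketch of one mini-batch SGD update with the hyperparameters stated above (momentum 0.9, weight decay 1e-5, learning rate 1e-4), reduced to a single scalar weight for clarity; this is one common formulation of momentum SGD, not the patent's exact optimizer code:

```python
# One SGD-with-momentum update step; weight decay folded into the gradient.
def sgd_step(w, grad, buf, lr=1e-4, momentum=0.9, weight_decay=1e-5):
    g = grad + weight_decay * w   # L2 weight decay term
    buf = momentum * buf + g      # update the momentum buffer
    w = w - lr * buf              # descend along the smoothed gradient
    return w, buf

w, buf = sgd_step(1.0, 0.5, 0.0)
```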
3.3 testing the model framework;
Input: a video set of camouflaged and salient objects;
Output: the corresponding camouflage and saliency object segmentation images;
The detection effect of the model in the present invention is shown in fig. 3, which shows a total of 6 video sequences. Sequences 1 to 3 are saliency detection video sequences, and sequences 4 to 6 are camouflage detection video sequences. For each sequence, the first row shows the input video sequence, the second row the segmentation results, and the third row the optical flow information of each frame. The optical flow information in the first three columns is sufficient, so the model completes the segmentation in the MS branch using optical flow information; the motion in the last two columns is insufficient, so the model completes the segmentation in the CS branch using RGB image information.
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, can perform some or all of the steps of the video camouflage and salient object detection method based on decoupling self-supervision in each embodiment provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It is clear to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and a corresponding general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention, or the portions thereof that contribute over the prior art, may be embodied in the form of a computer program, that is, a software product. The software product may be stored in a storage medium and include several instructions for enabling a device comprising a data processing unit (which may be a personal computer, a server, a microcontroller unit MCU, a network device, or the like) to execute the methods described in the embodiments, or parts of the embodiments, of the present invention.
The present invention provides an approach to video camouflage and salient object detection based on decoupling self-supervision, and there are many methods and ways to implement this technical solution; the above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present invention, and these should also be regarded as falling within the protection scope of the present invention. All components not specified in this embodiment can be implemented by the prior art.
Claims (10)
1. A video camouflage and salient object detection method based on decoupling self-supervision is characterized by comprising the following steps:
step 1, constructing a decoupling self-supervision video camouflage and salient object detection model; the model comprises: an adaptive frame routing mechanism, a motion segmentation network and an image segmentation network;
the adaptive frame routing mechanism is used for judging the sufficiency of the motion information of the video frames in a target video;
the video frames with sufficient motion information selected by the adaptive frame routing mechanism are sent into the motion segmentation network for processing; the video frames with insufficient motion information selected by the adaptive frame routing mechanism are sent into the image segmentation network for processing;
combining the processing results of the motion segmentation network and the image segmentation network together to obtain the corresponding detection result of the video frame in the target video;
step 2, training the decoupling self-supervision video camouflage and salient object detection model: inputting a camouflage and salient object training video set into the decoupling self-supervision video camouflage and salient object detection model, training the adaptive frame routing mechanism, the motion segmentation network, and the image segmentation network, and performing iterative optimization on the decoupling self-supervision video camouflage and salient object detection model;
and 3, inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for detection, and completing the decoupling self-supervision-based video camouflage and salient object detection.
2. The video camouflage and salient object detection method based on decoupling self-supervision as claimed in claim 1, wherein the method for training and constructing the adaptive frame routing mechanism AFR in step 1 comprises the following steps:
step 1-1, generating training samples for training the adaptive frame routing mechanism AFR;
step 1-2, training the adaptive frame routing mechanism AFR;
step 1-3, using the trained adaptive frame routing mechanism AFR to identify whether the motion information of each frame in the target video is sufficient.
3. The video camouflage and salient object detection method based on decoupling self-supervision as claimed in claim 2, wherein the training samples in step 1-1 comprise: easy-to-decompose optical flow map EDP frames and hard-to-decompose optical flow map HDP frames;
wherein the easy-to-decompose EDP frames are taken directly from the optical flow maps corresponding to the videos in the training set, and the hard-to-decompose optical flow map HDP frames are generated by a pseudo-motion generation module PMG, the generation process comprising:
selecting a static picture as the input image and cropping out a sequence u' ∈ R^{N×L×L}, where N and L are the number of frames in the sequence and the image size, respectively, and R denotes the real space; a speed parameter s = (v_x, v_y) determines the moving distance of the crop window in the horizontal and vertical directions; the horizontal moving speed v_x and the vertical moving speed v_y are selected from the set S = {−K, ..., −1, 0, 1, ..., K}, wherein K represents the maximum speed;
wherein D_x represents the displacement in the horizontal x-direction, and D_y represents the displacement in the vertical y-direction;
randomly selecting a crop start point p_start on the input image, with the crop end point determined by the displacements (D_x, D_y); the image sequence u' is obtained by cropping; finally, the image sequence u' is converted into an optical flow sequence u'_f, yielding the hard-to-decompose optical flow map HDP frames.
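The pseudo-motion idea above can be sketched as follows: a fixed-size crop window slides across a static image at speed (v_x, v_y), so consecutive crops differ only by a rigid translation. This is a hedged illustration; the function and argument names are not from the patent:

```python
# Illustrative pseudo-motion generation (PMG) sketch: crop a static image
# along a straight path so that the displacement after t frames is t*(vx, vy).
def pmg_sequence(image, n_frames, crop, vx, vy, start):
    x0, y0 = start
    frames = []
    for t in range(n_frames):
        x, y = x0 + t * vx, y0 + t * vy  # displacement D = t * (vx, vy)
        frames.append([row[x:x + crop] for row in image[y:y + crop]])
    return frames

img = [[10 * r + c for c in range(6)] for r in range(6)]
seq = pmg_sequence(img, 2, 2, 1, 1, (0, 0))
print(seq[1])  # [[11, 12], [21, 22]]
```

In a real pipeline the resulting frame sequence would then be converted to optical flow (e.g. with RAFT) to obtain the HDP training samples.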
4. The video camouflage and salient object detection method based on decoupling self-supervision as claimed in claim 3, wherein the method for training the adaptive frame routing mechanism AFR in step 1-2 comprises:
training the adaptive frame routing mechanism AFR using an asymmetric loss, defined as:
L_q(u) = [(a+1)^b − (a+u)^b] / b
wherein the first parameter a = 1, the second parameter b = 2, and u is the cross-entropy loss.
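A minimal numeric sketch of this asymmetric loss with a = 1 and b = 2; the binary cross-entropy form used for u is an assumption for illustration:

```python
import math

# Asymmetric loss of claim 4: L_q(u) = ((a+1)**b - (a+u)**b) / b.
def asymmetric_loss(u, a=1.0, b=2.0):
    return ((a + 1) ** b - (a + u) ** b) / b

# Assumed per-sample cross-entropy u for probability p and binary label y.
def binary_cross_entropy(p, y, eps=1e-12):
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

u = binary_cross_entropy(0.9, 1)  # small u for a confident prediction
print(asymmetric_loss(u))
```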
5. The video camouflaging and salient object detection method based on decoupled self-supervision as claimed in claim 4, wherein the identification method in steps 1-3 comprises:
the input of the decoupling self-supervision video camouflage and salient object detection model is a video sequence X_R and a corresponding optical flow sequence X_F; wherein T_a is the number of input frames, and H and W are the height and width of the input frames, H×W representing the resolution of a video frame;
the adaptive frame routing mechanism AFR is used to select the frames containing sufficient motion information and the frames with insufficient motion information, wherein T_a = T_m + T_c; T_m indicates the number of frames with sufficient motion information, and T_c indicates the number of frames with insufficient motion information.
6. The method for detecting video camouflaging and salient objects based on decoupling self-supervision as claimed in claim 5, wherein the method for constructing the motion segmentation network and the image segmentation network in step 1 comprises the following steps:
step 1-4, constructing a motion segmentation network MS for segmenting foreground objects from the motion representation;
step 1-5, constructing an image segmentation network CS for segmenting foreground objects from RGB images.
7. The video camouflage and salient object detection method based on decoupling self-supervision as claimed in claim 6, wherein the motion segmentation network MS in steps 1-4 comprises three components: a convolutional neural network (CNN) encoder for extracting a feature representation; a generative model for generating foreground and background representations; and a CNN decoder for decoding the foreground and background representations to the final output;
let X_f be a single optical flow map; first, the optical flow map X_f is fed to the CNN encoder Φ_enc, which outputs a low-resolution feature F = Φ_enc(X_f) ∈ R^{H_0×W_0×D},
wherein H_0 and W_0 represent the spatial dimensions of the output feature, and D represents the channel size;
for this feature F, the query vector Z_q^(t) is updated a total of T times, wherein Z_q^(t) represents the query vector after the t-th update and q ∈ {0, 1} is the category associated with the query embedding, with 0 representing the background and 1 the foreground; the query vector is initialized with random weights drawn from a Gaussian distribution,
wherein μ and σ are the mean and variance of the Gaussian distribution, and d is the size of the weight vector; Z^(t) ∈ R^{2×d} denotes the query vectors of all categories; in the subsequent process, Z_0^(t) and Z_1^(t) are updated simultaneously as the whole Z^(t);
the query vector Z^(t+1) is updated using the feature F and Z^(t);
first, a 1 × 1 convolutional layer is used to reduce the channels of F and flatten its spatial dimensions, yielding a feature F' ∈ R^{L×d},
wherein the feature length L = H_0 × W_0; meanwhile, a position vector is added to F' to enhance the extraction of spatial information, i.e. F̃ = F' + PE, where PE is the position vector and F̃ is the new feature representation after adding the position vector to F'; then two multi-layer perceptron MLP layers are used, each having three fully connected layers and a rectified linear unit layer, to calculate the query value Q and the key value K for the self-attention mechanism;
the attention A^(t) is obtained through the normalized exponential function Softmax, A^(t) = Softmax(Q K^T / √d);
the attention mechanism then calculates the weighted sum of the features over the spatial dimension, U^(t) = A^(t) F̃;
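As an illustrative reduction of the Softmax attention and spatial weighted-sum steps above, with plain lists in place of tensors; the 1/√d scaling follows the standard attention formulation and is an assumption here:

```python
import math

# Toy attention sketch: A = Softmax(Q K^T / sqrt(d)) over spatial positions,
# then U = A F, the attention-weighted sum of features.
# Shapes: Q is (queries x d), K and F are (positions x d or positions x D).
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, F):
    d = len(Q[0])
    A = [softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]) for q in Q]
    U = [[sum(a * f[j] for a, f in zip(row, F)) for j in range(len(F[0]))]
         for row in A]
    return A, U

# A zero query attends uniformly, so U is the mean of the features.
A, U = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [3.0]])
print(A, U)  # [[0.5, 0.5]] [[2.0]]
```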
the query vector Z^(t) is finally updated by the gated recurrent unit GRU as:
Z^(t+1) = GRU(U^(t), Z^(t))
wherein U^(t) and Z^(t) are the input state and the hidden state, respectively; the generative model is iterated 3 times, and the output is O = {O_f, O_b};
wherein O_f is derived from the foreground query vector and O_b from the background query vector; during decoding, the two vectors are broadcast onto a two-dimensional grid with spatial position encoding;
finally, the CNN decoder Φ_dec decodes {O_f, O_b} separately to the original resolution;
wherein X̂_f is the reconstructed optical flow field, X̂_f^fore is the reconstructed foreground optical flow field, and X̂_f^back is the reconstructed background optical flow field; α_fore is the MASK image corresponding to the foreground optical flow field, and α_back is the MASK image corresponding to the background optical flow field; the final reconstruction can be expressed as X̂_f = α_fore · X̂_f^fore + α_back · X̂_f^back;
wherein Softmax is applied over {α_fore, α_back} to ensure α_fore + α_back = 1; the MS branch completes training in a self-supervised manner, and the loss function comprises the reconstruction loss L_rec, which measures the difference between the input optical flow X_f and the reconstruction X̂_f, and the entropy regularization loss L_ent:
L_ent is defined as:
L_ent = −(α_fore · log(α_fore) + α_back · log(α_back))
which drives the mask α toward a one-hot form; when α_fore and α_back clearly represent the foreground and background, L_ent is zero; when α_fore and α_back cannot represent the foreground and background and their values are close to 0.5, L_ent is at its maximum;
finally, through the above training procedure, the result O_F corresponding to the sequence X_F is obtained.
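The mask-weighted reconstruction and the two loss terms of claim 7 can be sketched numerically as follows; the exact norm of the reconstruction loss and the list-based shapes are assumptions for illustration:

```python
import math

# x_hat = a_fore * x_fore + a_back * x_back, with a_back = 1 - a_fore.
def reconstruct(a_fore, x_fore, x_back):
    return [a * f + (1 - a) * b for a, f, b in zip(a_fore, x_fore, x_back)]

# Assumed mean-squared reconstruction loss between input and reconstruction.
def rec_loss(x, x_hat):
    return sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat)) / len(x)

# Binary-entropy regulariser: zero for a one-hot mask, maximal at 0.5.
def entropy_reg(a_fore, eps=1e-12):
    a_back = 1.0 - a_fore
    return -(a_fore * math.log(a_fore + eps) + a_back * math.log(a_back + eps))

x_hat = reconstruct([1.0, 0.0], [2.0, 2.0], [5.0, 5.0])
print(x_hat)                       # [2.0, 5.0]
print(round(entropy_reg(0.5), 4))  # 0.6931  (the maximum)
```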
8. The video camouflage and salient object detection method based on decoupling self-supervision as claimed in claim 7, wherein the input of the image segmentation network CS in steps 1-5 is a video sequence X_R, and the output O_R is obtained through a single-image camouflaged object detection method or a single-image salient object detection method.
9. The video camouflage and salient object detection method based on decoupling self-supervision as claimed in claim 8, wherein the method for training the decoupling self-supervision video camouflage and salient object detection model in step 2 comprises:
step 2-1, data preprocessing: performing data enhancement, through random flipping and random cropping, on the camouflage and salient object training set to be input into the decoupling self-supervision video camouflage and salient object detection model;
step 2-2, training the AFR classifier with data formed by the pseudo-motion generation module PMG, so that it can distinguish whether the motion information contained in a piece of video is sufficient;
performing self-supervised training on the motion segmentation network MS, so that complete and accurate object detection can be performed from the optical flow map;
the output of the motion segmentation network MS is used to supervise the training of the image segmentation network CS, so that the image segmentation network CS can perform complete and accurate object detection from RGB images;
the results of the motion segmentation network MS and the image segmentation network CS mutually cross-supervise each other, so that the network gradually generates complete and accurate camouflage and saliency object maps; after the network is trained repeatedly for multiple rounds, the final network model parameters are saved.
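The mutual cross-supervision above can be sketched as each branch's predicted mask serving as the pseudo-label for the other branch, with the two terms summed; the cross-entropy choice and names are illustrative assumptions, not the patent's exact loss:

```python
import math

# Mean binary cross-entropy between a predicted mask and a (soft) target mask.
def bce(pred, target, eps=1e-12):
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

# Cross-supervision: MS supervises CS and CS supervises MS.
def cross_supervision_loss(ms_mask, cs_mask):
    # In a real implementation the pseudo-label side would be detached
    # from the gradient computation (stop-gradient).
    return bce(cs_mask, ms_mask) + bce(ms_mask, cs_mask)

print(cross_supervision_loss([1.0, 0.0], [1.0, 0.0]))  # approximately 0
```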
10. The video camouflage and salient object detection method based on decoupling self-supervision as claimed in claim 9, wherein inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model in step 3 comprises: inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for inference, to obtain the corresponding camouflage and salient object segmentation images.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211232708.0A CN115565108A (en) | 2022-10-10 | 2022-10-10 | Video camouflage and salient object detection method based on decoupling self-supervision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211232708.0A CN115565108A (en) | 2022-10-10 | 2022-10-10 | Video camouflage and salient object detection method based on decoupling self-supervision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115565108A true CN115565108A (en) | 2023-01-03 |
Family
ID=84745836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211232708.0A Pending CN115565108A (en) | 2022-10-10 | 2022-10-10 | Video camouflage and salient object detection method based on decoupling self-supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115565108A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN116935189A (en) * | 2023-09-15 | 2023-10-24 | 北京理工导航控制科技股份有限公司 | Camouflage target detection method and device based on neural network and storage medium |
CN116935189B (en) * | 2023-09-15 | 2023-12-05 | 北京理工导航控制科技股份有限公司 | Camouflage target detection method and device based on neural network and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11200424B2 (en) | Space-time memory network for locating target object in video content | |
CN106960206B (en) | Character recognition method and character recognition system | |
CN109711463B (en) | Attention-based important object detection method | |
CN112750140B (en) | Information mining-based disguised target image segmentation method | |
CN110570433B (en) | Image semantic segmentation model construction method and device based on generation countermeasure network | |
CN113591968A (en) | Infrared weak and small target detection method based on asymmetric attention feature fusion | |
CN112884802B (en) | Attack resistance method based on generation | |
CN112801068B (en) | Video multi-target tracking and segmenting system and method | |
CN116311214B (en) | License plate recognition method and device | |
Chen et al. | Finding arbitrary-oriented ships from remote sensing images using corner detection | |
CN113065550A (en) | Text recognition method based on self-attention mechanism | |
CN114037640A (en) | Image generation method and device | |
CN114283352A (en) | Video semantic segmentation device, training method and video semantic segmentation method | |
CN111325766A (en) | Three-dimensional edge detection method and device, storage medium and computer equipment | |
US20230154139A1 (en) | Systems and methods for contrastive pretraining with video tracking supervision | |
CN114140831B (en) | Human body posture estimation method and device, electronic equipment and storage medium | |
CN115565108A (en) | Video camouflage and salient object detection method based on decoupling self-supervision | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
CN115861756A (en) | Earth background small target identification method based on cascade combination network | |
Hughes et al. | A semi-supervised approach to SAR-optical image matching | |
CN116665114A (en) | Multi-mode-based remote sensing scene identification method, system and medium | |
WO2023185074A1 (en) | Group behavior recognition method based on complementary spatio-temporal information modeling | |
CN111209886A (en) | Rapid pedestrian re-identification method based on deep neural network | |
CN113780241B (en) | Acceleration method and device for detecting remarkable object | |
CN115965968A (en) | Small sample target detection and identification method based on knowledge guidance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||