CN115565108A - Video camouflage and salient object detection method based on decoupling self-supervision - Google Patents

Video camouflage and salient object detection method based on decoupling self-supervision Download PDF

Info

Publication number
CN115565108A
CN115565108A
Authority
CN
China
Prior art keywords
video
supervision
self
training
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211232708.0A
Other languages
Chinese (zh)
Inventor
黄明江
李文丽
孙德生
薛豪奇
赵鑫
陈伟
邢星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xuchang University
Original Assignee
Xuchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuchang University filed Critical Xuchang University
Priority to CN202211232708.0A priority Critical patent/CN115565108A/en
Publication of CN115565108A publication Critical patent/CN115565108A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video camouflage and salient object detection method based on decoupling self-supervision, which comprises the following steps: 1, a frame routing mechanism in self-supervised form is constructed, which helps the network identify which video frames in a video contain abundant motion information and which contain insufficient motion information; 2, a motion segmentation network and an image segmentation network are constructed in a self-supervised manner, and when the motion information in a video frame is sufficient, the camouflaged/salient object is detected with the motion segmentation network; when the motion information in a frame is insufficient, the camouflaged/salient object is detected with the image segmentation network; 3, the decoupled self-supervised network framework built in the method can complete both video camouflaged object detection and salient object detection without relying on any data annotation.

Description

Video camouflage and salient object detection method based on decoupling self-supervision
Technical Field
The invention relates to a video camouflage and salient object detection method, in particular to a video camouflage and salient object detection method based on decoupling self-supervision.
Background
In recent years, with the rapid development of deep convolutional networks, camouflaged and salient object detection has made great breakthroughs. Compared with traditional camouflage and salient object detection algorithms, the accuracy of deep-learning-based methods is greatly improved: a deep neural network can acquire high-level semantic information of an image, and with this information camouflaged and salient objects in a video can be detected more accurately. For example, Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli, "See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks", in CVPR, 2019; Hala Lamdouar, Charig Yang, Weidi Xie, and Andrew Zisserman, "Betrayed by Motion: Camouflaged Object Discovery via Motion Segmentation", in ACCV, 2020; and the video salient object detection work of Miao Zhang, Jie Liu, Yifei Wang, Yongri Piao, Shunyu Yao, et al. all design deep networks that fuse the motion (optical flow) information of a video with its contextual appearance information, so that camouflaged or salient objects can be detected more directly and robustly.
Although these methods further improve the accuracy of camouflaged and salient object detection through improvements to the network structure, they share a common defect: they cannot correctly identify which frames in a video sequence have insufficient motion information, and directly fusing such insufficient optical flow information with the context information degrades the detection performance of the network. To solve this problem, a decoupling idea is proposed herein: instead of directly fusing context information and motion information to complete detection, two independent networks are designed, which use the optical flow information and the context information of a video sequence respectively to complete detection. Meanwhile, in order to further widen the usage scenarios of the network, a self-supervised network model is designed, so that the network model proposed herein can complete the detection task without annotated data.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the technical problem of the prior art and provides a video camouflage and salient object detection method based on decoupling self-supervision.
In order to solve the technical problems, the invention discloses a video camouflage and salient object detection method based on decoupling self-supervision.
The method disclosed by the invention first designs a frame routing mechanism, which correctly identifies which frames in a video sequence have insufficient motion information and which frames have sufficient motion information. Meanwhile, two independent networks, a motion segmentation network and an image segmentation network, are designed. The motion segmentation network processes the video frames with sufficient optical flow information: the optical flow information of these frames is input to obtain the corresponding detection results. The image segmentation network processes the video frames with insufficient optical flow information: the RGB image information of these frames is input to obtain the corresponding detection results.
The method comprises the following specific steps:
step 1, constructing a decoupling self-supervision video camouflage and salient object detection model; the model comprises: an adaptive frame routing mechanism, a motion segmentation network and an image segmentation network;
the self-adaptive frame routing mechanism is used for carrying out sufficiency judgment on motion information of video frames in a target video;
sending the video frame with sufficient motion information selected by the self-adaptive frame routing mechanism into the motion segmentation network for processing; sending the video frames with insufficient motion information selected by the self-adaptive frame routing mechanism into the image segmentation network for processing;
combining the processing results of the motion segmentation network and the image segmentation network together to obtain the corresponding detection result of the video frame in the target video;
step 2, training the decoupling self-supervision video camouflage and salient object detection model: inputting a camouflage and salient object training video set into the decoupling self-supervision video camouflage and salient object detection model, training the adaptive frame routing mechanism, the motion segmentation network and the image segmentation network, and performing iterative optimization on the decoupling self-supervision video camouflage and salient object detection model;
and 3, inputting the target video to be detected into the trained decoupling self-monitoring video camouflage and salient object detection model for detection, and completing the decoupling self-monitoring-based video camouflage and salient object detection.
The training and construction method of the adaptive frame routing mechanism AFR in the step 1 comprises the following steps:
step 1-1, generating a training sample for training an adaptive frame routing mechanism (AFR);
step 1-2, training an adaptive frame routing mechanism AFR;
and 1-3, identifying whether the motion information of each frame in the target video is sufficient by using a trained adaptive frame routing mechanism AFR.
The training samples in step 1-1 comprise: easily decomposed optical flow map (EDP) frames, whose optical flow information is relatively clear, so that the frames are relatively easy to decompose; and hard-to-decompose optical flow map (HDP) frames, whose optical flow information is relatively disordered, so that the frames are not easy to decompose;
among them, the easily decomposed optical flow map EDP (Easily Decomposed) frames are obtained directly from the optical flow maps corresponding to the videos in the training set, because the motion information of most video frames in the training set is sufficient and easy to decompose; the hard-to-decompose optical flow map HDP (Hardly Decomposed) frames are generated by a pseudo motion generation module PMG (Pseudo Motion Generation), and the generation process comprises the following steps:
selecting a static picture as the input image, and cropping from it a sequence u' ∈ R^(N×L×L), where N and L are respectively the number of cropped frames and the crop size, and R denotes the space in which the sequence u' lies; a speed parameter s = (v_x, v_y) determines the moving distance of the cropped frames in the horizontal and vertical directions; the horizontal moving speed v_x and the vertical moving speed v_y are selected from the set S = {-K, ..., -1, 0, 1, ..., K}, where K represents the maximum speed;
for an input image of size H × W, the moving distance D = (D_x, D_y) is determined by the speed parameter s, where D_x represents the displacement in the horizontal x-direction and D_y represents the displacement in the vertical y-direction;
a cropping start point p_start is randomly selected on the input image, and the cropping end point is p_end = p_start + D; cropping yields the image sequence u'; finally, the image sequence u' is converted into an optical flow sequence u'_f with an existing optical flow estimation algorithm, giving the hard-to-decompose optical flow map HDP frames.
Through the above steps, a series of HDP frames and EDP frames are obtained and used to train the AFR. It is noted that if the motion information of a video frame is sufficient, its corresponding optical flow map is easy to decompose; if the motion information is insufficient, its corresponding optical flow map is also difficult to decompose. Therefore, after the AFR is trained with the HDP frames and EDP frames, it can judge whether the motion information of a given frame is sufficient; a code sketch of the pseudo motion generation follows.
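For illustration, a minimal Python sketch of the pseudo motion generation is given below; it slides an L × L crop window across a static picture at a random speed s = (v_x, v_y) and, following the fuller description in the embodiment, pastes each moving crop back at the fixed start position p_start. The function name, the use of NumPy and the boundary clamping are assumptions made for this sketch and are not prescribed by the method.

import numpy as np

def pseudo_motion_sequence(image, n_frames=8, crop=64, k_max=5, seed=None):
    """Synthesize a hard-to-decompose (HDP) clip from one static picture.

    image   : H x W x C uint8 array (a single static picture)
    n_frames: N, number of frames in the synthesized clip
    crop    : L, spatial size of the cropped window
    k_max   : K, maximum speed in pixels per frame
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    vx = int(rng.integers(-k_max, k_max + 1))   # horizontal speed v_x drawn from S = {-K, ..., K}
    vy = int(rng.integers(-k_max, k_max + 1))   # vertical speed v_y drawn from S = {-K, ..., K}
    x0 = int(rng.integers(0, w - crop + 1))     # cropping start point p_start
    y0 = int(rng.integers(0, h - crop + 1))
    frames = []
    for n in range(n_frames):
        # window position at frame n, clamped so the crop stays inside the image
        x = int(np.clip(x0 + n * vx, 0, w - crop))
        y = int(np.clip(y0 + n * vy, 0, h - crop))
        frame = image.copy()
        frame[y0:y0 + crop, x0:x0 + crop] = image[y:y + crop, x:x + crop]
        frames.append(frame)
    return np.stack(frames, axis=0)             # sequence u' of shape (N, H, W, C)

The optical flow of such a clip contains motion that does not outline a coherent foreground object, which is what makes the resulting flow maps hard to decompose.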
The method for training the adaptive frame routing mechanism AFR described in step 1-2 comprises:
training the adaptive frame routing mechanism AFR using an asymmetric loss, which is defined as:
L_q(u) = [(a+1)^b - (a+u)^b] / b
where the first parameter a = 1 and the second parameter b = 2, and u is the cross-entropy loss between the true label y of a training sample and its predicted label ŷ; a sketch of this loss follows.
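A minimal PyTorch sketch of the asymmetric loss is given below; it assumes the AFR is a binary classifier and that u is the per-sample binary cross-entropy between the true and predicted labels, following the definition above. The interface (logits in, scalar loss out) is an assumption for illustration.

import torch
import torch.nn.functional as F

def asymmetric_loss(logits, labels, a=1.0, b=2.0):
    # logits: raw AFR outputs, shape (batch,); labels: 1 for EDP frames, 0 for HDP frames
    u = F.binary_cross_entropy_with_logits(logits, labels.float(), reduction="none")
    # L_q(u) = [(a+1)^b - (a+u)^b] / b, implemented exactly as written in the text
    return (((a + 1.0) ** b - (a + u) ** b) / b).mean()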
The identification method in step 1-3 comprises the following steps:
the input of the decoupling self-supervision video camouflage and salient object detection model is a video sequence X_R of T_a frames and the corresponding optical flow sequence X_F, where T_a is the number of input frames and H and W are the height and width of the input frames;
the frames containing sufficient motion information and the frames with insufficient motion information are selected using the adaptive frame routing mechanism AFR, where T_a = T_m + T_c; T_m denotes the number of frames with sufficient motion information and T_c denotes the number of frames with insufficient motion information; a routing sketch follows.
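A short sketch of this routing step in PyTorch is given below; the AFR interface (an optical flow map in, a probability of sufficient motion out), the 3-channel flow rendering and the 0.5 threshold are assumptions for illustration.

import torch

@torch.no_grad()
def route_frames(afr, rgb_frames, flow_frames, threshold=0.5):
    # rgb_frames : (T_a, 3, H, W) input video frames
    # flow_frames: (T_a, 3, H, W) corresponding optical flow maps (rendered as images)
    scores = torch.sigmoid(afr(flow_frames)).squeeze(-1)   # per-frame sufficiency score
    sufficient = scores >= threshold                        # True for the T_m motion-sufficient frames
    # motion-sufficient flow maps go to the MS branch, the remaining RGB frames go to the CS branch
    return flow_frames[sufficient], rgb_frames[~sufficient], sufficient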
The construction method of the motion segmentation network and the image segmentation network in the step 1 comprises the following steps:
step 1-4, constructing a motion segmentation network MS; for segmenting foreground objects from the motion representation;
step 1-5, constructing an image segmentation network CS; for segmenting foreground objects from RGB images.
The motion segmentation network MS described in steps 1-4 comprises three components: a convolutional neural network (CNN) encoder for extracting a feature representation; a generation model for generating foreground and background representations; and a CNN decoder for decoding the foreground and background representations to the final output;
let X_f be a single optical flow map; first, the optical flow map X_f is sent to a CNN encoder Φ_enc, which outputs a low-resolution feature F ∈ R^(H_0×W_0×D), where H_0 and W_0 represent the spatial dimensions of the output feature and D represents the channel size;
for this feature F, the query vector Z_q^(t) is updated a total of T times, where Z_q^(t) represents the query vector after the t-th update and q ∈ {0, 1} is the category associated with the query embedding, 0 representing the background and 1 representing the foreground; the query vectors are learnable and are initialized with random weights drawn from a Gaussian distribution, where μ and σ are the mean and variance of the Gaussian distribution and d is the size of the weight vector; Z^(t) ∈ R^(2×d) denotes the query vectors of all categories; in the subsequent process, Z_0^(t) and Z_1^(t) are treated as a whole Z^(t) and updated simultaneously;
the query vector Z^(t+1) is updated using the feature F and Z^(t);
first, a 1×1 convolutional layer is used to reduce the channels of F and flatten its spatial dimensions, yielding a feature F', whose feature length is L = H_0 × W_0; meanwhile, a position vector PE is added to F' to enhance the extraction of spatial information, giving a new feature representation that augments F' with the position vector; then two multi-layer perceptron (MLP) layers are used, each consisting of three fully connected layers and a rectified linear unit layer, to compute the query value and the key value of a self-attention mechanism;
the attention A^(t) is obtained through a normalized exponential (Softmax) function, and the attention mechanism computes a weighted sum U^(t) of the features over the spatial dimension;
the query vector Z^(t) is finally updated by a gated recurrent unit GRU as:
Z^(t+1) = GRU(U^(t), Z^(t))
where U^(t) and Z^(t) are the input state and the hidden state; the generation model is iterated 3 times, and the output is O = {O_f, O_b}, where O_f represents the foreground query vector and O_b represents the background query vector; in the decoding process, the two vectors are broadcast onto a spatial position-coded two-dimensional grid; a sketch of this query update loop follows.
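The update loop described above resembles a two-query (background/foreground) attention module refined by a GRU. The PyTorch sketch below illustrates one plausible form of the T update iterations; the layer sizes, the exact MLP layout and the scaling inside the softmax are assumptions and not the patent's exact implementation.

import torch
import torch.nn as nn

class QueryUpdater(nn.Module):
    # Two learnable queries (q = 0 background, q = 1 foreground) attend over the
    # flattened, position-augmented encoder feature and are refined T times with a GRU.
    def __init__(self, feat_dim=256, d=128, iters=3):
        super().__init__()
        self.iters = iters
        self.queries = nn.Parameter(torch.randn(2, d) * 0.02)      # Z^(0), Gaussian random init
        self.reduce = nn.Conv2d(feat_dim, d, kernel_size=1)        # 1x1 conv channel reduction
        def mlp():
            return nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.mlp_q, self.mlp_k = mlp(), mlp()                      # MLPs for query and key
        self.proj_v = nn.Linear(d, d)
        self.gru = nn.GRUCell(d, d)

    def forward(self, feat, pos):
        # feat: (B, feat_dim, H0, W0) encoder output F; pos: (B, d, H0, W0) position encoding PE
        b = feat.size(0)
        f = (self.reduce(feat) + pos).flatten(2).transpose(1, 2)   # F' of shape (B, L, d), L = H0*W0
        z = self.queries.unsqueeze(0).expand(b, -1, -1)            # Z^(t), (B, 2, d)
        for _ in range(self.iters):
            q = self.mlp_q(z)                                      # query from Z^(t)
            k = self.mlp_k(f)                                      # key from F'
            attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # A^(t)
            u = attn @ self.proj_v(f)                              # weighted sum U^(t), (B, 2, d)
            z = self.gru(u.reshape(-1, u.size(-1)),
                         z.reshape(-1, z.size(-1))).view(b, 2, -1) # Z^(t+1) = GRU(U^(t), Z^(t))
        return z                                                   # O = {O_b, O_f}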
finally, the CNN decoder Φ_dec decodes {O_f, O_b} separately back to the original resolution:
the decoding produces the reconstructed foreground optical flow field together with its MASK image α_fore and the reconstructed background optical flow field together with its MASK image α_back; the final reconstructed optical flow map is the recombination of the two flow fields weighted by their corresponding masks;
here, a Softmax is applied to {α_fore, α_back} to ensure α_fore + α_back = 1; the MS branch completes training in a self-supervised manner, and the loss function comprises a reconstruction loss L_rec, which measures the difference between the input optical flow map and its reconstruction, and an entropy regularization loss L_ent;
L_ent, whose purpose is to make the masks binary so that the final segmentation result can be obtained, is defined as:
L_ent = -(α_fore·log(α_fore) + α_back·log(α_back))
when α_fore and α_back clearly represent the foreground and the background, i.e. the masks are in one-hot form, L_ent is zero; when α_fore and α_back cannot separate the foreground and the background and their values are close to each other, L_ent is at its maximum;
finally, the result O_F corresponding to the sequence X_F is obtained through this training procedure.
The input to the image segmentation network CS described in step 1-5 is the video sequence X_R, and the output O_R is obtained through a single-image camouflaged object detection method (refer to: Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan, "Simultaneously Localize, Segment and Rank the Camouflaged Objects," in CVPR, 2021) or a single-image salient object detection method (refer to: Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, and Zhuowen Tu, "Deeply Supervised Salient Object Detection with Short Connections," TPAMI, 2019).
The method for training the decoupling self-supervision video camouflage and salient object detection model in step 2 comprises the following steps:
step 2-1, data preprocessing: performing data enhancement, such as random flipping and random cropping, on the camouflage and salient object training set to be input into the decoupling self-supervision video camouflage and salient object detection model;
step 2-2, training the AFR classifier with data generated by the pseudo motion generation module PMG, so that it can distinguish whether the motion information contained in a piece of video is sufficient;
performing self-supervised training on the motion segmentation network MS, so that complete and accurate object detection can be performed from the optical flow map;
using the generation results of the motion segmentation network MS to supervise the training of the image segmentation network CS, so that the image segmentation network CS can perform complete and accurate object detection from RGB images;
the results of the motion segmentation network MS and the image segmentation network CS mutually cross-supervise each other, so that the network gradually generates complete and accurate camouflage and saliency object maps; after the network is trained repeatedly for multiple rounds, the final network model parameters are saved. A sketch of one such training iteration follows.
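The sketch below illustrates one training iteration with this cross-supervision in PyTorch. The model interfaces (the MS returning its mask and losses, the CS returning a mask) and the use of binary cross-entropy on the exchanged pseudo labels are assumptions; the reverse direction, in which CS predictions supervise the MS, is symmetric and omitted for brevity.

import torch
import torch.nn.functional as F

def train_step(afr, ms, cs, rgb, flow, opt_ms, opt_cs):
    # Route frames with the frozen, pre-trained AFR classifier
    with torch.no_grad():
        sufficient = torch.sigmoid(afr(flow)).squeeze(-1) >= 0.5

    # 1) self-supervised update of the motion segmentation network MS
    mask_ms, l_rec, l_ent = ms(flow[sufficient])
    loss_ms = l_rec + l_ent
    opt_ms.zero_grad(); loss_ms.backward(); opt_ms.step()

    # 2) the MS predictions act as pseudo labels that supervise the image segmentation network CS
    pseudo = (mask_ms.detach() > 0.5).float()
    mask_cs = cs(rgb[sufficient])
    loss_cs = F.binary_cross_entropy(mask_cs, pseudo)
    opt_cs.zero_grad(); loss_cs.backward(); opt_cs.step()
    return loss_ms.item(), loss_cs.item()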
The method for inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for detection in step 3 comprises: inputting the target image to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for inference to obtain the corresponding camouflage and salient object segmentation images.
Beneficial effects:
The invention proposes a decoupling idea: instead of directly fusing context information and motion information to complete detection, two independent networks are designed, which use the optical flow information and the context information of a video sequence respectively to complete detection. Meanwhile, in order to further widen the usage scenarios of the network, a self-supervised network model is designed, so that the proposed network model can complete the detection task without annotated data.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic process flow diagram of the present invention.
Fig. 2 is a schematic diagram of a pseudo motion generation process.
FIG. 3 is a schematic diagram of the detection results of the present invention.
Detailed Description
A video camouflage and salient object detection method based on decoupling self-supervision is disclosed, as shown in figure 1, and comprises the following steps:
step 1, constructing an adaptive frame routing mechanism (AFR): building a self-adaptive frame routing mechanism, and distinguishing which video frames have insufficient motion information and which video frames have sufficient motion information in a section of video;
step 2, constructing a motion segmentation network and an image segmentation network: the AFR mechanism of step 1 is used to select the video frames with sufficient motion information and send them into the motion segmentation network; at the same time, the AFR is used to select the video frames with insufficient motion information and send them into the image segmentation network; finally, the results of the motion segmentation network and the image segmentation network are spliced together to obtain the detection results for the video frames;
step 3, the decoupling self-supervision video camouflage and salient object detection model provided by the method comprises a training stage and a testing stage: in the model training stage, a camouflage and salient object training video set is input into the decoupled self-supervised network, so that the frame routing mechanism (AFR), the motion segmentation network and the image segmentation network are trained; in the model testing stage, a video containing the salient/camouflaged objects to be detected is input into the trained model to obtain the corresponding salient/camouflaged object detection results;
wherein, step 1 includes the following steps:
step 1-1, generating a training sample for training AFR.
Optical flow information is used herein to represent the motion information of each sample: when the motion information of a frame is sufficient, its corresponding optical flow map is easily decomposed (EDP); if the motion information of a frame is insufficient, its corresponding optical flow map is hard to decompose (HDP). Therefore, in order to train the AFR, the corresponding training samples need to be created. For EDP frames, the optical flow maps corresponding to the training videos are taken directly; HDP frames are generated by a new pseudo motion generation module (PMG), whose process is shown in fig. 2. The specific flow of the PMG is as follows:
A static picture and a speed parameter s are selected, and a sequence u ∈ R^(N×L×L) is cropped from the static picture, N and L being respectively the number of cropped frames and the crop size. The speed parameter s = (v_x, v_y) determines the moving distance of the cropped frames in the horizontal and vertical directions; v_x and v_y are selected from the set S = {-K, ..., -1, 0, 1, ..., K}, where K represents the maximum speed.
For an input image of size H × W, the moving distance D = (D_x, D_y) is determined by the speed parameter s, D_x being the displacement in the horizontal x-direction and D_y the displacement in the vertical y-direction.
then, a clipping start point p is randomly selected on the image start If the clipping end point is p end =p start + D. Cutting out the sequence u, and covering the sequence u on the p of the original image start Obtaining the sequence u' epsilon R at the position N×H×W . Finally, the image sequence u ' is converted into an optical flow sequence u ' using RAFT ' f Thereby obtaining an HDP frame. Finally, we use asymmetric loss training AFR, which is defined as:
L q (u)=[(a+1) b -(a+u) b ]/b
where a =1 and b =2,u is the cross-entropy loss, which can be expressed as:
Figure BDA0003882111240000084
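The patent names RAFT as the optical flow estimator; the sketch below uses the RAFT implementation shipped with torchvision to turn consecutive frames of the synthesized sequence u' into the flow sequence u'_f. This particular library choice, the preprocessing call and the tensor shapes are assumptions; any RAFT implementation would serve.

import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

@torch.no_grad()
def flow_sequence(frames):
    # frames: (N, 3, H, W) float tensor in [0, 1]; H and W should be divisible by 8 for RAFT
    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()
    img1, img2 = weights.transforms()(frames[:-1], frames[1:])   # RAFT-specific normalization
    flows = model(img1, img2)                                    # list of iterative flow refinements
    return flows[-1]                                             # final estimate: (N-1, 2, H, W) flow maps u'_f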
and 1-2, identifying whether the motion information of each frame in a video is sufficient by using the trained AFR.
The input of the decoupled self-supervised network model is a video sequence X_R of T_a frames and the corresponding optical flow sequence X_F, where T_a is the number of input frames and H, W are the height and width of the input frames. The AFR is used to select the frames that contain sufficient motion information and the frames with insufficient motion information, where T_a = T_m + T_c.
The step 2 comprises the following steps:
step 2-1, constructing a motion segmentation network MS for segmenting foreground objects from motion representations;
the motion split network MS comprises three components: a CNN encoder extracts feature representation; 2. a model is generated, and the model is generated,
for generating foreground and background representations; a CNN decoder to decode the foreground and background representations, respectively, to a final output. To simplify the explanation of the procedure, we show a single light ray diagram X f For example. First, a light beam pattern X f Sent to a CNN encoder
Code device phi enc It outputs a low resolution feature:
Figure BDA0003882111240000096
wherein H 0 And W 0 Respectively representing the spatial dimensions of the output features. D denotes the channel size.
For this feature F, we propose to generate model update query vectors
Figure BDA0003882111240000097
For a time of T, wherein
Figure BDA0003882111240000098
Represents the query vector after the t-th update, q ∈ [0,1 ∈]Is a category relevant to the query embedding. "0" represents the background and "1" represents the foreground. The query vector is learnable and is initialized with random weights:
Figure BDA0003882111240000099
wherein, mu and sigma are mean and variance of Gaussian distribution, and d is the magnitude of weight vector. By Z (t) ∈R 2×d To represent query vectors of all categories. In the course of the subsequent processes, it is,
Figure BDA00038821112400000910
and
Figure BDA00038821112400000911
will be taken as a whole Z (t) And is updated at the same time.
Query vector Z (t+1) Using features F and Z (t) And (6) updating. First, a 1 × 1 convolutional layer is used to reduce the channel of F and flatten the spatial dimension of F, resulting in a characteristic F':
Figure BDA00038821112400000912
wherein, L = H 0 ×W 0 . At the same time, the position vector is added to F' to enhance the extraction of spatial information. Namely, it is
Figure BDA00038821112400000913
Figure BDA00038821112400000914
Where PE is a location vector. Then two MLP layers are used
Figure BDA00038821112400000915
And
Figure BDA00038821112400000916
each layer has three FC layers and a ReLU layer. The design is to make the self-attribute mechanism have higher flexibility when calculating the query and the key:
Figure BDA00038821112400000917
obtaining attention A through Softmax function (t)
Figure BDA0003882111240000101
The Attention mechanism calculates a weighted sum of features in the spatial dimension by:
Figure BDA0003882111240000102
query vector Z (t) Finally updated by GRU as:
Z (t+1) =GRU(U (t) ,Z (t) )
where U^(t) and Z^(t) are the input state and the hidden state. The generation model is iterated 3 times, and the output is O:
O = {O_f, O_b}, where O_f represents the foreground query vector and O_b represents the background query vector. During decoding, the two vectors are broadcast onto a two-dimensional grid with a learnable spatial position code.
Finally, the CNN decoder Φ_dec decodes {O_f, O_b} separately to the original resolution, producing the reconstructed foreground and background optical flow fields together with the corresponding MASK images α_fore and α_back; the final reconstructed optical flow map is the recombination of the two flow fields weighted by their masks.
Here, a Softmax is applied to {α_fore, α_back} to ensure α_fore + α_back = 1. So that the MS branch can complete training in a self-supervised manner, the loss function comprises the reconstruction loss L_rec between the input optical flow map and its reconstruction and the entropy regularization loss L_ent.
The purpose of L_ent is to make the masks binary, from which the final segmentation result is obtained. L_ent is defined as:
L_ent = -(α_fore·log(α_fore) + α_back·log(α_back))
As can be seen from this loss, when the masks are in one-hot form L_ent is zero, and when the foreground and background probabilities are equal L_ent is at its maximum. Finally, in this way the result O_F corresponding to the sequence X_F can be obtained; a sketch of the two MS losses follows.
And 2-2, constructing an image segmentation network CS for segmenting the foreground object from the RGB image.
The input to the CS is the video sequence X_R, and the output O_R is obtained by an existing single-image camouflaged object detection method or single-image salient object detection method.
Step 3 comprises a training phase and a testing phase:
step 3-1, data preprocessing: data enhancement such as random flipping and random cropping is performed on the camouflage object training set and the saliency object training set to be input into the decoupled self-supervised network;
step 3-2, first, the AFR classifier is trained with data generated by the PMG module, so that it can distinguish whether the motion information contained in a piece of video is sufficient; second, the MS module is trained in a self-supervised manner, so that complete and accurate object detection can be performed from the optical flow map; third, the CS module is supervised and trained with the generation results of the MS module, so that the CS module can perform complete and accurate object detection from the RGB images; fourth, the results of the MS and the CS mutually cross-supervise each other, so that the network gradually generates complete and accurate camouflage and saliency object maps, and the final network model parameters are saved after the network is trained repeatedly for multiple rounds;
step 3-3, testing the model framework: the images are input into the trained decoupled self-supervised network for inference to obtain the corresponding camouflage and salient object segmentation images.
Example:
a video camouflage and salient object detection method based on decoupling self-supervision is implemented according to the following steps as shown in figure 1:
1. Constructing the decoupled self-supervised network G:
inputting: a video collection of camouflaged or salient objects.
And (3) outputting: a corresponding disguised or salient object segmentation image, and a loss function.
1.1 Constructing the decoupled self-supervised network model framework and extracting the optical flow;
the network input of the decoupled self-supervised network model framework is a video sequence X_R of T_a frames and the corresponding optical flow sequence X_F; X_F is extracted by the optical flow estimation algorithm RAFT; T_a is the number of input frames, and H, W are the height and width of the input frames.
1.2 An adaptive frame routing mechanism AFR is designed to distinguish which frames have sufficient optical flow information and which frames have insufficient optical flow information. The frames with sufficient optical flow information are sent to the motion segmentation network (MS), and the corresponding segmentation results are obtained using the optical flow information. The frames with insufficient optical flow information are sent to the image segmentation network (CS), and the corresponding segmentation results are obtained using the RGB image information. The segmentation results are then used to calculate the loss function and perform parameter optimization.
2. Training the overall framework;
training of the two-branch deep convolutional neural network comprises a data preprocessing stage, a model framework training stage and a testing stage.
3.1 Data preprocessing;
the input video sets of camouflaged and salient objects are adjusted by operations such as stretching and flipping, and the adjusted video sets are input into the decoupled self-supervised network.
Inputting: video collections of camouflaged and salient objects.
And (3) outputting: a video collection of data enhanced camouflaged and salient objects.
Geometric augmentation: changing the image geometry by methods such as translation, rotation and shearing can enhance the generalization capability of the model;
3.2 model framework training
Inputting: data enhanced video collection of camouflaged and salient objects
And (3) outputting: video set segmentation results of camouflaged and salient objects and a loss function.
During training, a mini-batch stochastic gradient descent (SGD) optimization algorithm may be used, with a batch size of 32, a momentum of 0.9, and a weight decay of 1e-5. The learning rate is set to 1e-4 and the maximum number of epochs is set to 100. The training images are resized to 352 × 352 as input to the entire network. A configuration sketch follows.
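A sketch of this training configuration in PyTorch is given below; the model and dataset objects are placeholders.

import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((352, 352)),   # training images are resized to 352 x 352
    transforms.ToTensor(),
])

def build_optimizer(model):
    # mini-batch SGD: lr 1e-4, momentum 0.9, weight decay 1e-5
    return torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=1e-5)

# loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)   # batch size 32
# for epoch in range(100):                                                     # at most 100 epochs
#     ...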
3.3 testing the model framework;
inputting: a video set of camouflaged and salient objects;
and (3) outputting: corresponding camouflage and saliency object cut images;
the model detection effect in the present invention is shown in fig. 3, which shows a total of 6 video sequences. Wherein, sequence 1 to sequence 3 represent saliency detection video sequences, and sequence 4 to sequence 6 represent masquerading detection video sequences. For each sequence, the first line represents the input video sequence, the second line represents the segmentation result, and the third line represents the optical flow information for each frame of video. The optical flow information of the first three columns is sufficient, the model completes the segmentation in the MS by using the optical flow information, the video information of the second two columns is insufficient in motion, and the model completes the segmentation in the CS by using the RGB picture information.
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit, where the computer storage medium is capable of storing a computer program, and the computer program, when executed by the data processing unit, may execute the inventive content of the video camouflaging and salient object detection method based on decoupled self-supervision and some or all of the steps in each embodiment provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.
It is clear to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and its corresponding general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention, or the portions thereof that contribute to the prior art, may be embodied in the form of a computer program, that is, a software product, which may be stored in a storage medium and includes several instructions for enabling a device including a data processing unit (which may be a personal computer, a server, a single-chip microcontroller MCU, a network device, or the like) to execute the methods described in the embodiments of the present invention or in some portions of the embodiments.
The present invention provides an idea and a method for video camouflage and salient object detection based on decoupling self-supervision; there are many methods and ways to implement this technical solution, and the above description is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, several improvements and embellishments can be made without departing from the principle of the present invention, and these should also be regarded as falling within the protection scope of the present invention. All components not specified in this embodiment can be implemented by the prior art.

Claims (10)

1. A video camouflage and salient object detection method based on decoupling self-supervision is characterized by comprising the following steps:
step 1, constructing a decoupling self-supervision video camouflage and salient object detection model; the model comprises: an adaptive frame routing mechanism, a motion segmentation network and an image segmentation network;
the self-adaptive frame routing mechanism is used for carrying out sufficiency judgment on motion information of video frames in a target video;
sending the video frame with sufficient motion information selected by the self-adaptive frame routing mechanism into the motion segmentation network for processing; sending the video frames with insufficient motion information selected by the self-adaptive frame routing mechanism into the image segmentation network for processing;
combining the processing results of the motion segmentation network and the image segmentation network together to obtain the corresponding detection result of the video frame in the target video;
step 2, training the decoupling self-supervision video camouflage and salient object detection model: inputting a camouflage and salient object training video set into the decoupling self-supervision video camouflage and salient object detection model, training the adaptive frame routing mechanism, the motion segmentation network and the image segmentation network, and performing iterative optimization on the decoupling self-supervision video camouflage and salient object detection model;
and 3, inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for detection, and completing the decoupling self-supervision-based video camouflage and salient object detection.
2. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 1, wherein the method for training and constructing the adaptive frame routing mechanism AFR in step 1 comprises the following steps:
step 1-1, generating a training sample for training an adaptive frame routing mechanism (AFR);
step 1-2, training an adaptive frame routing mechanism (AFR);
and 1-3, identifying whether the motion information of each frame in the target video is sufficient by using a trained adaptive frame routing mechanism AFR.
3. The method for detecting video camouflaging and salient objects based on decoupled self-supervision as claimed in claim 2, wherein the training samples in step 1-1 comprise: easily decomposed optical flow map EDP frames and hard-to-decompose optical flow map HDP frames;
wherein, for the easily decomposed EDP frames, the optical flow maps corresponding to the videos in the training set are taken directly; the hard-to-decompose optical flow map HDP frames are generated by a pseudo motion generation module PMG, and the generation process comprises the following steps:
selecting a static picture as the input image, and cropping from it a sequence u' ∈ R^(N×L×L), N and L being respectively the number of cropped frames and the crop size; a speed parameter s = (v_x, v_y) determines the moving distance of the cropped frames in the horizontal and vertical directions; the horizontal moving speed v_x and the vertical moving speed v_y are selected from the set S = {-K, ..., -1, 0, 1, ..., K}, where K represents the maximum speed;
for an input image of size H × W, the moving distance D = (D_x, D_y) is determined by the speed parameter s, where D_x represents the displacement in the horizontal x-direction and D_y represents the displacement in the vertical y-direction;
a cropping start point p_start is randomly selected on the input image, and the cropping end point is p_end = p_start + D; cropping yields the image sequence u'; finally, the image sequence u' is converted into an optical flow sequence u'_f, and the hard-to-decompose optical flow map HDP frames are obtained.
4. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 3, wherein the method for training adaptive frame routing mechanism AFR in step 1-2 comprises:
training the adaptive frame routing mechanism AFR using an asymmetric loss, which is defined as:
L_q(u) = [(a+1)^b - (a+u)^b] / b
where the first parameter a = 1 and the second parameter b = 2, and u is the cross-entropy loss between the true label y of a training sample and its predicted label ŷ.
5. The video camouflaging and salient object detection method based on decoupled self-supervision as claimed in claim 4, wherein the identification method in steps 1-3 comprises:
the input of the decoupling self-supervision video camouflage and salient object detection model is a video sequence X_R of T_a frames and the corresponding optical flow sequence X_F, where T_a is the number of input frames and H and W are the height and width of the input frames;
the frames containing sufficient motion information and the frames with insufficient motion information are selected using the adaptive frame routing mechanism AFR, where T_a = T_m + T_c; T_m denotes the number of frames with sufficient motion information and T_c denotes the number of frames with insufficient motion information.
6. The method for detecting video camouflaging and salient objects based on decoupling self-supervision as claimed in claim 5, wherein the method for constructing the motion segmentation network and the image segmentation network in step 1 comprises the following steps:
step 1-4, constructing a motion segmentation network MS; for segmenting foreground objects from the motion representation;
step 1-5, constructing an image segmentation network CS; for segmenting foreground objects from RGB images.
7. The method for video camouflaging and salient object detection based on decoupled self-supervision according to claim 6, wherein the motion segmentation network MS in steps 1-4 comprises three components: a convolutional neural network (CNN) encoder for extracting a feature representation; a generation model for generating foreground and background representations; and a CNN decoder for decoding the foreground and background representations to the final output;
let X_f be a single optical flow map; first, the optical flow map X_f is sent to a CNN encoder Φ_enc, which outputs a low-resolution feature F ∈ R^(H_0×W_0×D), where H_0 and W_0 represent the spatial dimensions of the output feature and D represents the channel size;
for this feature F, the query vector Z_q^(t) is updated a total of T times, where Z_q^(t) represents the query vector after the t-th update and q ∈ {0, 1} is the category associated with the query embedding, 0 representing the background and 1 representing the foreground; the query vectors are initialized with random weights drawn from a Gaussian distribution, where μ and σ are the mean and variance of the Gaussian distribution and d is the size of the weight vector; Z^(t) ∈ R^(2×d) denotes the query vectors of all categories; in the subsequent process, Z_0^(t) and Z_1^(t) are treated as a whole Z^(t) and updated simultaneously;
the query vector Z^(t+1) is updated using the feature F and Z^(t);
first, a 1×1 convolutional layer is used to reduce the channels of F and flatten its spatial dimensions, yielding a feature F', whose feature length is L = H_0 × W_0; meanwhile, a position vector PE is added to F' to enhance the extraction of spatial information, giving a new feature representation that augments F' with the position vector; then two multi-layer perceptron MLP layers are used, each consisting of three fully connected layers and a rectified linear unit layer, to compute the query value and the key value of a self-attention mechanism;
the attention A^(t) is obtained through a normalized exponential (Softmax) function, and the attention mechanism computes a weighted sum U^(t) of the features over the spatial dimension;
the query vector Z^(t) is finally updated by the gated recurrent unit GRU as:
Z^(t+1) = GRU(U^(t), Z^(t))
where U^(t) and Z^(t) are the input state and the hidden state; the generation model is iterated 3 times, and the output is O = {O_f, O_b}, where O_f represents the foreground query vector and O_b represents the background query vector; during decoding, the two vectors are broadcast onto a spatial position-coded two-dimensional grid;
finally, the CNN decoder Φ_dec decodes {O_f, O_b} separately to the original resolution:
the decoding produces the reconstructed foreground optical flow field and the reconstructed background optical flow field; α_fore is the MASK image corresponding to the foreground optical flow field and α_back is the MASK image corresponding to the background optical flow field; the final reconstructed optical flow map is the recombination of the two flow fields weighted by their corresponding masks;
wherein a Softmax is applied to {α_fore, α_back} to ensure α_fore + α_back = 1; the MS branch completes training in a self-supervised manner, and the loss function comprises the reconstruction loss L_rec between the input optical flow map and its reconstruction and the entropy regularization loss L_ent;
L_ent is defined as:
L_ent = -(α_fore·log(α_fore) + α_back·log(α_back))
when α_fore and α_back clearly represent the foreground and the background, i.e. the masks are in one-hot form, L_ent is zero; when α_fore and α_back cannot separate the foreground and the background and their values are close to each other, L_ent is at its maximum;
finally, the result O_F corresponding to the sequence X_F is obtained through the above training procedure.
8. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 7, wherein the input of the image segmentation network CS in step 1-5 is the video sequence X_R, and the output O_R is obtained through a single-image camouflaged object detection method or a single-image salient object detection method.
9. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 8, wherein the method for training the video masquerading and salient object detection model based on decoupling self-supervision in step 2 comprises:
step 2-1, data preprocessing: performing data enhancement, such as random flipping and random cropping, on the camouflage and salient object training set to be input into the decoupling self-supervision video camouflage and salient object detection model;
step 2-2, training the AFR classifier with data generated by the pseudo motion generation module PMG, so that it can distinguish whether the motion information contained in a piece of video is sufficient;
performing self-supervised training on the motion segmentation network MS, so that complete and accurate object detection can be performed from the optical flow map;
using the generation results of the motion segmentation network MS to supervise the training of the image segmentation network CS, so that the image segmentation network CS can perform complete and accurate object detection from RGB images;
the results of the motion segmentation network MS and the image segmentation network CS mutually cross-supervise each other, so that the network gradually generates complete and accurate camouflage and saliency object maps, and the final network model parameters are saved after the network is trained repeatedly for multiple rounds.
10. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 9, wherein the method for inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for detection in step 3 comprises: inputting the target image to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for inference to obtain the corresponding camouflage and salient object segmentation images.
CN202211232708.0A 2022-10-10 2022-10-10 Video camouflage and salient object detection method based on decoupling self-supervision Pending CN115565108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211232708.0A CN115565108A (en) 2022-10-10 2022-10-10 Video camouflage and salient object detection method based on decoupling self-supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211232708.0A CN115565108A (en) 2022-10-10 2022-10-10 Video camouflage and salient object detection method based on decoupling self-supervision

Publications (1)

Publication Number Publication Date
CN115565108A true CN115565108A (en) 2023-01-03

Family

ID=84745836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211232708.0A Pending CN115565108A (en) 2022-10-10 2022-10-10 Video camouflage and salient object detection method based on decoupling self-supervision

Country Status (1)

Country Link
CN (1) CN115565108A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935189A (en) * 2023-09-15 2023-10-24 北京理工导航控制科技股份有限公司 Camouflage target detection method and device based on neural network and storage medium
CN116935189B (en) * 2023-09-15 2023-12-05 北京理工导航控制科技股份有限公司 Camouflage target detection method and device based on neural network and storage medium

Similar Documents

Publication Publication Date Title
US11200424B2 (en) Space-time memory network for locating target object in video content
CN106960206B (en) Character recognition method and character recognition system
CN109711463B (en) Attention-based important object detection method
CN112750140B (en) Information mining-based disguised target image segmentation method
CN110570433B (en) Image semantic segmentation model construction method and device based on generation countermeasure network
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN112884802B (en) Attack resistance method based on generation
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN116311214B (en) License plate recognition method and device
Chen et al. Finding arbitrary-oriented ships from remote sensing images using corner detection
CN113065550A (en) Text recognition method based on self-attention mechanism
CN114037640A (en) Image generation method and device
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
US20230154139A1 (en) Systems and methods for contrastive pretraining with video tracking supervision
CN114140831B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN115565108A (en) Video camouflage and salient object detection method based on decoupling self-supervision
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN115861756A (en) Earth background small target identification method based on cascade combination network
Hughes et al. A semi-supervised approach to SAR-optical image matching
CN116665114A (en) Multi-mode-based remote sensing scene identification method, system and medium
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network
CN113780241B (en) Acceleration method and device for detecting remarkable object
CN115965968A (en) Small sample target detection and identification method based on knowledge guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination