CN115565108A - Video camouflage and salient object detection method based on decoupling self-supervision - Google Patents

Video camouflage and salient object detection method based on decoupling self-supervision Download PDF

Info

Publication number
CN115565108A
CN115565108A
Authority
CN
China
Prior art keywords
video
supervision
self
training
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211232708.0A
Other languages
Chinese (zh)
Inventor
黄明江
李文丽
孙德生
薛豪奇
赵鑫
陈伟
邢星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xuchang University
Original Assignee
Xuchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuchang University filed Critical Xuchang University
Priority to CN202211232708.0A priority Critical patent/CN115565108A/en
Publication of CN115565108A publication Critical patent/CN115565108A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video camouflage and salient object detection method based on decoupling self-supervision, which comprises the following steps: 1, a frame routing mechanism in self-supervised form is constructed, which helps the network identify which video frames in a video contain abundant motion information and which contain insufficient motion information; 2, a motion segmentation network and an image segmentation network are constructed in a self-supervised manner, and when the motion information in a video frame is sufficient, the camouflaged/salient object is detected with the motion segmentation network; when the motion information in a frame is insufficient, the camouflaged/salient object is detected with the image segmentation network; 3, the decoupled self-supervised network framework built in the method can complete both video camouflaged object detection and salient object detection without relying on any data annotation.

Description

Video camouflage and salient object detection method based on decoupling self-supervision
Technical Field
The invention relates to a video camouflage and salient object detection method, in particular to a video camouflage and salient object detection method based on decoupling self-supervision.
Background
In recent years, with the rapid development of deep convolutional networks, camouflaged and salient object detection has made great breakthroughs. Compared with traditional camouflage and salient object detection algorithms, the accuracy of deep-learning-based methods is greatly improved: a deep neural network can acquire high-level semantic information of an image, and with this information camouflaged and salient objects in a video can be detected more accurately. For example, Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli, "See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks", in CVPR, 2019; Hala Lamdouar, Charig Yang, Weidi Xie, and Andrew Zisserman, "Betrayed by Motion: Camouflaged Object Discovery via Motion Segmentation", in ACCV, 2020; and the video salient object detection work of Miao Zhang, Jie Liu, Yifei Wang, Yongri Piao, Shunyu Yao, et al. all design deep networks that fuse the motion (optical flow) information of a video with its contextual appearance information, so that camouflaged or salient objects can be detected more directly and robustly.
Although these methods further improve the accuracy of camouflaged and salient object detection through improvements to the network structure, they share a common defect: they cannot correctly identify which frames in a video sequence have insufficient motion information, and directly fusing such insufficient optical flow information with the context information degrades the detection performance of the network. To solve this problem, a decoupling idea is proposed herein: instead of directly fusing context information and motion information to complete detection, two independent networks are designed, which use the optical flow information and the context information of a video sequence respectively to complete detection. Meanwhile, in order to further widen the usage scenarios of the network, a self-supervised network model is designed, so that the network model proposed herein can complete the detection task without annotated data.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the technical problem of the prior art and provides a video camouflage and salient object detection method based on decoupling self-supervision.
In order to solve the technical problems, the invention discloses a video camouflage and salient object detection method based on decoupling self-supervision.
The method disclosed by the invention first designs a frame routing mechanism, which correctly identifies which frames in a video sequence have insufficient motion information and which frames have sufficient motion information. Meanwhile, two independent networks, a motion segmentation network and an image segmentation network, are designed. The motion segmentation network processes the video frames with sufficient optical flow information: the optical flow information of these frames is input to obtain the corresponding detection results. The image segmentation network processes the video frames with insufficient optical flow information: the RGB image information of these frames is input to obtain the corresponding detection results.
The method comprises the following specific steps:
step 1, constructing a decoupling self-supervision video camouflage and salient object detection model; the model comprises: an adaptive frame routing mechanism, a motion segmentation network and an image segmentation network;
the self-adaptive frame routing mechanism is used for carrying out sufficiency judgment on motion information of video frames in a target video;
sending the video frame with sufficient motion information selected by the self-adaptive frame routing mechanism into the motion segmentation network for processing; sending the video frames with insufficient motion information selected by the self-adaptive frame routing mechanism into the image segmentation network for processing;
combining the processing results of the motion segmentation network and the image segmentation network together to obtain the corresponding detection result of the video frame in the target video;
step 2, training the decoupling self-supervision video camouflage and salient object detection model: inputting a camouflage and salient object training video set into the decoupling self-supervision video camouflage and salient object detection model, training the adaptive frame routing mechanism, the motion segmentation network and the image segmentation network, and performing iterative optimization on the decoupling self-supervision video camouflage and salient object detection model;
and 3, inputting the target video to be detected into the trained decoupling self-monitoring video camouflage and salient object detection model for detection, and completing the decoupling self-monitoring-based video camouflage and salient object detection.
The training and construction method of the adaptive frame routing mechanism AFR in the step 1 comprises the following steps:
step 1-1, generating a training sample for training an adaptive frame routing mechanism (AFR);
step 1-2, training an adaptive frame routing mechanism AFR;
and 1-3, identifying whether the motion information of each frame in the target video is sufficient by using a trained adaptive frame routing mechanism AFR.
The training samples in step 1-1 comprise: easily decomposed optical flow map (EDP) frames, whose optical flow information is relatively clear, so that the frames are relatively easy to decompose; and hard-to-decompose optical flow map (HDP) frames, whose optical flow information is relatively disordered, so that the frames are not easy to decompose;
among them, the easily decomposed optical flow map EDP (Easily Decomposed) frames are obtained directly from the optical flow maps corresponding to the videos in the training set, because the motion information of most video frames in the training set is sufficient and easy to decompose; the hard-to-decompose optical flow map HDP (Hardly Decomposed) frames are generated by a pseudo motion generation module PMG (Pseudo Motion Generation), and the generation process comprises the following steps:
selecting a static picture as the input image, and cropping from it a sequence u' ∈ R^(N×L×L), where N and L are respectively the number of cropped frames and the crop size, and R denotes the space in which the sequence u' lies; a speed parameter s = (v_x, v_y) determines the moving distance of the cropped frames in the horizontal and vertical directions; the horizontal moving speed v_x and the vertical moving speed v_y are selected from the set S = {-K, ..., -1, 0, 1, ..., K}, where K represents the maximum speed;
for an input image of size H × W, the moving distance D = (D_x, D_y) is determined by the speed parameter s, where D_x represents the displacement in the horizontal x-direction and D_y represents the displacement in the vertical y-direction;
a cropping start point p_start is randomly selected on the input image, and the cropping end point is p_end = p_start + D; cropping yields the image sequence u'; finally, the image sequence u' is converted into an optical flow sequence u'_f with an existing optical flow estimation algorithm, giving the hard-to-decompose optical flow map HDP frames.
Through the above steps, a series of HDP frames and EDP frames are obtained and used to train the AFR. It is noted that if the motion information of a video frame is sufficient, its corresponding optical flow map is easy to decompose; if the motion information is insufficient, its corresponding optical flow map is also difficult to decompose. Therefore, after the AFR is trained with the HDP frames and EDP frames, it can judge whether the motion information of a given frame is sufficient; a code sketch of the pseudo motion generation follows.
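For illustration, a minimal Python sketch of the pseudo motion generation is given below; it slides an L × L crop window across a static picture at a random speed s = (v_x, v_y) and, following the fuller description in the embodiment, pastes each moving crop back at the fixed start position p_start. The function name, the use of NumPy and the boundary clamping are assumptions made for this sketch and are not prescribed by the method.

import numpy as np

def pseudo_motion_sequence(image, n_frames=8, crop=64, k_max=5, seed=None):
    """Synthesize a hard-to-decompose (HDP) clip from one static picture.

    image   : H x W x C uint8 array (a single static picture)
    n_frames: N, number of frames in the synthesized clip
    crop    : L, spatial size of the cropped window
    k_max   : K, maximum speed in pixels per frame
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    vx = int(rng.integers(-k_max, k_max + 1))   # horizontal speed v_x drawn from S = {-K, ..., K}
    vy = int(rng.integers(-k_max, k_max + 1))   # vertical speed v_y drawn from S = {-K, ..., K}
    x0 = int(rng.integers(0, w - crop + 1))     # cropping start point p_start
    y0 = int(rng.integers(0, h - crop + 1))
    frames = []
    for n in range(n_frames):
        # window position at frame n, clamped so the crop stays inside the image
        x = int(np.clip(x0 + n * vx, 0, w - crop))
        y = int(np.clip(y0 + n * vy, 0, h - crop))
        frame = image.copy()
        frame[y0:y0 + crop, x0:x0 + crop] = image[y:y + crop, x:x + crop]
        frames.append(frame)
    return np.stack(frames, axis=0)             # sequence u' of shape (N, H, W, C)

The optical flow of such a clip contains motion that does not outline a coherent foreground object, which is what makes the resulting flow maps hard to decompose.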
The method for training the adaptive frame routing mechanism AFR described in step 1-2 comprises:
training the adaptive frame routing mechanism AFR using an asymmetric loss, which is defined as:
L_q(u) = [(a+1)^b - (a+u)^b] / b
where the first parameter a = 1 and the second parameter b = 2, and u is the cross-entropy loss between the true label y of a training sample and its predicted label ŷ; a sketch of this loss follows.
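A minimal PyTorch sketch of the asymmetric loss is given below; it assumes the AFR is a binary classifier and that u is the per-sample binary cross-entropy between the true and predicted labels, following the definition above. The interface (logits in, scalar loss out) is an assumption for illustration.

import torch
import torch.nn.functional as F

def asymmetric_loss(logits, labels, a=1.0, b=2.0):
    # logits: raw AFR outputs, shape (batch,); labels: 1 for EDP frames, 0 for HDP frames
    u = F.binary_cross_entropy_with_logits(logits, labels.float(), reduction="none")
    # L_q(u) = [(a+1)^b - (a+u)^b] / b, implemented exactly as written in the text
    return (((a + 1.0) ** b - (a + u) ** b) / b).mean()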
The identification method in step 1-3 comprises the following steps:
the input of the decoupling self-supervision video camouflage and salient object detection model is a video sequence X_R of T_a frames and the corresponding optical flow sequence X_F, where T_a is the number of input frames and H and W are the height and width of the input frames;
the frames containing sufficient motion information and the frames with insufficient motion information are selected using the adaptive frame routing mechanism AFR, where T_a = T_m + T_c; T_m denotes the number of frames with sufficient motion information and T_c denotes the number of frames with insufficient motion information; a routing sketch follows.
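A short sketch of this routing step in PyTorch is given below; the AFR interface (an optical flow map in, a probability of sufficient motion out), the 3-channel flow rendering and the 0.5 threshold are assumptions for illustration.

import torch

@torch.no_grad()
def route_frames(afr, rgb_frames, flow_frames, threshold=0.5):
    # rgb_frames : (T_a, 3, H, W) input video frames
    # flow_frames: (T_a, 3, H, W) corresponding optical flow maps (rendered as images)
    scores = torch.sigmoid(afr(flow_frames)).squeeze(-1)   # per-frame sufficiency score
    sufficient = scores >= threshold                        # True for the T_m motion-sufficient frames
    # motion-sufficient flow maps go to the MS branch, the remaining RGB frames go to the CS branch
    return flow_frames[sufficient], rgb_frames[~sufficient], sufficient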
The construction method of the motion segmentation network and the image segmentation network in the step 1 comprises the following steps:
step 1-4, constructing a motion segmentation network MS; for segmenting foreground objects from the motion representation;
step 1-5, constructing an image segmentation network CS; for segmenting foreground objects from RGB images.
The motion segmentation network MS described in steps 1-4 comprises three components: a convolutional neural network (CNN) encoder for extracting a feature representation; a generation model for generating foreground and background representations; and a CNN decoder for decoding the foreground and background representations to the final output;
let X_f be a single optical flow map; first, the optical flow map X_f is sent to a CNN encoder Φ_enc, which outputs a low-resolution feature F ∈ R^(H_0×W_0×D), where H_0 and W_0 represent the spatial dimensions of the output feature and D represents the channel size;
for this feature F, the query vector Z_q^(t) is updated a total of T times, where Z_q^(t) represents the query vector after the t-th update and q ∈ {0, 1} is the category associated with the query embedding, 0 representing the background and 1 representing the foreground; the query vectors are learnable and are initialized with random weights drawn from a Gaussian distribution, where μ and σ are the mean and variance of the Gaussian distribution and d is the size of the weight vector; Z^(t) ∈ R^(2×d) denotes the query vectors of all categories; in the subsequent process, Z_0^(t) and Z_1^(t) are treated as a whole Z^(t) and updated simultaneously;
the query vector Z^(t+1) is updated using the feature F and Z^(t);
first, a 1×1 convolutional layer is used to reduce the channels of F and flatten its spatial dimensions, yielding a feature F', whose feature length is L = H_0 × W_0; meanwhile, a position vector PE is added to F' to enhance the extraction of spatial information, giving a new feature representation that augments F' with the position vector; then two multi-layer perceptron (MLP) layers are used, each consisting of three fully connected layers and a rectified linear unit layer, to compute the query value and the key value of a self-attention mechanism;
the attention A^(t) is obtained through a normalized exponential (Softmax) function, and the attention mechanism computes a weighted sum U^(t) of the features over the spatial dimension;
the query vector Z^(t) is finally updated by a gated recurrent unit GRU as:
Z^(t+1) = GRU(U^(t), Z^(t))
where U^(t) and Z^(t) are the input state and the hidden state; the generation model is iterated 3 times, and the output is O = {O_f, O_b}, where O_f represents the foreground query vector and O_b represents the background query vector; in the decoding process, the two vectors are broadcast onto a spatial position-coded two-dimensional grid; a sketch of this query update loop follows.
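The update loop described above resembles a two-query (background/foreground) attention module refined by a GRU. The PyTorch sketch below illustrates one plausible form of the T update iterations; the layer sizes, the exact MLP layout and the scaling inside the softmax are assumptions and not the patent's exact implementation.

import torch
import torch.nn as nn

class QueryUpdater(nn.Module):
    # Two learnable queries (q = 0 background, q = 1 foreground) attend over the
    # flattened, position-augmented encoder feature and are refined T times with a GRU.
    def __init__(self, feat_dim=256, d=128, iters=3):
        super().__init__()
        self.iters = iters
        self.queries = nn.Parameter(torch.randn(2, d) * 0.02)      # Z^(0), Gaussian random init
        self.reduce = nn.Conv2d(feat_dim, d, kernel_size=1)        # 1x1 conv channel reduction
        def mlp():
            return nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.mlp_q, self.mlp_k = mlp(), mlp()                      # MLPs for query and key
        self.proj_v = nn.Linear(d, d)
        self.gru = nn.GRUCell(d, d)

    def forward(self, feat, pos):
        # feat: (B, feat_dim, H0, W0) encoder output F; pos: (B, d, H0, W0) position encoding PE
        b = feat.size(0)
        f = (self.reduce(feat) + pos).flatten(2).transpose(1, 2)   # F' of shape (B, L, d), L = H0*W0
        z = self.queries.unsqueeze(0).expand(b, -1, -1)            # Z^(t), (B, 2, d)
        for _ in range(self.iters):
            q = self.mlp_q(z)                                      # query from Z^(t)
            k = self.mlp_k(f)                                      # key from F'
            attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # A^(t)
            u = attn @ self.proj_v(f)                              # weighted sum U^(t), (B, 2, d)
            z = self.gru(u.reshape(-1, u.size(-1)),
                         z.reshape(-1, z.size(-1))).view(b, 2, -1) # Z^(t+1) = GRU(U^(t), Z^(t))
        return z                                                   # O = {O_b, O_f}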
finally, the CNN decoder Φ_dec decodes {O_f, O_b} separately back to the original resolution:
the decoding produces the reconstructed foreground optical flow field together with its MASK image α_fore and the reconstructed background optical flow field together with its MASK image α_back; the final reconstructed optical flow map is the recombination of the two flow fields weighted by their corresponding masks;
here, a Softmax is applied to {α_fore, α_back} to ensure α_fore + α_back = 1; the MS branch completes training in a self-supervised manner, and the loss function comprises a reconstruction loss L_rec, which measures the difference between the input optical flow map and its reconstruction, and an entropy regularization loss L_ent;
L_ent, whose purpose is to make the masks binary so that the final segmentation result can be obtained, is defined as:
L_ent = -(α_fore·log(α_fore) + α_back·log(α_back))
when α_fore and α_back clearly represent the foreground and the background, i.e. the masks are in one-hot form, L_ent is zero; when α_fore and α_back cannot separate the foreground and the background and their values are close to each other, L_ent is at its maximum;
finally, the result O_F corresponding to the sequence X_F is obtained through this training procedure.
The input to the image segmentation network CS described in step 1-5 is the video sequence X_R, and the output O_R is obtained through a single-image camouflaged object detection method (refer to: Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan, "Simultaneously Localize, Segment and Rank the Camouflaged Objects," in CVPR, 2021) or a single-image salient object detection method (refer to: Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, and Zhuowen Tu, "Deeply Supervised Salient Object Detection with Short Connections," TPAMI, 2019).
The method for training the decoupling self-supervision video camouflage and salient object detection model in step 2 comprises the following steps:
step 2-1, data preprocessing: performing data enhancement, such as random flipping and random cropping, on the camouflage and salient object training set to be input into the decoupling self-supervision video camouflage and salient object detection model;
step 2-2, training the AFR classifier with data generated by the pseudo motion generation module PMG, so that it can distinguish whether the motion information contained in a piece of video is sufficient;
performing self-supervised training on the motion segmentation network MS, so that complete and accurate object detection can be performed from the optical flow map;
using the generation results of the motion segmentation network MS to supervise the training of the image segmentation network CS, so that the image segmentation network CS can perform complete and accurate object detection from RGB images;
the results of the motion segmentation network MS and the image segmentation network CS mutually cross-supervise each other, so that the network gradually generates complete and accurate camouflage and saliency object maps; after the network is trained repeatedly for multiple rounds, the final network model parameters are saved. A sketch of one such training iteration follows.
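The sketch below illustrates one training iteration with this cross-supervision in PyTorch. The model interfaces (the MS returning its mask and losses, the CS returning a mask) and the use of binary cross-entropy on the exchanged pseudo labels are assumptions; the reverse direction, in which CS predictions supervise the MS, is symmetric and omitted for brevity.

import torch
import torch.nn.functional as F

def train_step(afr, ms, cs, rgb, flow, opt_ms, opt_cs):
    # Route frames with the frozen, pre-trained AFR classifier
    with torch.no_grad():
        sufficient = torch.sigmoid(afr(flow)).squeeze(-1) >= 0.5

    # 1) self-supervised update of the motion segmentation network MS
    mask_ms, l_rec, l_ent = ms(flow[sufficient])
    loss_ms = l_rec + l_ent
    opt_ms.zero_grad(); loss_ms.backward(); opt_ms.step()

    # 2) the MS predictions act as pseudo labels that supervise the image segmentation network CS
    pseudo = (mask_ms.detach() > 0.5).float()
    mask_cs = cs(rgb[sufficient])
    loss_cs = F.binary_cross_entropy(mask_cs, pseudo)
    opt_cs.zero_grad(); loss_cs.backward(); opt_cs.step()
    return loss_ms.item(), loss_cs.item()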
The method for inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for detection in step 3 comprises: inputting the target image to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for inference to obtain the corresponding camouflage and salient object segmentation images.
Beneficial effects:
The invention proposes a decoupling idea: instead of directly fusing context information and motion information to complete detection, two independent networks are designed, which use the optical flow information and the context information of a video sequence respectively to complete detection. Meanwhile, in order to further widen the usage scenarios of the network, a self-supervised network model is designed, so that the proposed network model can complete the detection task without annotated data.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic process flow diagram of the present invention.
Fig. 2 is a schematic diagram of a pseudo motion generation process.
FIG. 3 is a schematic diagram of the detection results of the present invention.
Detailed Description
A video camouflage and salient object detection method based on decoupling self-supervision is disclosed, as shown in figure 1, and comprises the following steps:
step 1, constructing an adaptive frame routing mechanism (AFR): building a self-adaptive frame routing mechanism, and distinguishing which video frames have insufficient motion information and which video frames have sufficient motion information in a section of video;
step 2, constructing a motion segmentation network and an image segmentation network: the AFR mechanism of step 1 is used to select the video frames with sufficient motion information and send them into the motion segmentation network; at the same time, the AFR is used to select the video frames with insufficient motion information and send them into the image segmentation network; finally, the results of the motion segmentation network and the image segmentation network are spliced together to obtain the detection results for the video frames;
step 3, the decoupling self-supervision video camouflage and salient object detection model provided by the method comprises a training stage and a testing stage: in the model training stage, a camouflage and salient object training video set is input into the decoupled self-supervised network, so that the frame routing mechanism (AFR), the motion segmentation network and the image segmentation network are trained; in the model testing stage, a video containing the salient/camouflaged objects to be detected is input into the trained model to obtain the corresponding salient/camouflaged object detection results;
wherein, step 1 includes the following steps:
step 1-1, generating a training sample for training AFR.
Optical flow information is used herein to represent the motion information of each sample: when the motion information of a frame is sufficient, its corresponding optical flow map is easily decomposed (EDP); if the motion information of a frame is insufficient, its corresponding optical flow map is hard to decompose (HDP). Therefore, in order to train the AFR, the corresponding training samples need to be created. For EDP frames, the optical flow maps corresponding to the training videos are taken directly; HDP frames are generated by a new pseudo motion generation module (PMG), whose process is shown in fig. 2. The specific flow of the PMG is as follows:
A static picture and a speed parameter s are selected, and a sequence u ∈ R^(N×L×L) is cropped from the static picture, N and L being respectively the number of cropped frames and the crop size. The speed parameter s = (v_x, v_y) determines the moving distance of the cropped frames in the horizontal and vertical directions; v_x and v_y are selected from the set S = {-K, ..., -1, 0, 1, ..., K}, where K represents the maximum speed.
For an input image of size H × W, the moving distance D = (D_x, D_y) is determined by the speed parameter s, D_x being the displacement in the horizontal x-direction and D_y the displacement in the vertical y-direction.
then, a clipping start point p is randomly selected on the image start If the clipping end point is p end =p start + D. Cutting out the sequence u, and covering the sequence u on the p of the original image start Obtaining the sequence u' epsilon R at the position N×H×W . Finally, the image sequence u ' is converted into an optical flow sequence u ' using RAFT ' f Thereby obtaining an HDP frame. Finally, we use asymmetric loss training AFR, which is defined as:
L q (u)=[(a+1) b -(a+u) b ]/b
where a =1 and b =2,u is the cross-entropy loss, which can be expressed as:
Figure BDA0003882111240000084
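The patent names RAFT as the optical flow estimator; the sketch below uses the RAFT implementation shipped with torchvision to turn consecutive frames of the synthesized sequence u' into the flow sequence u'_f. This particular library choice, the preprocessing call and the tensor shapes are assumptions; any RAFT implementation would serve.

import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

@torch.no_grad()
def flow_sequence(frames):
    # frames: (N, 3, H, W) float tensor in [0, 1]; H and W should be divisible by 8 for RAFT
    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()
    img1, img2 = weights.transforms()(frames[:-1], frames[1:])   # RAFT-specific normalization
    flows = model(img1, img2)                                    # list of iterative flow refinements
    return flows[-1]                                             # final estimate: (N-1, 2, H, W) flow maps u'_f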
and 1-2, identifying whether the motion information of each frame in a video is sufficient by using the trained AFR.
The input of the decoupled self-supervised network model is a video sequence X_R of T_a frames and the corresponding optical flow sequence X_F, where T_a is the number of input frames and H, W are the height and width of the input frames. The AFR is used to select the frames that contain sufficient motion information and the frames with insufficient motion information, where T_a = T_m + T_c.
The step 2 comprises the following steps:
step 2-1, constructing a motion segmentation network MS for segmenting foreground objects from motion representations;
the motion split network MS comprises three components: a CNN encoder extracts feature representation; 2. a model is generated, and the model is generated,
for generating foreground and background representations; a CNN decoder to decode the foreground and background representations, respectively, to a final output. To simplify the explanation of the procedure, we show a single light ray diagram X f For example. First, a light beam pattern X f Sent to a CNN encoder
Code device phi enc It outputs a low resolution feature:
Figure BDA0003882111240000096
wherein H 0 And W 0 Respectively representing the spatial dimensions of the output features. D denotes the channel size.
For this feature F, we propose to generate model update query vectors
Figure BDA0003882111240000097
For a time of T, wherein
Figure BDA0003882111240000098
Represents the query vector after the t-th update, q ∈ [0,1 ∈]Is a category relevant to the query embedding. "0" represents the background and "1" represents the foreground. The query vector is learnable and is initialized with random weights:
Figure BDA0003882111240000099
wherein, mu and sigma are mean and variance of Gaussian distribution, and d is the magnitude of weight vector. By Z (t) ∈R 2×d To represent query vectors of all categories. In the course of the subsequent processes, it is,
Figure BDA00038821112400000910
and
Figure BDA00038821112400000911
will be taken as a whole Z (t) And is updated at the same time.
Query vector Z (t+1) Using features F and Z (t) And (6) updating. First, a 1 × 1 convolutional layer is used to reduce the channel of F and flatten the spatial dimension of F, resulting in a characteristic F':
Figure BDA00038821112400000912
wherein, L = H 0 ×W 0 . At the same time, the position vector is added to F' to enhance the extraction of spatial information. Namely, it is
Figure BDA00038821112400000913
Figure BDA00038821112400000914
Where PE is a location vector. Then two MLP layers are used
Figure BDA00038821112400000915
And
Figure BDA00038821112400000916
each layer has three FC layers and a ReLU layer. The design is to make the self-attribute mechanism have higher flexibility when calculating the query and the key:
Figure BDA00038821112400000917
obtaining attention A through Softmax function (t)
Figure BDA0003882111240000101
The Attention mechanism calculates a weighted sum of features in the spatial dimension by:
Figure BDA0003882111240000102
query vector Z (t) Finally updated by GRU as:
Z (t+1) =GRU(U (t) ,Z (t) )
where U^(t) and Z^(t) are the input state and the hidden state. The generation model is iterated 3 times, and the output is O:
O = {O_f, O_b}, where O_f represents the foreground query vector and O_b represents the background query vector. During decoding, the two vectors are broadcast onto a two-dimensional grid with a learnable spatial position code.
Finally, the CNN decoder Φ_dec decodes {O_f, O_b} separately to the original resolution, producing the reconstructed foreground and background optical flow fields together with the corresponding MASK images α_fore and α_back; the final reconstructed optical flow map is the recombination of the two flow fields weighted by their masks.
Here, a Softmax is applied to {α_fore, α_back} to ensure α_fore + α_back = 1. So that the MS branch can complete training in a self-supervised manner, the loss function comprises the reconstruction loss L_rec between the input optical flow map and its reconstruction and the entropy regularization loss L_ent.
The purpose of L_ent is to make the masks binary, from which the final segmentation result is obtained. L_ent is defined as:
L_ent = -(α_fore·log(α_fore) + α_back·log(α_back))
As can be seen from this loss, when the masks are in one-hot form L_ent is zero, and when the foreground and background probabilities are equal L_ent is at its maximum. Finally, in this way the result O_F corresponding to the sequence X_F can be obtained; a sketch of the two MS losses follows.
And 2-2, constructing an image segmentation network CS for segmenting the foreground object from the RGB image.
The input to the CS is the video sequence X_R, and the output O_R is obtained by an existing single-image camouflaged object detection method or single-image salient object detection method.
Step 3 comprises a training phase and a testing phase:
step 3-1, data preprocessing: data enhancement such as random flipping and random cropping is performed on the camouflage object training set and the saliency object training set to be input into the decoupled self-supervised network;
step 3-2, first, the AFR classifier is trained with data generated by the PMG module, so that it can distinguish whether the motion information contained in a piece of video is sufficient; second, the MS module is trained in a self-supervised manner, so that complete and accurate object detection can be performed from the optical flow map; third, the CS module is supervised and trained with the generation results of the MS module, so that the CS module can perform complete and accurate object detection from the RGB images; fourth, the results of the MS and the CS mutually cross-supervise each other, so that the network gradually generates complete and accurate camouflage and saliency object maps, and the final network model parameters are saved after the network is trained repeatedly for multiple rounds;
step 3-3, testing the model framework: the images are input into the trained decoupled self-supervised network for inference to obtain the corresponding camouflage and salient object segmentation images.
Example:
a video camouflage and salient object detection method based on decoupling self-supervision is implemented according to the following steps as shown in figure 1:
1. Constructing the decoupled self-supervised network G:
inputting: a video collection of camouflaged or salient objects.
And (3) outputting: a corresponding disguised or salient object segmentation image, and a loss function.
1.1 Constructing the decoupled self-supervised network model framework and extracting the optical flow;
the network input of the decoupled self-supervised network model framework is a video sequence X_R of T_a frames and the corresponding optical flow sequence X_F; X_F is extracted by the optical flow estimation algorithm RAFT; T_a is the number of input frames, and H, W are the height and width of the input frames.
1.2 An adaptive frame routing mechanism AFR is designed to distinguish which frames have sufficient optical flow information and which frames have insufficient optical flow information. The frames with sufficient optical flow information are sent to the motion segmentation network (MS), and the corresponding segmentation results are obtained using the optical flow information. The frames with insufficient optical flow information are sent to the image segmentation network (CS), and the corresponding segmentation results are obtained using the RGB image information. The segmentation results are then used to calculate the loss function and perform parameter optimization.
2. Training the overall framework;
training of the two-branch deep convolutional neural network comprises a data preprocessing stage, a model framework training stage and a testing stage.
3.1 Data preprocessing;
the input video sets of camouflaged and salient objects are adjusted by operations such as stretching and flipping, and the adjusted video sets are input into the decoupled self-supervised network.
Inputting: video collections of camouflaged and salient objects.
And (3) outputting: a video collection of data enhanced camouflaged and salient objects.
Geometric augmentation: changing the image geometry by methods such as translation, rotation and shearing can enhance the generalization capability of the model;
3.2 model framework training
Inputting: data enhanced video collection of camouflaged and salient objects
And (3) outputting: video set segmentation results of camouflaged and salient objects and a loss function.
During training, a mini-batch stochastic gradient descent (SGD) optimization algorithm may be used, with a batch size of 32, a momentum of 0.9, and a weight decay of 1e-5. The learning rate is set to 1e-4 and the maximum number of epochs is set to 100. The training images are resized to 352 × 352 as input to the entire network. A configuration sketch follows.
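A sketch of this training configuration in PyTorch is given below; the model and dataset objects are placeholders.

import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((352, 352)),   # training images are resized to 352 x 352
    transforms.ToTensor(),
])

def build_optimizer(model):
    # mini-batch SGD: lr 1e-4, momentum 0.9, weight decay 1e-5
    return torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=1e-5)

# loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)   # batch size 32
# for epoch in range(100):                                                     # at most 100 epochs
#     ...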
3.3 testing the model framework;
inputting: a video set of camouflaged and salient objects;
and (3) outputting: corresponding camouflage and saliency object cut images;
the model detection effect in the present invention is shown in fig. 3, which shows a total of 6 video sequences. Wherein, sequence 1 to sequence 3 represent saliency detection video sequences, and sequence 4 to sequence 6 represent masquerading detection video sequences. For each sequence, the first line represents the input video sequence, the second line represents the segmentation result, and the third line represents the optical flow information for each frame of video. The optical flow information of the first three columns is sufficient, the model completes the segmentation in the MS by using the optical flow information, the video information of the second two columns is insufficient in motion, and the model completes the segmentation in the CS by using the RGB picture information.
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit, where the computer storage medium is capable of storing a computer program, and the computer program, when executed by the data processing unit, may execute the inventive content of the video camouflaging and salient object detection method based on decoupled self-supervision and some or all of the steps in each embodiment provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.
It is clear to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and its corresponding general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention, or the portions thereof that contribute to the prior art, may be embodied in the form of a computer program, that is, a software product, which may be stored in a storage medium and includes several instructions for enabling a device including a data processing unit (which may be a personal computer, a server, a single-chip microcontroller MCU, a network device, or the like) to execute the methods described in the embodiments of the present invention or in some portions of the embodiments.
The present invention provides an idea and a method for video camouflage and salient object detection based on decoupling self-supervision; there are many methods and ways to implement this technical solution, and the above description is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, several improvements and embellishments can be made without departing from the principle of the present invention, and these should also be regarded as falling within the protection scope of the present invention. All components not specified in this embodiment can be implemented by the prior art.

Claims (10)

1. A video camouflage and salient object detection method based on decoupling self-supervision is characterized by comprising the following steps:
step 1, constructing a decoupling self-supervision video camouflage and salient object detection model; the model comprises: an adaptive frame routing mechanism, a motion segmentation network and an image segmentation network;
the self-adaptive frame routing mechanism is used for carrying out sufficiency judgment on motion information of video frames in a target video;
sending the video frame with sufficient motion information selected by the self-adaptive frame routing mechanism into the motion segmentation network for processing; sending the video frames with insufficient motion information selected by the self-adaptive frame routing mechanism into the image segmentation network for processing;
combining the processing results of the motion segmentation network and the image segmentation network together to obtain the corresponding detection result of the video frame in the target video;
step 2, training the decoupling self-supervision video camouflage and salient object detection model: inputting a camouflage and salient object training video set into the decoupling self-supervision video camouflage and salient object detection model, training the adaptive frame routing mechanism, the motion segmentation network and the image segmentation network, and performing iterative optimization on the decoupling self-supervision video camouflage and salient object detection model;
and 3, inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for detection, and completing the decoupling self-supervision-based video camouflage and salient object detection.
2. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 1, wherein the method for training and constructing the adaptive frame routing mechanism AFR in step 1 comprises the following steps:
step 1-1, generating a training sample for training an adaptive frame routing mechanism (AFR);
step 1-2, training an adaptive frame routing mechanism (AFR);
and 1-3, identifying whether the motion information of each frame in the target video is sufficient by using a trained adaptive frame routing mechanism AFR.
3. The method for detecting video camouflaging and salient objects based on decoupled self-supervision as claimed in claim 2, wherein the training samples in step 1-1 comprise: easily decomposed optical flow map EDP frames and hard-to-decompose optical flow map HDP frames;
wherein, for the easily decomposed EDP frames, the optical flow maps corresponding to the videos in the training set are taken directly; the hard-to-decompose optical flow map HDP frames are generated by a pseudo motion generation module PMG, and the generation process comprises the following steps:
selecting a static picture as the input image, and cropping from it a sequence u' ∈ R^(N×L×L), N and L being respectively the number of cropped frames and the crop size; a speed parameter s = (v_x, v_y) determines the moving distance of the cropped frames in the horizontal and vertical directions; the horizontal moving speed v_x and the vertical moving speed v_y are selected from the set S = {-K, ..., -1, 0, 1, ..., K}, where K represents the maximum speed;
for an input image of size H × W, the moving distance D = (D_x, D_y) is determined by the speed parameter s, where D_x represents the displacement in the horizontal x-direction and D_y represents the displacement in the vertical y-direction;
a cropping start point p_start is randomly selected on the input image, and the cropping end point is p_end = p_start + D; cropping yields the image sequence u'; finally, the image sequence u' is converted into an optical flow sequence u'_f, and the hard-to-decompose optical flow map HDP frames are obtained.
4. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 3, wherein the method for training adaptive frame routing mechanism AFR in step 1-2 comprises:
training the adaptive frame routing mechanism AFR using an asymmetric loss, which is defined as:
L_q(u) = [(a+1)^b - (a+u)^b] / b
where the first parameter a = 1 and the second parameter b = 2, and u is the cross-entropy loss between the true label y of a training sample and its predicted label ŷ.
5. The video camouflaging and salient object detection method based on decoupled self-supervision as claimed in claim 4, wherein the identification method in steps 1-3 comprises:
the input of the decoupling self-supervision video camouflage and salient object detection model is a video sequence X_R of T_a frames and the corresponding optical flow sequence X_F, where T_a is the number of input frames and H and W are the height and width of the input frames;
the frames containing sufficient motion information and the frames with insufficient motion information are selected using the adaptive frame routing mechanism AFR, where T_a = T_m + T_c; T_m denotes the number of frames with sufficient motion information and T_c denotes the number of frames with insufficient motion information.
6. The method for detecting video camouflaging and salient objects based on decoupling self-supervision as claimed in claim 5, wherein the method for constructing the motion segmentation network and the image segmentation network in step 1 comprises the following steps:
step 1-4, constructing a motion segmentation network MS; for segmenting foreground objects from the motion representation;
step 1-5, constructing an image segmentation network CS; for segmenting foreground objects from RGB images.
7. The method for video camouflaging and salient object detection based on decoupled self-supervision according to claim 6, wherein the motion segmentation network MS in steps 1-4 comprises three components: a convolutional neural network (CNN) encoder for extracting a feature representation; a generation model for generating foreground and background representations; and a CNN decoder for decoding the foreground and background representations to the final output;
let X_f be a single optical flow map; first, the optical flow map X_f is sent to a CNN encoder Φ_enc, which outputs a low-resolution feature F ∈ R^(H_0×W_0×D), where H_0 and W_0 represent the spatial dimensions of the output feature and D represents the channel size;
for this feature F, the query vector Z_q^(t) is updated a total of T times, where Z_q^(t) represents the query vector after the t-th update and q ∈ {0, 1} is the category associated with the query embedding, 0 representing the background and 1 representing the foreground; the query vectors are initialized with random weights drawn from a Gaussian distribution, where μ and σ are the mean and variance of the Gaussian distribution and d is the size of the weight vector; Z^(t) ∈ R^(2×d) denotes the query vectors of all categories; in the subsequent process, Z_0^(t) and Z_1^(t) are treated as a whole Z^(t) and updated simultaneously;
the query vector Z^(t+1) is updated using the feature F and Z^(t);
first, a 1×1 convolutional layer is used to reduce the channels of F and flatten its spatial dimensions, yielding a feature F', whose feature length is L = H_0 × W_0; meanwhile, a position vector PE is added to F' to enhance the extraction of spatial information, giving a new feature representation that augments F' with the position vector; then two multi-layer perceptron MLP layers are used, each consisting of three fully connected layers and a rectified linear unit layer, to compute the query value and the key value of a self-attention mechanism;
the attention A^(t) is obtained through a normalized exponential (Softmax) function, and the attention mechanism computes a weighted sum U^(t) of the features over the spatial dimension;
the query vector Z^(t) is finally updated by the gated recurrent unit GRU as:
Z^(t+1) = GRU(U^(t), Z^(t))
where U^(t) and Z^(t) are the input state and the hidden state; the generation model is iterated 3 times, and the output is O = {O_f, O_b}, where O_f represents the foreground query vector and O_b represents the background query vector; during decoding, the two vectors are broadcast onto a spatial position-coded two-dimensional grid;
finally, the CNN decoder Φ_dec decodes {O_f, O_b} separately to the original resolution:
the decoding produces the reconstructed foreground optical flow field and the reconstructed background optical flow field; α_fore is the MASK image corresponding to the foreground optical flow field and α_back is the MASK image corresponding to the background optical flow field; the final reconstructed optical flow map is the recombination of the two flow fields weighted by their corresponding masks;
wherein a Softmax is applied to {α_fore, α_back} to ensure α_fore + α_back = 1; the MS branch completes training in a self-supervised manner, and the loss function comprises the reconstruction loss L_rec between the input optical flow map and its reconstruction and the entropy regularization loss L_ent;
L_ent is defined as:
L_ent = -(α_fore·log(α_fore) + α_back·log(α_back))
when α_fore and α_back clearly represent the foreground and the background, i.e. the masks are in one-hot form, L_ent is zero; when α_fore and α_back cannot separate the foreground and the background and their values are close to each other, L_ent is at its maximum;
finally, the result O_F corresponding to the sequence X_F is obtained through the above training procedure.
8. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 7, wherein the input of the image segmentation network CS in step 1-5 is the video sequence X_R, and the output O_R is obtained through a single-image camouflaged object detection method or a single-image salient object detection method.
9. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 8, wherein the method for training the video masquerading and salient object detection model based on decoupling self-supervision in step 2 comprises:
step 2-1, data preprocessing: performing data enhancement, such as random flipping and random cropping, on the camouflage and salient object training set to be input into the decoupling self-supervision video camouflage and salient object detection model;
step 2-2, training the AFR classifier with data generated by the pseudo motion generation module PMG, so that it can distinguish whether the motion information contained in a piece of video is sufficient;
performing self-supervised training on the motion segmentation network MS, so that complete and accurate object detection can be performed from the optical flow map;
using the generation results of the motion segmentation network MS to supervise the training of the image segmentation network CS, so that the image segmentation network CS can perform complete and accurate object detection from RGB images;
the results of the motion segmentation network MS and the image segmentation network CS mutually cross-supervise each other, so that the network gradually generates complete and accurate camouflage and saliency object maps, and the final network model parameters are saved after the network is trained repeatedly for multiple rounds.
10. The method for detecting video masquerading and salient objects based on decoupling self-supervision as claimed in claim 9, wherein the method for inputting the target video to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for detection in step 3 comprises: inputting the target image to be detected into the trained decoupling self-supervision video camouflage and salient object detection model for inference to obtain the corresponding camouflage and salient object segmentation images.
CN202211232708.0A 2022-10-10 2022-10-10 Video camouflage and salient object detection method based on decoupling self-supervision Pending CN115565108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211232708.0A CN115565108A (en) 2022-10-10 2022-10-10 Video camouflage and salient object detection method based on decoupling self-supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211232708.0A CN115565108A (en) 2022-10-10 2022-10-10 Video camouflage and salient object detection method based on decoupling self-supervision

Publications (1)

Publication Number Publication Date
CN115565108A true CN115565108A (en) 2023-01-03

Family

ID=84745836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211232708.0A Pending CN115565108A (en) 2022-10-10 2022-10-10 Video camouflage and salient object detection method based on decoupling self-supervision

Country Status (1)

Country Link
CN (1) CN115565108A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935189A (en) * 2023-09-15 2023-10-24 北京理工导航控制科技股份有限公司 Camouflage target detection method and device based on neural network and storage medium
CN116935189B (en) * 2023-09-15 2023-12-05 北京理工导航控制科技股份有限公司 Camouflage target detection method and device based on neural network and storage medium

Similar Documents

Publication Publication Date Title
US11200424B2 (en) Space-time memory network for locating target object in video content
CN106960206B (en) Character recognition method and character recognition system
CN109711463B (en) Attention-based important object detection method
CN112750140B (en) Information mining-based disguised target image segmentation method
CN110570433B (en) Image semantic segmentation model construction method and device based on generation countermeasure network
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN112884802B (en) Attack resistance method based on generation
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN116311214B (en) License plate recognition method and device
Chen et al. Finding arbitrary-oriented ships from remote sensing images using corner detection
CN113065550A (en) Text recognition method based on self-attention mechanism
CN114037640A (en) Image generation method and device
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
US20230154139A1 (en) Systems and methods for contrastive pretraining with video tracking supervision
CN114140831B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN115565108A (en) Video camouflage and salient object detection method based on decoupling self-supervision
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN115861756A (en) Earth background small target identification method based on cascade combination network
Hughes et al. A semi-supervised approach to SAR-optical image matching
CN116665114A (en) Multi-mode-based remote sensing scene identification method, system and medium
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network
CN113780241B (en) Acceleration method and device for detecting remarkable object
CN115965968A (en) Small sample target detection and identification method based on knowledge guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination