CN113269021B - Unsupervised video object segmentation method based on a local-global memory mechanism - Google Patents

Unsupervised video object segmentation method based on a local-global memory mechanism

Info

Publication number
CN113269021B
CN113269021B
Authority
CN
China
Prior art keywords
global
memory
video
global memory
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110293554.5A
Other languages
Chinese (zh)
Other versions
CN113269021A (en)
Inventor
段立娟
恩擎
王文健
张文博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110293554.5A priority Critical patent/CN113269021B/en
Publication of CN113269021A publication Critical patent/CN113269021A/en
Application granted granted Critical
Publication of CN113269021B publication Critical patent/CN113269021B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised video object segmentation method based on a local-global memory mechanism, belonging to the technical field of feature learning and video object segmentation. The method first extracts the embedded features of a pair of frames from the same video. It then selects global memory candidate frames in the video and extracts the global memory candidate features. Each global memory candidate feature serves as a node of a graph convolutional network, which enhances the global memory feature expression. Mutual attention information between the pair of frames is extracted by a local memory module, with the two frames alternately playing the target and search roles in the attention mechanism so that they enhance each other. Finally, a predicted object mask is obtained through a decoder, the loss is computed with the cross-entropy loss, and the whole model is updated to obtain the final segmentation network. By considering the local and the global memory mechanism simultaneously, the invention obtains reliable short-term and long-term inter-frame correlation information at the same time, thereby realizing unsupervised video object segmentation.

Description

Unsupervised video object segmentation method based on a local-global memory mechanism
Technical Field
The invention relates to the fields of deep learning and weakly-supervised video object segmentation, and in particular to a feature expression method for unsupervised video object segmentation that can obtain more accurate segmentation results on video object segmentation datasets.
Background
With the development of visual big data technology, video has become an important medium of information transmission; the information it carries spans both the spatial and the temporal dimension. How to obtain valuable scene object information from this spatio-temporal carrier has become a major issue in the development of computer vision. Existing video object analysis tasks bring convenience and attention to society, but also pose certain challenges. For example, how can foreground objects be segmented with limited category supervision and without online specification of the objects to be segmented, so that the result can be applied to video scene monitoring and parsing tasks? Extracting feature expressions that consider both global and local spatio-temporal cues to enhance the video representation is one of the key ideas for such analysis. Unsupervised video object segmentation, which lacks online guidance information, has been widely studied and has advanced considerably in recent years. Many research institutions, companies and universities are dedicated to solving this problem, which indicates that the task has become a mainstream research topic. The unsupervised video object segmentation task has high application value: in security monitoring it can assist in screening key objects; in traffic hub monitoring it can help focus on significantly moving vehicles; and in autonomous driving it can assist the driving system.
Humans have an excellent visual memory: they can simultaneously memorize the gist of a scene and the details of an image. Besides condensing visual memory into symbolic, memorable representations, humans also discard large amounts of redundant visual memory; in this process, different visual memories are treated unequally. Inspired by cognitive memory models, in which long-term memory can store a great deal of object detail, it is reasonable to combine global memory and local memory in a unified neural network model.
Unsupervised video object segmentation aims to confirm that an object exists in a video sequence and to obtain its corresponding segmentation mask when segmentation labels are available only for the training set. It is one of the most important video tasks. Its greatest challenges are the rapid motion and motion blur that occur in videos, as well as the appearance inconsistencies that exist within a video sequence. This method mainly addresses unsupervised video segmentation under the guidance of training-set segmentation labels only, without the first-frame annotation of the test set. Some related work leverages the temporal links between consecutive frames to accomplish this task. These methods typically use optical flow to build short-term correlation information and recurrent neural networks to build long-term correlation information. Optical flow is usually trained offline on synthetic datasets and can become unreliable when facing the complex motion of real data, while recurrent neural networks are difficult to optimize when modelling the correlation of an entire video sequence. In addition, some related work proposes solving the problem with a Siamese (twin) network that establishes dense correlations between pairs of video frames. Although this class of methods can achieve good results, it lacks global guidance information and degrades when facing large appearance changes.
Based on the above analysis, two observations about real scenes motivate the proposed method: 1) the main objects in a video are constantly present throughout the video sequence; the common object information shared by different frames of the same video gives the model guidance for completing the segmentation, which also prompts us to treat the problem as a mutual-segmentation problem; 2) to understand video information, humans wake up the memories that are most relevant to the current semantics, segment the current video according to prior knowledge, and extract the memory information related to the current frame from historical segmentation information to obtain global guidance. Therefore, in the object segmentation of a video sequence, long-term dependencies can be established for local memory by referencing the common primary-object correlation between different frames of the same video sequence and the current frame. Furthermore, for global memory, the historical memory objects can be expressed by a spatio-temporal graph model in which each node is represented by a historical memory feature. The model proposed by this method is thus inspired by the way biological perception senses its surrounding environment.
Disclosure of Invention
The invention aims to provide an unsupervised video object segmentation method based on a local-global memory mechanism, addressing the defects that existing temporal information is unreliable and hard to converge and that Siamese networks lack global guidance information. The invention computes the correlation of the features of different frames of the same video as local memory information, and stores and selects the features of existing segmentation results as global memory information through a graph convolutional network, so as to emphasize the feature information of the current frame and let the model learn features of different granularities. The local memory module computes the correlation between a pair of frames to obtain local memory information; the global memory module builds a graph neural network from the historical segmentation information of the current video sequence and obtains global memory information by updating the graph neural network. Finally, the current-frame information is enhanced with the local and the global memory simultaneously to obtain the final segmentation result. Compared with related work, the method considers local and global memory mechanisms at the same time and obtains reliable short-term and long-term inter-frame correlation information simultaneously, thereby realizing unsupervised video object segmentation.
The main idea of the method is as follows: first, a pair of video frames is randomly selected from the same video sequence and input into an encoding network to obtain their corresponding embedded features. Then the global candidate memory samples are written into the global memory module, global memory features are extracted by a feature extraction network to form an external global memory table, and the table is input into the constructed graph neural network to obtain the enhanced external global memory features. Globally enhanced features are generated from the pair of embedded features and the external global memory features; these are then input into the local memory module to obtain features enhanced by both global and local memory, which are input into a decoder to obtain the final predicted segmentation result. The cross-entropy loss is used to compute the loss and update the whole model, yielding the final segmentation network.
According to this main idea, the invention comprises a training stage and a testing stage of the unsupervised video object segmentation network.
The training stage comprises the following steps:
Step 1: constructing a dataset
Construct a dataset: take video frames as input, take the segmentation masks corresponding to the video frames as training targets, and build the object segmentation image set corresponding to the training videos;
Step 2: extracting, with an encoder, the embedded feature X_a of the current video frame I_a and the embedded feature X_b of a random frame I_b of the same video;
Step 3: extracting global memory foreground features
Randomly select N frames of the current video as global candidate memories, extract a global memory mask for each and multiply it with the corresponding global candidate memory to obtain the global candidate memory foregrounds; input the global candidate memory foregrounds into a feature extractor to obtain the global candidate memory foreground features;
Step 4: acquiring the global memory graph neural network
Use the global candidate memory foreground features obtained in step 3 as the nodes of the global memory graph neural network and the similarity between nodes as the connection weights between nodes, thereby obtaining the global memory graph neural network;
Step 5: updating the global memory features
Perform feature propagation on the global memory graph neural network constructed in step 4 through a graph convolutional network to obtain the updated global memory features Z_a;
Step 6: reading the global memory features
Perform non-local attention operations between the updated node features Z_a obtained in step 5 and the embedded features X_a, X_b obtained in step 2, respectively, to obtain the global memory enhancement features X_a^gl and X_b^gl;
Step 7: mutually enhancing the global memory enhancement features X_a^gl and X_b^gl of step 6 to obtain the locally-globally enhanced features X_a^lg and X_b^lg;
Step 8: decoding to generate prediction masks
Input the locally-globally enhanced features X_a^lg and X_b^lg obtained in step 7 into a decoder for decoding to obtain the prediction masks Ŷ_a and Ŷ_b;
Step 9: calculating the segmentation loss and updating the segmentation network parameters
Calculate the loss with the prediction masks of step 8 and the segmentation targets, and back-propagate to update the segmentation network weights until convergence, obtaining the unsupervised video object segmentation network;
Testing stage of the unsupervised video object segmentation network
Step 10: outputting the unsupervised video object segmentation result
Take the video frame to be analyzed and any other frame of the same video as input, and repeat steps 2-8 to obtain the final object segmentation result.
Compared with the prior art, the invention has the following obvious advantages and beneficial effects: the invention provides an unsupervised video object segmentation method based on a local-global memory mechanism. The method is fully inspired by the human memory mechanism: it computes the correlation of the features of different frames of the same video as local memory information, and stores and selects the features of existing segmentation results as global memory information through a graph convolutional network, so as to emphasize the feature information of the current frame and let the model learn features of different granularities. Starting from the human ability to memorize gist and detail simultaneously, the method models the memory mechanism from macroscopic to microscopic perspectives, thereby completing the unsupervised video segmentation task.
Drawings
FIG. 1 is a general flow chart of a method according to the present invention;
FIG. 2 is a diagram of the overall architecture of the algorithm according to the present invention;
Detailed Description
To make the objects, technical solutions and advantages of the invention clearer, the invention is further described in detail below with reference to specific examples and the accompanying drawings. The described embodiments are only intended to facilitate an understanding of the invention and are not limiting. FIG. 1 is a flow chart of the method of the present invention; as shown in FIG. 1, the method comprises the following steps:
training phase
Step 1: constructing a dataset
The database used in the implementation of the method is the public video object segmentation benchmark DAVIS-2016. DAVIS-2016 consists of 50 high-quality video sequences with 3455 densely annotated (masked) video frames; 30 sequences serve as the training set and 20 as the test set. The training set of video frames and their corresponding object segmentation labels is denoted {(I_t, Y_t)}, t = 1, …, T, where I_t denotes the RGB image of a video frame and Y_t denotes the segmentation label corresponding to I_t.
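For illustration only, a minimal PyTorch-style dataset sketch of the frame pairing used in steps 1 and 2 is given below; the DAVIS-style directory layout, the file extensions and the single shared transform are assumptions, not part of the original disclosure.

```python
import os
import random
from PIL import Image
from torch.utils.data import Dataset

class PairedVideoDataset(Dataset):
    """Yields (I_a, Y_a, I_b, Y_b): a frame, its mask, and a random frame/mask of the same video."""
    def __init__(self, frame_root, mask_root, transform=None):
        self.by_video = {}                              # video name -> list of frame file names
        self.samples = []                               # flat list of (video, frame file)
        for video in sorted(os.listdir(frame_root)):
            frames = sorted(os.listdir(os.path.join(frame_root, video)))
            self.by_video[video] = frames
            self.samples += [(video, f) for f in frames]
        self.frame_root, self.mask_root, self.transform = frame_root, mask_root, transform

    def _load(self, video, frame_file):
        img = Image.open(os.path.join(self.frame_root, video, frame_file)).convert("RGB")
        msk = Image.open(os.path.join(self.mask_root, video,
                                      frame_file.replace(".jpg", ".png")))
        if self.transform:                              # in practice the mask needs its own transform
            img, msk = self.transform(img), self.transform(msk)
        return img, msk

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video, f_a = self.samples[idx]
        f_b = random.choice(self.by_video[video])       # random frame I_b of the same video
        return (*self._load(video, f_a), *self._load(video, f_b))
```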
Step 2: extracting embedded features of a current video frame and a random frame of the same video
A pair of frames of the same video from the training set constructed in step 1 is denoted I_a and I_b. The encoder (Encoder) is used to extract the embedded features X_a and X_b corresponding to the two frames:
X_a = Encoder(I_a)
X_b = Encoder(I_b)
where I_a, I_b ∈ R^(3×H×W) and X_a, X_b ∈ R^(c×h×w); H and W denote the height and width of the input image, c denotes the number of channels of the embedded feature, and h and w denote the height and width of the embedded feature. The choice of encoder in the present invention is not limited: any convolutional neural network structure can be adopted, and Table 1 is only one implementation choice.
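As an illustration of the encoder of step 2, the sketch below uses a ResNet-50 backbone truncated before pooling; since the disclosure leaves the encoder unrestricted (and Table 1 is not reproduced here), the backbone choice and the input size are assumptions.

```python
import torch
import torchvision

class Encoder(torch.nn.Module):
    """Maps an RGB frame (3, H, W) to an embedded feature map (c, h, w)."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)   # any CNN backbone would do
        self.features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc

    def forward(self, frame):                  # frame: (B, 3, H, W)
        return self.features(frame)            # (B, 2048, H/32, W/32)

encoder = Encoder()
I_a = torch.randn(1, 3, 384, 384)              # current frame (size is an arbitrary choice)
I_b = torch.randn(1, 3, 384, 384)              # random frame of the same video
X_a, X_b = encoder(I_a), encoder(I_b)           # embedded features X_a and X_b
```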
Step 3: extracting global candidate memory foreground features
Select N video frames of the same video sequence as the current video frame I_a, denoted E_a = {I_a^n}, n = 1, …, N, where E_a represents the global candidate memories corresponding to the current video frame I_a. After the embedded feature of each global candidate memory is extracted with the encoder (Encoder), it is input into the global memory decoder D_gl to obtain a global memory mask, which is multiplied element-wise (dot multiplication) with the corresponding global candidate memory to obtain that global candidate memory's foreground. In this implementation the choice of the global memory decoder is not limited: any convolutional neural network structure can be adopted, and Table 2 is only one implementation choice. The N global candidate memory foregrounds are thus obtained. Each element of this set is input into a feature extractor to extract features, yielding the global candidate memory foreground features V_a = {V_a^1, …, V_a^N}, where V_a^n denotes the feature of the foreground of the n-th global candidate memory corresponding to the current video frame I_a; the feature extractor adopts the same network structure as the encoder.
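A hedged sketch of step 3: each candidate frame is masked by the global memory decoder's prediction and re-encoded to obtain its foreground feature. The sigmoid, the bilinear resize and the decoder output resolution are assumptions; encoder, global_decoder and feature_extractor stand for the networks described above.

```python
import torch
import torch.nn.functional as F

def candidate_foreground_features(candidates, encoder, global_decoder, feature_extractor):
    """candidates: (N, 3, H, W) global candidate memory frames of the current video.
    Returns the (N, c, h, w) global candidate memory foreground features."""
    feats = encoder(candidates)                          # embedded features of the candidates
    masks = torch.sigmoid(global_decoder(feats))         # global memory masks, (N, 1, h', w')
    masks = F.interpolate(masks, size=candidates.shape[-2:],
                          mode="bilinear", align_corners=False)
    foreground = candidates * masks                      # element-wise (dot) multiplication
    return feature_extractor(foreground)                 # same structure as the encoder
```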
Step 4: acquiring a global memory map neural network, wherein the global memory map neural network is specifically represented by a node V and a regularized global candidate memory foreground feature similarity matrix;
global memory graph G (V, E), defining nodes as V and defining edges as E; in this figure, each pair of global candidate memory foreground features are connected by an edge, the larger the value of one edge to the other indicates the more similar the meaning of the two features. Current video frame I a The set of foreground features of N global candidate memories is defined asThe similarity between each node is calculated as:
wherein i and j represent V a The ith and jth elements of (a) are included; after obtaining the foreground feature similarity matrix of the global candidate memory, carrying out regularization operation on each row of the foreground feature similarity matrix so that the feature sum connected to the ith element is 1, and specifically carrying out regularization operation by using a softmax method:
wherein,the adjacency matrix, which is regarded as expressing the foreground features of each global candidate memory, can express the relationship between each global candidate memory.
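Step 4 reduces to a pairwise similarity followed by a row-wise softmax; a short sketch, assuming inner-product similarity on flattened node features:

```python
import torch

def build_adjacency(V_a):
    """V_a: (N, d) node features (one per global candidate memory foreground).
    Returns the row-normalized adjacency matrix of shape (N, N)."""
    E_a = V_a @ V_a.t()                  # pairwise inner-product similarity between nodes
    return torch.softmax(E_a, dim=1)     # softmax per row: the weights attached to node i sum to 1

# example: N = 5 candidates with d = 2048 dimensional features
A_tilde = build_adjacency(torch.randn(5, 2048))
```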
Step 5: global memory information update
The nodes V_a and the adjacency matrix Ã_a of the global memory graph neural network are input into a graph convolutional network. The graph convolutional network is formed by stacking multiple layers of nonlinear operations, where the output of one layer serves as the input of the next; each nonlinear operation is specifically
Z_n = ReLU(Ã_a V_n W_n),
where Ã_a denotes the adjacency matrix, W_n denotes the weight of the n-th layer, and ReLU denotes the nonlinear operation. V_n denotes the input of the n-th layer, i.e. the output of the (n−1)-th layer; the input of the first layer is the node set V_a, whose feature dimension is N × d. Z_n denotes the output of the n-th layer, and the last-layer output is denoted Z_a. In this embodiment, two nonlinear layers are stacked.
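A minimal two-layer graph convolution matching the description of step 5, where each layer computes ReLU(Ã V_n W_n); treating the node features as d-dimensional vectors (for instance flattened or pooled foreground features) and the hidden width are assumptions.

```python
import torch

class MemoryGCN(torch.nn.Module):
    """Two stacked layers of Z_{n+1} = ReLU(A_tilde @ V_n @ W_n) over the global memory graph."""
    def __init__(self, d, hidden=512):
        super().__init__()
        self.w1 = torch.nn.Linear(d, hidden, bias=False)   # W_1
        self.w2 = torch.nn.Linear(hidden, d, bias=False)   # W_2

    def forward(self, V_a, A_tilde):             # V_a: (N, d), A_tilde: (N, N)
        Z1 = torch.relu(A_tilde @ self.w1(V_a))  # first propagation layer
        Z_a = torch.relu(A_tilde @ self.w2(Z1))  # second layer: updated global memory features
        return Z_a                                # (N, d)
```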
Step 6: reading global memory features
For the current frame I_a, the feature obtained from the encoding layer is denoted X_a. First, the similarity between X_a and the global candidate memory features Z_a obtained in step 5 is calculated in the embedding space:
F(X_a, Z_a) = θ_gl(sub(X_a)) · φ_gl(sub(Z_a))^T,
where θ_gl and φ_gl denote global memory transfer functions and sub denotes a downsampling method used to reduce computational overhead. The similarity matrix jointly learns the states of X_a and Z_a and the relationship between them. Each row of F(X_a, Z_a) is then regularized with the softmax method. Because the high-dimensional embedded features produce large values in the matrix multiplication, the output of the softmax would otherwise lie in a small-gradient region; to overcome this drawback, F(X_a, Z_a) is first scaled to 1/√c of its original value. Finally, the regularized similarity matrix is used to read the memory content g_gl(Z_a) and, through the transfer function ψ_gl, enhance the embedded feature X_a, obtaining the global memory enhancement feature X_a^gl; ψ_gl and g_gl are likewise global memory transfer functions. The same feature enhancement is also performed on X_b to generate X_b^gl.
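The reads of steps 6 and 7 are both non-local attention operations; the sketch below is one way to realize them, with 1×1-convolution transfer functions, average pooling as the 'sub' downsampling, scaling by 1/√c, and a residual connection back to the query. The residual form, the reshaping of the graph-updated memory Z_a back into feature maps, and the concatenation of the N memory maps into one bank are assumptions, not details taken from the disclosure.

```python
import torch
import torch.nn as nn

class MemoryRead(nn.Module):
    """Non-local attention read used in steps 6 and 7: the query feature map
    attends to a memory feature map and is enhanced by what it reads."""
    def __init__(self, c, stride=2):
        super().__init__()
        self.theta = nn.Conv2d(c, c, 1)    # transfer function theta (query)
        self.phi   = nn.Conv2d(c, c, 1)    # transfer function phi (key)
        self.g     = nn.Conv2d(c, c, 1)    # transfer function g (value)
        self.psi   = nn.Conv2d(c, c, 1)    # transfer function psi (output)
        self.sub   = nn.AvgPool2d(stride)  # 'sub': downsampling to cut the cost of the matrix product

    def forward(self, query, memory):
        # query: (B, c, h, w); memory: (B, c, hm, wm)
        B, c, h, w = query.shape
        q = self.theta(query).flatten(2).transpose(1, 2)         # (B, h*w, c)
        k = self.phi(self.sub(memory)).flatten(2)                # (B, c, m)
        v = self.g(self.sub(memory)).flatten(2).transpose(1, 2)  # (B, m, c)
        attn = torch.softmax((q @ k) / c ** 0.5, dim=-1)         # scaled similarity, softmax per row
        read = (attn @ v).transpose(1, 2).reshape(B, c, h, w)    # content read from memory
        return query + self.psi(read)                            # residual enhancement (assumption)

# Usage sketch for the global read of step 6: the N global memory feature maps Z_a
# of shape (N, c, hm, wm) can be concatenated along the width axis into one bank.
# Z_bank = torch.cat(list(Z_a), dim=-1).unsqueeze(0)   # (1, c, hm, N*wm)
# X_a_gl = MemoryRead(c=2048)(X_a, Z_bank)
```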
Step 7: acquiring and reading local memory features
The global memory enhancement features obtained in step 6 can only enhance the features at a macroscopic level and lack microscopic detail, so local memory features also need to be acquired.
For X_a^gl, first calculate its similarity matrix with X_b^gl:
F(X_a^gl, X_b^gl) = θ_lo(sub(X_a^gl)) · φ_lo(sub(X_b^gl))^T,
where θ_lo and φ_lo denote local memory transfer functions and sub denotes the downsampling method used to reduce computational overhead. Each row of F(X_a^gl, X_b^gl) is then regularized with the softmax method. As in step 6, because the high-dimensional embedded features produce large values in the matrix multiplication and the softmax output would lie in a small-gradient region, F(X_a^gl, X_b^gl) is first scaled to 1/√c of its original value. Finally, the regularized similarity matrix is used to read the content g_lo(X_b^gl) and, through the transfer function ψ_lo, enhance the globally enhanced feature X_a^gl, obtaining the locally-globally enhanced feature X_a^lg; ψ_lo and g_lo denote the local memory transfer functions. The same enhancement with the roles of the two frames exchanged generates X_b^lg.
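Assuming the MemoryRead sketch given under step 6, the mutual enhancement of step 7 is simply the same read with the two globally enhanced features alternately playing the target and search roles; the dummy feature sizes below are placeholders.

```python
import torch

# Reusing the MemoryRead sketch from step 6 (a separate instance stands for theta_lo, phi_lo, psi_lo, g_lo)
local_read = MemoryRead(c=2048)
X_a_gl = torch.randn(1, 2048, 12, 12)   # globally enhanced feature of I_a (output of step 6)
X_b_gl = torch.randn(1, 2048, 12, 12)   # globally enhanced feature of I_b
X_a_lg = local_read(X_a_gl, X_b_gl)     # I_a enhanced by I_b (I_b acts as the memory/"target")
X_b_lg = local_read(X_b_gl, X_a_gl)     # roles swapped: I_b enhanced by I_a
```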
Step 8: decoding to generate a prediction mask
The locally-globally enhanced features X_a^lg and X_b^lg are then input into the memory decoder D_mem to generate the prediction masks:
Ŷ_a = σ(D_mem(X_a^lg)), Ŷ_b = σ(D_mem(X_b^lg)),
where σ denotes the sigmoid function, and Ŷ_a and Ŷ_b denote the predicted mask results of I_a and I_b, respectively. In practice, the memory decoder structure is shown in Table 2.
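A minimal decoder sketch for step 8; the layer widths and the 32× bilinear upsampling back to input resolution are assumptions, since the Table 2 structure is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryDecoder(nn.Module):
    """Decodes a locally-globally enhanced feature map into a full-resolution mask."""
    def __init__(self, c=2048, upscale=32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(c, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1))                       # single-channel mask logit
        self.upscale = upscale

    def forward(self, feat):                           # feat: (B, c, h, w)
        logit = self.head(feat)
        logit = F.interpolate(logit, scale_factor=self.upscale,
                              mode="bilinear", align_corners=False)
        return torch.sigmoid(logit)                    # sigma: prediction mask in [0, 1]
```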
Step 9: calculating segmentation loss and updating segmentation network parameters
The model is trained with the binary cross-entropy loss function (summed over the spatial positions of the masks and over the training pairs):
L = −(1/T) Σ_a [ Y_a log Ŷ_a + (1 − Y_a) log(1 − Ŷ_a) + Y_b log Ŷ_b + (1 − Y_b) log(1 − Ŷ_b) ],
where Y_a ∈ {0,1}^(W×H) and Y_b ∈ {0,1}^(W×H) denote the annotation information corresponding to video frames I_a and I_b, and Ŷ_a and Ŷ_b denote the prediction results obtained for I_a and I_b with the proposed video segmentation network. T denotes the total number of samples in the dataset; a denotes a video frame index, and b = idx(cat(a)) denotes the index of a video frame of the same category (i.e. the same video sequence) as a.
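One training step of step 9 might look as follows, assuming a wrapper model whose forward pass takes the frame pair and returns both predicted masks; the optimizer is left to the caller.

```python
import torch.nn.functional as F

def training_step(model, optimizer, I_a, Y_a, I_b, Y_b):
    """One update with the binary cross-entropy loss of step 9 on both predicted masks."""
    pred_a, pred_b = model(I_a, I_b)                        # predicted masks in [0, 1]
    loss = F.binary_cross_entropy(pred_a, Y_a.float()) \
         + F.binary_cross_entropy(pred_b, Y_b.float())
    optimizer.zero_grad()
    loss.backward()                                         # back-propagate the segmentation loss
    optimizer.step()                                        # update the segmentation network weights
    return loss.item()
```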
Testing stage
Step 10: outputting the non-supervision video target segmentation result
Take the video frame to be analyzed and any other frame of the same video as input, and repeat steps 2-8 to obtain the final object segmentation result.
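A test-time sketch of step 10, under the same assumption of a wrapper model that internally repeats steps 2-8 for the given pair; the 0.5 threshold for binarizing the mask is an assumption.

```python
import torch

@torch.no_grad()
def segment(model, frame, reference_frame, threshold=0.5):
    """Step 10: segment the frame to analyze using any other frame of the same video."""
    model.eval()
    pred, _ = model(frame, reference_frame)   # the wrapper repeats steps 2-8 internally
    return (pred > threshold).float()          # binarized object segmentation mask
```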
As can be seen from Table 3, the method proposed by the invention achieves a better segmentation effect on the video object segmentation dataset than recent state-of-the-art methods.
TABLE 1 (implementation structure of the encoder; contents not reproduced)
TABLE 2 (implementation structure of the decoder; contents not reproduced)
TABLE 3 (segmentation results on the video object segmentation dataset; contents not reproduced)

Claims (7)

1. An unsupervised video object segmentation method based on a local-global memory mechanism, characterized by comprising a training stage and a testing stage of an unsupervised video object segmentation network,
the training stage comprising the following steps:
Step 1: constructing a dataset
Construct a dataset: take video frames as input, take the segmentation masks corresponding to the video frames as training targets, and build the object segmentation image set corresponding to the training videos;
Step 2: extracting, with an encoder, the embedded feature X_a of the current video frame I_a and the embedded feature X_b of a random frame I_b of the same video;
Step 3: extracting global memory foreground features
Randomly select N frames of the current video as global candidate memories, extract a global memory mask for each and multiply it with the corresponding global candidate memory to obtain the global candidate memory foregrounds; input the global candidate memory foregrounds into a feature extractor to obtain the global candidate memory foreground features;
Step 4: acquiring the global memory graph neural network
Use the global candidate memory foreground features obtained in step 3 as the nodes of the global memory graph neural network and the similarity between nodes as the connection weights between nodes, thereby obtaining the global memory graph neural network;
Step 5: updating the global memory features
Perform feature propagation on the global memory graph neural network constructed in step 4 through a graph convolutional network to obtain the updated global memory features Z_a;
Step 6: reading the global memory features
Perform non-local attention operations between the updated node features Z_a obtained in step 5 and the embedded features X_a, X_b obtained in step 2, respectively, to obtain the global memory enhancement features X_a^gl and X_b^gl;
Step 7: mutually enhancing the global memory enhancement features X_a^gl and X_b^gl of step 6 to obtain the locally-globally enhanced features X_a^lg and X_b^lg;
Step 8: decoding to generate prediction masks
Input the locally-globally enhanced features X_a^lg and X_b^lg obtained in step 7 into a decoder for decoding to obtain the prediction masks Ŷ_a and Ŷ_b;
Step 9: calculating the segmentation loss and updating the segmentation network parameters
Calculate the loss with the prediction masks of step 8 and the segmentation targets, and back-propagate to update the segmentation network weights until convergence, obtaining the unsupervised video object segmentation network;
the testing stage of the unsupervised video object segmentation network:
Step 10: outputting the unsupervised video object segmentation result
Take the video frame to be analyzed and any other frame of the same video as input, and repeat steps 2-8 to obtain the final object segmentation result.
2. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein step 3 is specifically as follows:
select N video frames of the same video sequence as the current video frame I_a, denoted E_a = {I_a^n}, n = 1, …, N,
where E_a represents the global candidate memories corresponding to the current video frame I_a; after the embedded feature of each global candidate memory is extracted with the encoder, it is input into the global memory decoder D_gl to obtain a global memory mask, which is multiplied element-wise with the corresponding global candidate memory to obtain that global candidate memory's foreground;
the N global candidate memory foregrounds are thus obtained; each element of this set is input into a feature extractor to extract features, yielding the global candidate memory foreground features V_a = {V_a^1, …, V_a^N}, where V_a^n denotes the feature of the foreground of the n-th global candidate memory corresponding to the current video frame I_a; the feature extractor adopts the same network structure as the encoder.
3. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein step 4 is specifically as follows:
the global memory graph neural network is specifically represented by the node set V and the regularized similarity matrix of the global candidate memory foreground features;
the set of foreground features of the N global candidate memories of the current video frame I_a is defined as V_a = {V_a^1, …, V_a^N}, and the similarity between each pair of nodes is calculated as
E_a(i, j) = V_a^i · (V_a^j)^T,
where i and j index the i-th and j-th elements of V_a; after the foreground feature similarity matrix of the global candidate memories is obtained, each of its rows is regularized so that the weights connected to the i-th element sum to 1, the regularization being performed with the softmax method:
Ã_a(i, j) = exp(E_a(i, j)) / Σ_k exp(E_a(i, k)),
where Ã_a represents the relationship between the global candidate memories.
4. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein step 5 is specifically as follows:
the nodes V_a and the adjacency matrix Ã_a of the global memory graph neural network are input into a graph convolutional network, which is formed by stacking multiple layers of nonlinear operations in which the output of one layer serves as the input of the next; each nonlinear operation is specifically
Z_n = ReLU(Ã_a V_n W_n),
where Ã_a denotes the adjacency matrix, W_n denotes the weight of the n-th layer, ReLU denotes the nonlinear operation, V_n denotes the input of the n-th layer, i.e. the output of the (n−1)-th layer, the input of the first layer is the node set V_a, Z_n denotes the output of the n-th layer, and the last-layer output is denoted Z_a.
5. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein the global memory enhancement feature X_a^gl in step 6 is calculated as follows:
first, the similarity between X_a and the Z_a obtained in step 5 is calculated in the embedding space as
F(X_a, Z_a) = θ_gl(sub(X_a)) · φ_gl(sub(Z_a))^T,
where θ_gl and φ_gl denote global memory transfer functions and sub denotes a downsampling method;
then each row of F(X_a, Z_a) is regularized with the softmax method after F(X_a, Z_a) has been scaled to 1/√c of its original value, where c is the channel number of X_a;
finally, the regularized similarity matrix is used to read g_gl(Z_a) and, through the transfer function ψ_gl, enhance the embedded feature X_a, obtaining the global memory enhancement feature X_a^gl, where ψ_gl and g_gl are likewise global memory transfer functions;
the calculation of X_b^gl is the same, except that X_a in the above calculation is replaced by X_b.
6. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein the locally-globally enhanced feature X_a^lg of step 7 is calculated as follows:
first, the similarity matrix between X_a^gl and X_b^gl is calculated as
F(X_a^gl, X_b^gl) = θ_lo(sub(X_a^gl)) · φ_lo(sub(X_b^gl))^T,
where θ_lo and φ_lo denote local memory transfer functions and sub denotes a downsampling method;
then each row of F(X_a^gl, X_b^gl) is regularized with the softmax method after F(X_a^gl, X_b^gl) has been scaled to 1/√c of its original value, where c is the channel number of X_a^gl;
finally, the regularized similarity matrix is used to read g_lo(X_b^gl) and, through the transfer function ψ_lo, enhance the globally enhanced feature X_a^gl, obtaining X_a^lg, where ψ_lo and g_lo denote the local memory transfer functions;
the locally-globally enhanced feature X_b^lg is calculated in the same way with the roles of the two frames exchanged.
7. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein in step 9 the network is trained with a binary cross-entropy loss function:
L = −(1/T) Σ_a [ Y_a log Ŷ_a + (1 − Y_a) log(1 − Ŷ_a) + Y_b log Ŷ_b + (1 − Y_b) log(1 − Ŷ_b) ],
where Y_a ∈ {0,1}^(W×H) and Y_b ∈ {0,1}^(W×H) denote the annotation information corresponding to video frames I_a and I_b, T denotes the total number of samples in the dataset, a denotes a video frame index, and b = idx(cat(a)) denotes the index of a video frame of the same category (i.e. the same video sequence) as a.
CN202110293554.5A 2021-03-18 2021-03-18 Non-supervision video target segmentation method based on local global memory mechanism Active CN113269021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110293554.5A CN113269021B (en) 2021-03-18 2021-03-18 Non-supervision video target segmentation method based on local global memory mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110293554.5A CN113269021B (en) 2021-03-18 2021-03-18 Non-supervision video target segmentation method based on local global memory mechanism

Publications (2)

Publication Number Publication Date
CN113269021A CN113269021A (en) 2021-08-17
CN113269021B true CN113269021B (en) 2024-03-01

Family

ID=77228336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110293554.5A Active CN113269021B (en) 2021-03-18 2021-03-18 Non-supervision video target segmentation method based on local global memory mechanism

Country Status (1)

Country Link
CN (1) CN113269021B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9471925B2 (en) * 2005-09-14 2016-10-18 Millennial Media Llc Increasing mobile interactivity

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection

Also Published As

Publication number Publication date
CN113269021A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
Zhou et al. SSDA-YOLO: Semi-supervised domain adaptive YOLO for cross-domain object detection
CN110826698A (en) Method for embedding and representing crowd moving mode through context-dependent graph
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN112927266B (en) Weak supervision time domain action positioning method and system based on uncertainty guide training
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
Zhou et al. Learning with annotation of various degrees
Šarić et al. Single level feature-to-feature forecasting with deformable convolutions
CN115131565B (en) Histological image segmentation model based on semi-supervised learning
CN114692732A (en) Method, system, device and storage medium for updating online label
Cheng et al. Visual tracking via auto-encoder pair correlation filter
Zhai et al. FPANet: feature pyramid attention network for crowd counting
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Zhu et al. Multiscale temporal network for continuous sign language recognition
Cao et al. A dual attention model based on probabilistically mask for 3D human motion prediction
Zhang et al. Cross-domain attention network for unsupervised domain adaptation crowd counting
CN113269021B (en) Non-supervision video target segmentation method based on local global memory mechanism
CN116148864A (en) Radar echo extrapolation method based on DyConvGRU and Unet prediction refinement structure
CN117036760A (en) Multi-view clustering model implementation method based on graph comparison learning
Wang et al. Temporal consistent portrait video segmentation
CN115375732A (en) Unsupervised target tracking method and system based on module migration
CN113920170B (en) Pedestrian track prediction method, system and storage medium combining scene context and pedestrian social relationship
CN114399901B (en) Method and equipment for controlling traffic system
Fu et al. Relay knowledge distillation for efficiently boosting the performance of shallow networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant