CN113269021B - Unsupervised video object segmentation method based on a local-global memory mechanism - Google Patents

Unsupervised video object segmentation method based on a local-global memory mechanism

Info

Publication number
CN113269021B
CN113269021B
Authority
CN
China
Prior art keywords
global
memory
video
global memory
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110293554.5A
Other languages
Chinese (zh)
Other versions
CN113269021A (en)
Inventor
段立娟
恩擎
王文健
张文博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110293554.5A priority Critical patent/CN113269021B/en
Publication of CN113269021A publication Critical patent/CN113269021A/en
Application granted granted Critical
Publication of CN113269021B publication Critical patent/CN113269021B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised video object segmentation method based on a local-global memory mechanism, belonging to the technical field of feature learning and video object segmentation. The method first extracts the embedded features of a pair of frames from the same video. It then selects global memory candidate frames in the video and extracts the global memory candidate features. Each global memory candidate feature serves as a node of a graph convolutional network, which enhances the global memory feature expression. Mutual attention information between the pair of frames is extracted by a local memory module, with the two frames alternately playing the target and search roles in the attention mechanism so that they enhance each other. Finally, a predicted object mask is obtained through a decoder, the loss is computed with the cross-entropy loss, and the whole model is updated to obtain the final segmentation network. By considering the local and the global memory mechanism simultaneously, the invention obtains reliable short-term and long-term inter-frame correlation information at the same time, thereby realizing unsupervised video object segmentation.

Description

Unsupervised video object segmentation method based on a local-global memory mechanism
Technical Field
The invention relates to the fields of deep learning and weakly-supervised video object segmentation, and in particular to a feature expression method for unsupervised video object segmentation that can obtain more accurate segmentation results on video object segmentation datasets.
Background
With the development of visual big data technology, video has become an important medium of information transmission; the information it carries spans both the spatial and the temporal dimension. How to obtain valuable scene object information from this spatio-temporal carrier has become a major issue in the development of computer vision. Existing video object analysis tasks bring convenience and attention to society, but also pose certain challenges. For example, how can foreground objects be segmented with limited category supervision and without online specification of the objects to be segmented, so that the result can be applied to video scene monitoring and parsing tasks? Extracting feature expressions that consider both global and local spatio-temporal cues to enhance the video representation is one of the key ideas for such analysis. Unsupervised video object segmentation, which lacks online guidance information, has been widely studied and has advanced considerably in recent years. Many research institutions, companies and universities are dedicated to solving this problem, which indicates that the task has become a mainstream research topic. The unsupervised video object segmentation task has high application value: in security monitoring it can assist in screening key objects; in traffic hub monitoring it can help focus on significantly moving vehicles; and in autonomous driving it can assist the driving system.
Humans have an excellent visual memory: they can simultaneously memorize the gist of a scene and the details of an image. Besides condensing visual memory into symbolic, memorable representations, humans also discard large amounts of redundant visual memory; in this process, different visual memories are treated unequally. Inspired by cognitive memory models, in which long-term memory can store a great deal of object detail, it is reasonable to combine global memory and local memory in a unified neural network model.
Unsupervised video object segmentation aims to confirm that an object exists in a video sequence and to obtain its corresponding segmentation mask when segmentation labels are available only for the training set. It is one of the most important video tasks. Its greatest challenges are the rapid motion and motion blur that occur in videos, as well as the appearance inconsistencies that exist within a video sequence. This method mainly addresses unsupervised video segmentation under the guidance of training-set segmentation labels only, without the first-frame annotation of the test set. Some related work leverages the temporal links between consecutive frames to accomplish this task. These methods typically use optical flow to build short-term correlation information and recurrent neural networks to build long-term correlation information. Optical flow is usually trained offline on synthetic datasets and can become unreliable when facing the complex motion of real data, while recurrent neural networks are difficult to optimize when modelling the correlation of an entire video sequence. In addition, some related work proposes solving the problem with a Siamese (twin) network that establishes dense correlations between pairs of video frames. Although this class of methods can achieve good results, it lacks global guidance information and degrades when facing large appearance changes.
Based on the above analysis, two observations about real scenes motivate the proposed method: 1) the main objects in a video are constantly present throughout the video sequence; the common object information shared by different frames of the same video gives the model guidance for completing the segmentation, which also prompts us to treat the problem as a mutual-segmentation problem; 2) to understand video information, humans wake up the memories that are most relevant to the current semantics, segment the current video according to prior knowledge, and extract the memory information related to the current frame from historical segmentation information to obtain global guidance. Therefore, in the object segmentation of a video sequence, long-term dependencies can be established for local memory by referencing the common primary-object correlation between different frames of the same video sequence and the current frame. Furthermore, for global memory, the historical memory objects can be expressed by a spatio-temporal graph model in which each node is represented by a historical memory feature. The model proposed by this method is thus inspired by the way biological perception senses its surrounding environment.
Disclosure of Invention
The invention aims to provide an unsupervised video object segmentation method based on a local-global memory mechanism, addressing the defects that existing temporal information is unreliable and hard to converge and that Siamese networks lack global guidance information. The invention computes the correlation of the features of different frames of the same video as local memory information, and stores and selects the features of existing segmentation results as global memory information through a graph convolutional network, so as to emphasize the feature information of the current frame and let the model learn features of different granularities. The local memory module computes the correlation between a pair of frames to obtain local memory information; the global memory module builds a graph neural network from the historical segmentation information of the current video sequence and obtains global memory information by updating the graph neural network. Finally, the current-frame information is enhanced with the local and the global memory simultaneously to obtain the final segmentation result. Compared with related work, the method considers local and global memory mechanisms at the same time and obtains reliable short-term and long-term inter-frame correlation information simultaneously, thereby realizing unsupervised video object segmentation.
The main idea of the method is as follows: first, a pair of video frames is randomly selected from the same video sequence and input into an encoding network to obtain their corresponding embedded features. Then the global candidate memory samples are written into the global memory module, global memory features are extracted by a feature extraction network to form an external global memory table, and the table is input into the constructed graph neural network to obtain the enhanced external global memory features. Globally enhanced features are generated from the pair of embedded features and the external global memory features; these are then input into the local memory module to obtain features enhanced by both global and local memory, which are input into a decoder to obtain the final predicted segmentation result. The cross-entropy loss is used to compute the loss and update the whole model, yielding the final segmentation network.
According to this main idea, the invention comprises a training stage and a testing stage of the unsupervised video object segmentation network.
The training stage comprises the following steps:
Step 1: constructing a dataset
Construct a dataset: take video frames as input, take the segmentation masks corresponding to the video frames as training targets, and build the object segmentation image set corresponding to the training videos;
Step 2: extracting, with an encoder, the embedded feature X_a of the current video frame I_a and the embedded feature X_b of a random frame I_b of the same video;
Step 3: extracting global memory foreground features
Randomly select N frames of the current video as global candidate memories, extract a global memory mask for each and multiply it with the corresponding global candidate memory to obtain the global candidate memory foregrounds; input the global candidate memory foregrounds into a feature extractor to obtain the global candidate memory foreground features;
Step 4: acquiring the global memory graph neural network
Use the global candidate memory foreground features obtained in step 3 as the nodes of the global memory graph neural network and the similarity between nodes as the connection weights between nodes, thereby obtaining the global memory graph neural network;
Step 5: updating the global memory features
Perform feature propagation on the global memory graph neural network constructed in step 4 through a graph convolutional network to obtain the updated global memory features Z_a;
Step 6: reading the global memory features
Perform non-local attention operations between the updated node features Z_a obtained in step 5 and the embedded features X_a, X_b obtained in step 2, respectively, to obtain the global memory enhancement features X_a^gl and X_b^gl;
Step 7: mutually enhancing the global memory enhancement features X_a^gl and X_b^gl of step 6 to obtain the locally-globally enhanced features X_a^lg and X_b^lg;
Step 8: decoding to generate prediction masks
Input the locally-globally enhanced features X_a^lg and X_b^lg obtained in step 7 into a decoder for decoding to obtain the prediction masks Ŷ_a and Ŷ_b;
Step 9: calculating the segmentation loss and updating the segmentation network parameters
Calculate the loss with the prediction masks of step 8 and the segmentation targets, and back-propagate to update the segmentation network weights until convergence, obtaining the unsupervised video object segmentation network;
Testing stage of the unsupervised video object segmentation network
Step 10: outputting the unsupervised video object segmentation result
Take the video frame to be analyzed and any other frame of the same video as input, and repeat steps 2-8 to obtain the final object segmentation result.
Compared with the prior art, the invention has the following obvious advantages and beneficial effects: the invention provides an unsupervised video object segmentation method based on a local-global memory mechanism. The method is fully inspired by the human memory mechanism: it computes the correlation of the features of different frames of the same video as local memory information, and stores and selects the features of existing segmentation results as global memory information through a graph convolutional network, so as to emphasize the feature information of the current frame and let the model learn features of different granularities. Starting from the human ability to memorize gist and detail simultaneously, the method models the memory mechanism from macroscopic to microscopic perspectives, thereby completing the unsupervised video segmentation task.
Drawings
FIG. 1 is a general flow chart of a method according to the present invention;
FIG. 2 is a diagram of the overall architecture of the algorithm according to the present invention;
Detailed Description
To make the objects, technical solutions and advantages of the invention clearer, the invention is further described in detail below with reference to specific examples and the accompanying drawings. The described embodiments are only intended to facilitate an understanding of the invention and are not limiting. FIG. 1 is a flow chart of the method of the present invention; as shown in FIG. 1, the method comprises the following steps:
training phase
Step 1: constructing a dataset
The database used in the implementation of the method is the public video object segmentation benchmark DAVIS-2016. DAVIS-2016 consists of 50 high-quality video sequences with 3455 densely annotated (masked) video frames; 30 sequences serve as the training set and 20 as the test set. The training set of video frames and their corresponding object segmentation labels is denoted {(I_t, Y_t)}, t = 1, …, T, where I_t denotes the RGB image of a video frame and Y_t denotes the segmentation label corresponding to I_t.
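For illustration only, a minimal PyTorch-style dataset sketch of the frame pairing used in steps 1 and 2 is given below; the DAVIS-style directory layout, the file extensions and the single shared transform are assumptions, not part of the original disclosure.

```python
import os
import random
from PIL import Image
from torch.utils.data import Dataset

class PairedVideoDataset(Dataset):
    """Yields (I_a, Y_a, I_b, Y_b): a frame, its mask, and a random frame/mask of the same video."""
    def __init__(self, frame_root, mask_root, transform=None):
        self.by_video = {}                              # video name -> list of frame file names
        self.samples = []                               # flat list of (video, frame file)
        for video in sorted(os.listdir(frame_root)):
            frames = sorted(os.listdir(os.path.join(frame_root, video)))
            self.by_video[video] = frames
            self.samples += [(video, f) for f in frames]
        self.frame_root, self.mask_root, self.transform = frame_root, mask_root, transform

    def _load(self, video, frame_file):
        img = Image.open(os.path.join(self.frame_root, video, frame_file)).convert("RGB")
        msk = Image.open(os.path.join(self.mask_root, video,
                                      frame_file.replace(".jpg", ".png")))
        if self.transform:                              # in practice the mask needs its own transform
            img, msk = self.transform(img), self.transform(msk)
        return img, msk

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video, f_a = self.samples[idx]
        f_b = random.choice(self.by_video[video])       # random frame I_b of the same video
        return (*self._load(video, f_a), *self._load(video, f_b))
```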
Step 2: extracting embedded features of a current video frame and a random frame of the same video
A pair of frames of the same video from the training set constructed in step 1 is denoted I_a and I_b. The encoder (Encoder) is used to extract the embedded features X_a and X_b corresponding to the two frames:
X_a = Encoder(I_a)
X_b = Encoder(I_b)
where I_a, I_b ∈ R^(3×H×W) and X_a, X_b ∈ R^(c×h×w); H and W denote the height and width of the input image, c denotes the number of channels of the embedded feature, and h and w denote the height and width of the embedded feature. The choice of encoder in the present invention is not limited: any convolutional neural network structure can be adopted, and Table 1 is only one implementation choice.
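As an illustration of the encoder of step 2, the sketch below uses a ResNet-50 backbone truncated before pooling; since the disclosure leaves the encoder unrestricted (and Table 1 is not reproduced here), the backbone choice and the input size are assumptions.

```python
import torch
import torchvision

class Encoder(torch.nn.Module):
    """Maps an RGB frame (3, H, W) to an embedded feature map (c, h, w)."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)   # any CNN backbone would do
        self.features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc

    def forward(self, frame):                  # frame: (B, 3, H, W)
        return self.features(frame)            # (B, 2048, H/32, W/32)

encoder = Encoder()
I_a = torch.randn(1, 3, 384, 384)              # current frame (size is an arbitrary choice)
I_b = torch.randn(1, 3, 384, 384)              # random frame of the same video
X_a, X_b = encoder(I_a), encoder(I_b)           # embedded features X_a and X_b
```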
Step 3: extracting global candidate memory foreground features
Select N video frames of the same video sequence as the current video frame I_a, denoted E_a = {I_a^n}, n = 1, …, N, where E_a represents the global candidate memories corresponding to the current video frame I_a. After the embedded feature of each global candidate memory is extracted with the encoder (Encoder), it is input into the global memory decoder D_gl to obtain a global memory mask, which is multiplied element-wise (dot multiplication) with the corresponding global candidate memory to obtain that global candidate memory's foreground. In this implementation the choice of the global memory decoder is not limited: any convolutional neural network structure can be adopted, and Table 2 is only one implementation choice. The N global candidate memory foregrounds are thus obtained. Each element of this set is input into a feature extractor to extract features, yielding the global candidate memory foreground features V_a = {V_a^1, …, V_a^N}, where V_a^n denotes the feature of the foreground of the n-th global candidate memory corresponding to the current video frame I_a; the feature extractor adopts the same network structure as the encoder.
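A hedged sketch of step 3: each candidate frame is masked by the global memory decoder's prediction and re-encoded to obtain its foreground feature. The sigmoid, the bilinear resize and the decoder output resolution are assumptions; encoder, global_decoder and feature_extractor stand for the networks described above.

```python
import torch
import torch.nn.functional as F

def candidate_foreground_features(candidates, encoder, global_decoder, feature_extractor):
    """candidates: (N, 3, H, W) global candidate memory frames of the current video.
    Returns the (N, c, h, w) global candidate memory foreground features."""
    feats = encoder(candidates)                          # embedded features of the candidates
    masks = torch.sigmoid(global_decoder(feats))         # global memory masks, (N, 1, h', w')
    masks = F.interpolate(masks, size=candidates.shape[-2:],
                          mode="bilinear", align_corners=False)
    foreground = candidates * masks                      # element-wise (dot) multiplication
    return feature_extractor(foreground)                 # same structure as the encoder
```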
Step 4: acquiring a global memory map neural network, wherein the global memory map neural network is specifically represented by a node V and a regularized global candidate memory foreground feature similarity matrix;
global memory graph G (V, E), defining nodes as V and defining edges as E; in this figure, each pair of global candidate memory foreground features are connected by an edge, the larger the value of one edge to the other indicates the more similar the meaning of the two features. Current video frame I a The set of foreground features of N global candidate memories is defined asThe similarity between each node is calculated as:
wherein i and j represent V a The ith and jth elements of (a) are included; after obtaining the foreground feature similarity matrix of the global candidate memory, carrying out regularization operation on each row of the foreground feature similarity matrix so that the feature sum connected to the ith element is 1, and specifically carrying out regularization operation by using a softmax method:
wherein,the adjacency matrix, which is regarded as expressing the foreground features of each global candidate memory, can express the relationship between each global candidate memory.
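Step 4 reduces to a pairwise similarity followed by a row-wise softmax; a short sketch, assuming inner-product similarity on flattened node features:

```python
import torch

def build_adjacency(V_a):
    """V_a: (N, d) node features (one per global candidate memory foreground).
    Returns the row-normalized adjacency matrix of shape (N, N)."""
    E_a = V_a @ V_a.t()                  # pairwise inner-product similarity between nodes
    return torch.softmax(E_a, dim=1)     # softmax per row: the weights attached to node i sum to 1

# example: N = 5 candidates with d = 2048 dimensional features
A_tilde = build_adjacency(torch.randn(5, 2048))
```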
Step 5: global memory information update
The nodes V_a and the adjacency matrix Ã_a of the global memory graph neural network are input into a graph convolutional network. The graph convolutional network is formed by stacking multiple layers of nonlinear operations, where the output of one layer serves as the input of the next; each nonlinear operation is specifically
Z_n = ReLU(Ã_a V_n W_n),
where Ã_a denotes the adjacency matrix, W_n denotes the weight of the n-th layer, and ReLU denotes the nonlinear operation. V_n denotes the input of the n-th layer, i.e. the output of the (n−1)-th layer; the input of the first layer is the node set V_a, whose feature dimension is N × d. Z_n denotes the output of the n-th layer, and the last-layer output is denoted Z_a. In this embodiment, two nonlinear layers are stacked.
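A minimal two-layer graph convolution matching the description of step 5, where each layer computes ReLU(Ã V_n W_n); treating the node features as d-dimensional vectors (for instance flattened or pooled foreground features) and the hidden width are assumptions.

```python
import torch

class MemoryGCN(torch.nn.Module):
    """Two stacked layers of Z_{n+1} = ReLU(A_tilde @ V_n @ W_n) over the global memory graph."""
    def __init__(self, d, hidden=512):
        super().__init__()
        self.w1 = torch.nn.Linear(d, hidden, bias=False)   # W_1
        self.w2 = torch.nn.Linear(hidden, d, bias=False)   # W_2

    def forward(self, V_a, A_tilde):             # V_a: (N, d), A_tilde: (N, N)
        Z1 = torch.relu(A_tilde @ self.w1(V_a))  # first propagation layer
        Z_a = torch.relu(A_tilde @ self.w2(Z1))  # second layer: updated global memory features
        return Z_a                                # (N, d)
```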
Step 6: reading global memory features
For the current frame I_a, the feature obtained from the encoding layer is denoted X_a. First, the similarity between X_a and the global candidate memory features Z_a obtained in step 5 is calculated in the embedding space:
F(X_a, Z_a) = θ_gl(sub(X_a)) · φ_gl(sub(Z_a))^T,
where θ_gl and φ_gl denote global memory transfer functions and sub denotes a downsampling method used to reduce computational overhead. The similarity matrix jointly learns the states of X_a and Z_a and the relationship between them. Each row of F(X_a, Z_a) is then regularized with the softmax method. Because the high-dimensional embedded features produce large values in the matrix multiplication, the output of the softmax would otherwise lie in a small-gradient region; to overcome this drawback, F(X_a, Z_a) is first scaled to 1/√c of its original value. Finally, the regularized similarity matrix is used to read the memory content g_gl(Z_a) and, through the transfer function ψ_gl, enhance the embedded feature X_a, obtaining the global memory enhancement feature X_a^gl; ψ_gl and g_gl are likewise global memory transfer functions. The same feature enhancement is also performed on X_b to generate X_b^gl.
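The reads of steps 6 and 7 are both non-local attention operations; the sketch below is one way to realize them, with 1×1-convolution transfer functions, average pooling as the 'sub' downsampling, scaling by 1/√c, and a residual connection back to the query. The residual form, the reshaping of the graph-updated memory Z_a back into feature maps, and the concatenation of the N memory maps into one bank are assumptions, not details taken from the disclosure.

```python
import torch
import torch.nn as nn

class MemoryRead(nn.Module):
    """Non-local attention read used in steps 6 and 7: the query feature map
    attends to a memory feature map and is enhanced by what it reads."""
    def __init__(self, c, stride=2):
        super().__init__()
        self.theta = nn.Conv2d(c, c, 1)    # transfer function theta (query)
        self.phi   = nn.Conv2d(c, c, 1)    # transfer function phi (key)
        self.g     = nn.Conv2d(c, c, 1)    # transfer function g (value)
        self.psi   = nn.Conv2d(c, c, 1)    # transfer function psi (output)
        self.sub   = nn.AvgPool2d(stride)  # 'sub': downsampling to cut the cost of the matrix product

    def forward(self, query, memory):
        # query: (B, c, h, w); memory: (B, c, hm, wm)
        B, c, h, w = query.shape
        q = self.theta(query).flatten(2).transpose(1, 2)         # (B, h*w, c)
        k = self.phi(self.sub(memory)).flatten(2)                # (B, c, m)
        v = self.g(self.sub(memory)).flatten(2).transpose(1, 2)  # (B, m, c)
        attn = torch.softmax((q @ k) / c ** 0.5, dim=-1)         # scaled similarity, softmax per row
        read = (attn @ v).transpose(1, 2).reshape(B, c, h, w)    # content read from memory
        return query + self.psi(read)                            # residual enhancement (assumption)

# Usage sketch for the global read of step 6: the N global memory feature maps Z_a
# of shape (N, c, hm, wm) can be concatenated along the width axis into one bank.
# Z_bank = torch.cat(list(Z_a), dim=-1).unsqueeze(0)   # (1, c, hm, N*wm)
# X_a_gl = MemoryRead(c=2048)(X_a, Z_bank)
```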
Step 7: acquiring and reading local memory features
The global memory enhancement features obtained in step 6 can only enhance the features at a macroscopic level and lack microscopic detail, so local memory features also need to be acquired.
For X_a^gl, first calculate its similarity matrix with X_b^gl:
F(X_a^gl, X_b^gl) = θ_lo(sub(X_a^gl)) · φ_lo(sub(X_b^gl))^T,
where θ_lo and φ_lo denote local memory transfer functions and sub denotes the downsampling method used to reduce computational overhead. Each row of F(X_a^gl, X_b^gl) is then regularized with the softmax method. As in step 6, because the high-dimensional embedded features produce large values in the matrix multiplication and the softmax output would lie in a small-gradient region, F(X_a^gl, X_b^gl) is first scaled to 1/√c of its original value. Finally, the regularized similarity matrix is used to read the content g_lo(X_b^gl) and, through the transfer function ψ_lo, enhance the globally enhanced feature X_a^gl, obtaining the locally-globally enhanced feature X_a^lg; ψ_lo and g_lo denote the local memory transfer functions. The same enhancement with the roles of the two frames exchanged generates X_b^lg.
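Assuming the MemoryRead sketch given under step 6, the mutual enhancement of step 7 is simply the same read with the two globally enhanced features alternately playing the target and search roles; the dummy feature sizes below are placeholders.

```python
import torch

# Reusing the MemoryRead sketch from step 6 (a separate instance stands for theta_lo, phi_lo, psi_lo, g_lo)
local_read = MemoryRead(c=2048)
X_a_gl = torch.randn(1, 2048, 12, 12)   # globally enhanced feature of I_a (output of step 6)
X_b_gl = torch.randn(1, 2048, 12, 12)   # globally enhanced feature of I_b
X_a_lg = local_read(X_a_gl, X_b_gl)     # I_a enhanced by I_b (I_b acts as the memory/"target")
X_b_lg = local_read(X_b_gl, X_a_gl)     # roles swapped: I_b enhanced by I_a
```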
Step 8: decoding to generate a prediction mask
The locally-globally enhanced features X_a^lg and X_b^lg are then input into the memory decoder D_mem to generate the prediction masks:
Ŷ_a = σ(D_mem(X_a^lg)), Ŷ_b = σ(D_mem(X_b^lg)),
where σ denotes the sigmoid function, and Ŷ_a and Ŷ_b denote the predicted mask results of I_a and I_b, respectively. In practice, the memory decoder structure is shown in Table 2.
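A minimal decoder sketch for step 8; the layer widths and the 32× bilinear upsampling back to input resolution are assumptions, since the Table 2 structure is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryDecoder(nn.Module):
    """Decodes a locally-globally enhanced feature map into a full-resolution mask."""
    def __init__(self, c=2048, upscale=32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(c, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1))                       # single-channel mask logit
        self.upscale = upscale

    def forward(self, feat):                           # feat: (B, c, h, w)
        logit = self.head(feat)
        logit = F.interpolate(logit, scale_factor=self.upscale,
                              mode="bilinear", align_corners=False)
        return torch.sigmoid(logit)                    # sigma: prediction mask in [0, 1]
```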
Step 9: calculating segmentation loss and updating segmentation network parameters
The model is trained with the binary cross-entropy loss function (summed over the spatial positions of the masks and over the training pairs):
L = −(1/T) Σ_a [ Y_a log Ŷ_a + (1 − Y_a) log(1 − Ŷ_a) + Y_b log Ŷ_b + (1 − Y_b) log(1 − Ŷ_b) ],
where Y_a ∈ {0,1}^(W×H) and Y_b ∈ {0,1}^(W×H) denote the annotation information corresponding to video frames I_a and I_b, and Ŷ_a and Ŷ_b denote the prediction results obtained for I_a and I_b with the proposed video segmentation network. T denotes the total number of samples in the dataset; a denotes a video frame index, and b = idx(cat(a)) denotes the index of a video frame of the same category (i.e. the same video sequence) as a.
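One training step of step 9 might look as follows, assuming a wrapper model whose forward pass takes the frame pair and returns both predicted masks; the optimizer is left to the caller.

```python
import torch.nn.functional as F

def training_step(model, optimizer, I_a, Y_a, I_b, Y_b):
    """One update with the binary cross-entropy loss of step 9 on both predicted masks."""
    pred_a, pred_b = model(I_a, I_b)                        # predicted masks in [0, 1]
    loss = F.binary_cross_entropy(pred_a, Y_a.float()) \
         + F.binary_cross_entropy(pred_b, Y_b.float())
    optimizer.zero_grad()
    loss.backward()                                         # back-propagate the segmentation loss
    optimizer.step()                                        # update the segmentation network weights
    return loss.item()
```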
Testing stage
Step 10: outputting the non-supervision video target segmentation result
Take the video frame to be analyzed and any other frame of the same video as input, and repeat steps 2-8 to obtain the final object segmentation result.
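A test-time sketch of step 10, under the same assumption of a wrapper model that internally repeats steps 2-8 for the given pair; the 0.5 threshold for binarizing the mask is an assumption.

```python
import torch

@torch.no_grad()
def segment(model, frame, reference_frame, threshold=0.5):
    """Step 10: segment the frame to analyze using any other frame of the same video."""
    model.eval()
    pred, _ = model(frame, reference_frame)   # the wrapper repeats steps 2-8 internally
    return (pred > threshold).float()          # binarized object segmentation mask
```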
As can be seen from Table 3, the method proposed by the invention achieves a better segmentation effect on the video object segmentation dataset than recent state-of-the-art methods.
TABLE 1 (implementation structure of the encoder; contents not reproduced)
TABLE 2 (implementation structure of the decoder; contents not reproduced)
TABLE 3 (segmentation results on the video object segmentation dataset; contents not reproduced)

Claims (7)

1. An unsupervised video object segmentation method based on a local-global memory mechanism, characterized by comprising a training stage and a testing stage of an unsupervised video object segmentation network,
the training stage comprising the following steps:
Step 1: constructing a dataset
Construct a dataset: take video frames as input, take the segmentation masks corresponding to the video frames as training targets, and build the object segmentation image set corresponding to the training videos;
Step 2: extracting, with an encoder, the embedded feature X_a of the current video frame I_a and the embedded feature X_b of a random frame I_b of the same video;
Step 3: extracting global memory foreground features
Randomly select N frames of the current video as global candidate memories, extract a global memory mask for each and multiply it with the corresponding global candidate memory to obtain the global candidate memory foregrounds; input the global candidate memory foregrounds into a feature extractor to obtain the global candidate memory foreground features;
Step 4: acquiring the global memory graph neural network
Use the global candidate memory foreground features obtained in step 3 as the nodes of the global memory graph neural network and the similarity between nodes as the connection weights between nodes, thereby obtaining the global memory graph neural network;
Step 5: updating the global memory features
Perform feature propagation on the global memory graph neural network constructed in step 4 through a graph convolutional network to obtain the updated global memory features Z_a;
Step 6: reading the global memory features
Perform non-local attention operations between the updated node features Z_a obtained in step 5 and the embedded features X_a, X_b obtained in step 2, respectively, to obtain the global memory enhancement features X_a^gl and X_b^gl;
Step 7: mutually enhancing the global memory enhancement features X_a^gl and X_b^gl of step 6 to obtain the locally-globally enhanced features X_a^lg and X_b^lg;
Step 8: decoding to generate prediction masks
Input the locally-globally enhanced features X_a^lg and X_b^lg obtained in step 7 into a decoder for decoding to obtain the prediction masks Ŷ_a and Ŷ_b;
Step 9: calculating the segmentation loss and updating the segmentation network parameters
Calculate the loss with the prediction masks of step 8 and the segmentation targets, and back-propagate to update the segmentation network weights until convergence, obtaining the unsupervised video object segmentation network;
the testing stage of the unsupervised video object segmentation network:
Step 10: outputting the unsupervised video object segmentation result
Take the video frame to be analyzed and any other frame of the same video as input, and repeat steps 2-8 to obtain the final object segmentation result.
2. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein step 3 is specifically as follows:
select N video frames of the same video sequence as the current video frame I_a, denoted E_a = {I_a^n}, n = 1, …, N,
where E_a represents the global candidate memories corresponding to the current video frame I_a; after the embedded feature of each global candidate memory is extracted with the encoder, it is input into the global memory decoder D_gl to obtain a global memory mask, which is multiplied element-wise with the corresponding global candidate memory to obtain that global candidate memory's foreground;
the N global candidate memory foregrounds are thus obtained; each element of this set is input into a feature extractor to extract features, yielding the global candidate memory foreground features V_a = {V_a^1, …, V_a^N}, where V_a^n denotes the feature of the foreground of the n-th global candidate memory corresponding to the current video frame I_a; the feature extractor adopts the same network structure as the encoder.
3. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein step 4 is specifically as follows:
the global memory graph neural network is specifically represented by the node set V and the regularized similarity matrix of the global candidate memory foreground features;
the set of foreground features of the N global candidate memories of the current video frame I_a is defined as V_a = {V_a^1, …, V_a^N}, and the similarity between each pair of nodes is calculated as
E_a(i, j) = V_a^i · (V_a^j)^T,
where i and j index the i-th and j-th elements of V_a; after the foreground feature similarity matrix of the global candidate memories is obtained, each of its rows is regularized so that the weights connected to the i-th element sum to 1, the regularization being performed with the softmax method:
Ã_a(i, j) = exp(E_a(i, j)) / Σ_k exp(E_a(i, k)),
where Ã_a represents the relationship between the global candidate memories.
4. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein step 5 is specifically as follows:
the nodes V_a and the adjacency matrix Ã_a of the global memory graph neural network are input into a graph convolutional network, which is formed by stacking multiple layers of nonlinear operations in which the output of one layer serves as the input of the next; each nonlinear operation is specifically
Z_n = ReLU(Ã_a V_n W_n),
where Ã_a denotes the adjacency matrix, W_n denotes the weight of the n-th layer, ReLU denotes the nonlinear operation, V_n denotes the input of the n-th layer, i.e. the output of the (n−1)-th layer, the input of the first layer is the node set V_a, Z_n denotes the output of the n-th layer, and the last-layer output is denoted Z_a.
5. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein the global memory enhancement feature X_a^gl in step 6 is calculated as follows:
first, the similarity between X_a and the Z_a obtained in step 5 is calculated in the embedding space as
F(X_a, Z_a) = θ_gl(sub(X_a)) · φ_gl(sub(Z_a))^T,
where θ_gl and φ_gl denote global memory transfer functions and sub denotes a downsampling method;
then each row of F(X_a, Z_a) is regularized with the softmax method after F(X_a, Z_a) has been scaled to 1/√c of its original value, where c is the channel number of X_a;
finally, the regularized similarity matrix is used to read g_gl(Z_a) and, through the transfer function ψ_gl, enhance the embedded feature X_a, obtaining the global memory enhancement feature X_a^gl, where ψ_gl and g_gl are likewise global memory transfer functions;
the calculation of X_b^gl is the same, except that X_a in the above calculation is replaced by X_b.
6. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein the locally-globally enhanced feature X_a^lg of step 7 is calculated as follows:
first, the similarity matrix between X_a^gl and X_b^gl is calculated as
F(X_a^gl, X_b^gl) = θ_lo(sub(X_a^gl)) · φ_lo(sub(X_b^gl))^T,
where θ_lo and φ_lo denote local memory transfer functions and sub denotes a downsampling method;
then each row of F(X_a^gl, X_b^gl) is regularized with the softmax method after F(X_a^gl, X_b^gl) has been scaled to 1/√c of its original value, where c is the channel number of X_a^gl;
finally, the regularized similarity matrix is used to read g_lo(X_b^gl) and, through the transfer function ψ_lo, enhance the globally enhanced feature X_a^gl, obtaining X_a^lg, where ψ_lo and g_lo denote the local memory transfer functions;
the locally-globally enhanced feature X_b^lg is calculated in the same way with the roles of the two frames exchanged.
7. The unsupervised video object segmentation method based on a local-global memory mechanism as set forth in claim 1, wherein in step 9 the network is trained with a binary cross-entropy loss function:
L = −(1/T) Σ_a [ Y_a log Ŷ_a + (1 − Y_a) log(1 − Ŷ_a) + Y_b log Ŷ_b + (1 − Y_b) log(1 − Ŷ_b) ],
where Y_a ∈ {0,1}^(W×H) and Y_b ∈ {0,1}^(W×H) denote the annotation information corresponding to video frames I_a and I_b, T denotes the total number of samples in the dataset, a denotes a video frame index, and b = idx(cat(a)) denotes the index of a video frame of the same category (i.e. the same video sequence) as a.
CN202110293554.5A 2021-03-18 2021-03-18 Non-supervision video target segmentation method based on local global memory mechanism Active CN113269021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110293554.5A CN113269021B (en) 2021-03-18 2021-03-18 Non-supervision video target segmentation method based on local global memory mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110293554.5A CN113269021B (en) 2021-03-18 2021-03-18 Non-supervision video target segmentation method based on local global memory mechanism

Publications (2)

Publication Number Publication Date
CN113269021A CN113269021A (en) 2021-08-17
CN113269021B true CN113269021B (en) 2024-03-01

Family

ID=77228336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110293554.5A Active CN113269021B (en) 2021-03-18 2021-03-18 Non-supervision video target segmentation method based on local global memory mechanism

Country Status (1)

Country Link
CN (1) CN113269021B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9471925B2 (en) * 2005-09-14 2016-10-18 Millennial Media Llc Increasing mobile interactivity

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection

Also Published As

Publication number Publication date
CN113269021A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
Zhou et al. SSDA-YOLO: Semi-supervised domain adaptive YOLO for cross-domain object detection
CN110826698A (en) Method for embedding and representing crowd moving mode through context-dependent graph
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN112927266B (en) Weak supervision time domain action positioning method and system based on uncertainty guide training
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
Zhou et al. Learning with annotation of various degrees
Šarić et al. Single level feature-to-feature forecasting with deformable convolutions
CN115131565B (en) Histological image segmentation model based on semi-supervised learning
CN114692732A (en) Method, system, device and storage medium for updating online label
Cheng et al. Visual tracking via auto-encoder pair correlation filter
Zhai et al. FPANet: feature pyramid attention network for crowd counting
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Zhu et al. Multiscale temporal network for continuous sign language recognition
Cao et al. A dual attention model based on probabilistically mask for 3D human motion prediction
Zhang et al. Cross-domain attention network for unsupervised domain adaptation crowd counting
CN113269021B (en) Non-supervision video target segmentation method based on local global memory mechanism
CN116148864A (en) Radar echo extrapolation method based on DyConvGRU and Unet prediction refinement structure
CN117036760A (en) Multi-view clustering model implementation method based on graph comparison learning
Wang et al. Temporal consistent portrait video segmentation
CN115375732A (en) Unsupervised target tracking method and system based on module migration
CN113920170B (en) Pedestrian track prediction method, system and storage medium combining scene context and pedestrian social relationship
CN114399901B (en) Method and equipment for controlling traffic system
Fu et al. Relay knowledge distillation for efficiently boosting the performance of shallow networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant